Stability-AI / stable-audio-tools

Generative models for conditional audio generation
MIT License

Continuing on an A100 does not work in Colab #114

Open Taikakim opened 3 months ago

Taikakim commented 3 months ago

Hi, if I understood correctly, --ckpt-path is the right way to pass the weights when continuing from the 16GB checkpoints. I tried resuming directly after training the base model for some hours; I only changed the LR, added two sample prompts, and changed the warmup speed. But I got this error:

The size of tensor a (14) must match the size of tensor b (12) at non-singleton dimension 0

The whole cell output:

```
Found 158 files
No module named 'flash_attn'
flash_attn not installed, disabling Flash Attention
/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
wandb: Currently logged in as: . Use wandb login --relogin to force relogin
wandb: wandb version 0.17.3 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.4
wandb: Run data is saved locally in ./wandb/run-20240701_144616-e6per9yr
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run glad-firebrand-12
wandb: ⭐️ View project at
wandb: 🚀 View run at ****
wandb: logging graph, to disable use wandb.watch(log_graph=False)
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A100-SXM4-40GB') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Restoring states from the checkpoint path at /content/drive/MyDrive/StableAudioOpen/output/avp-test/rog673te/checkpoints/epoch=299-step=1200.ckpt
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:345: The dirpath has changed from '/content/drive/MyDrive/StableAudioOpen/output/avp-test/rog673te/checkpoints' to '/content/drive/MyDrive/StableAudioOpen/output/avp-test/e6per9yr/checkpoints', therefore best_model_score, kth_best_model_path, kth_value, last_model_path and best_k_models won't be reloaded. Only best_model_path will be reloaded.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type                             | Params
-------------------------------------------------------------
0 | diffusion     | ConditionedDiffusionModelWrapper | 1.2 B
1 | diffusion_ema | EMA                              | 1.1 B
2 | losses        | MultiLoss                        | 0
-------------------------------------------------------------
1.1 B     Trainable params
1.2 B     Non-trainable params
2.3 B     Total params
9,080.665 Total estimated model params size (MB)

Restored all states from the checkpoint at /content/drive/MyDrive/StableAudioOpen/output/avp-test/rog673te/checkpoints/epoch=299-step=1200.ckpt
Epoch 299:   0% 0/4 [00:00<?, ?it/s]
/usr/local/lib/python3.10/dist-packages/torch/backends/cuda/__init__.py:342: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
  warnings.warn(
Generating demo
Getting conditioning
Generating demo for cfg scale 7
  0% 0/200 [00:00<?, ?it/s]
RuntimeError: The size of tensor a (14) must match the size of tensor b (12) at non-singleton dimension 0
Traceback (most recent call last):
  File "/content/stable-audio-tools/./train.py", line 128, in <module>
    main()
  File "/content/stable-audio-tools/./train.py", line 125, in main
    trainer.fit(training_wrapper, train_dl, ckpt_path=args.ckpt_path if args.ckpt_path else None)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 545, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 581, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 990, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1036, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 259, in advance
    call._call_callback_hooks(trainer, "on_train_batch_end", batch_output, batch, batch_idx)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 208, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/stable-audio-tools/stable_audio_tools/training/diffusion.py", line 559, in on_train_batch_end
    raise e
  File "/content/stable-audio-tools/stable_audio_tools/training/diffusion.py", line 532, in on_train_batch_end
    fakes = sample(model, noise, self.demo_steps, 0, cond_inputs, cfg_scale=cfg_scale, batch_cfg=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/stable-audio-tools/stable_audio_tools/inference/sampling.py", line 62, in sample
    v = model(x, ts * t[i], **extra_args).float()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/stable-audio-tools/stable_audio_tools/models/diffusion.py", line 532, in forward
    return self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/stable-audio-tools/stable_audio_tools/models/dit.py", line 334, in forward
    batch_output = self._forward(
  File "/content/stable-audio-tools/stable_audio_tools/models/dit.py", line 180, in _forward
    global_embed = global_embed + timestep_embed
RuntimeError: The size of tensor a (14) must match the size of tensor b (12) at non-singleton dimension 0
Epoch 299:   0%| | 0/4 [01:14<?, ?it/s]
```

Taikakim commented 3 months ago

OK, I'm getting the same error now even when passing an unwrapped version of that trained model with the --pretrained-ckpt-path option. This is very curious, since the only thing I changed in the model config was the LR.

Taikakim commented 3 months ago

OK, I think I found the reason... when I reverted the sample prompts to what they were in the initial run, training resumed. This is quite unexpected though.
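A cheap guard against this is to compare the demo settings of the config you are about to resume with against the config from the initial run, before launching train.py. A minimal sketch, assuming the demo settings live under training.demo in the model config JSON as in the repo's example configs; both file names are placeholders.

```python
import json

# Placeholder file names: the config used for the initial run and the
# edited config intended for the resumed run.
with open("model_config_initial.json") as f:
    initial_demo = json.load(f)["training"]["demo"]
with open("model_config_resume.json") as f:
    resume_demo = json.load(f)["training"]["demo"]

# Per this thread, the demo prompts must match the initial run when
# resuming via --ckpt-path, so refuse to start if the section changed.
if initial_demo != resume_demo:
    raise SystemExit("demo section changed between runs; revert it before resuming")
print("demo sections match")
```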

piiq commented 2 months ago

I was getting the same errors, and this issue helped me figure out what was wrong.

You can't change the prompts in the demo section from what they were in the initial run. Once I reverted the demo section prompts in the model_config, demo generation and training started successfully.
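For reference, the settings in question sit in the demo block of the model config. A hedged sketch of its shape, loosely following the repo's example configs (all values here are placeholders): the demo_cond list, and in particular how many prompts it holds, should stay as it was in the initial run, and in the example configs num_demos matches the length of demo_cond.

```json
"demo": {
    "demo_every": 2000,
    "demo_steps": 250,
    "num_demos": 4,
    "demo_cond": [
        {"prompt": "a drum loop", "seconds_start": 0, "seconds_total": 30},
        {"prompt": "an ambient pad", "seconds_start": 0, "seconds_total": 30},
        {"prompt": "a plucked synth melody", "seconds_start": 0, "seconds_total": 30},
        {"prompt": "a field recording of rain", "seconds_start": 0, "seconds_total": 30}
    ],
    "demo_cfg_scales": [3, 6, 9]
}
```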