fishaudio / fish-diffusion

An easy to understand TTS / SVS / SVC framework
https://diff.fish.audio
MIT License

CUDA bfloat16 problem #122

Open A-2-H opened 8 months ago

A-2-H commented 8 months ago
2023-10-23 10:49:30,409 WARNING: logs/HiFiSVC doesn't exist yet!
Global seed set to 594461
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: logs/HiFiSVC
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name          | Type                     | Params
-----------------------------------------------------------
0 | generator     | HiFiSinger               | 14.9 M
1 | mpd           | MultiPeriodDiscriminator | 57.5 M
2 | msd           | MultiScaleDiscriminator  | 29.6 M
3 | mel_transform | MelSpectrogram           | 0     
-----------------------------------------------------------
102 M     Trainable params
0         Non-trainable params
102 M     Total params
408.124   Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:442: PossibleUserWarning: The dataloader, val_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Sanity Checking DataLoader 0:   0% 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/content/fish-diffusion/tools/hifisinger/train.py", line 83, in <module>
    trainer.fit(model, train_loader, valid_loader, ckpt_path=args.resume)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1021, in _run_stage
    self._run_sanity_check()
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1050, in _run_sanity_check
    val_loop.run()
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 115, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 376, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_kwargs.values())
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 294, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 391, in validation_step
    with self.precision_plugin.val_step_context():
  File "/content/env/envs/fish_diffusion/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 170, in val_step_context
    with self.forward_context():
  File "/content/env/envs/fish_diffusion/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/amp.py", line 118, in forward_context
    with self.autocast_context_manager():
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/amp.py", line 113, in autocast_context_manager
    return torch.autocast(self.device, dtype=torch.bfloat16 if self.precision == "bf16-mixed" else torch.half)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 234, in __init__
    raise RuntimeError('Current CUDA Device does not support bfloat16. Please switch dtype to float16.')
RuntimeError: Current CUDA Device does not support bfloat16. Please switch dtype to float16.

I'm on Google Colab using a T4 GPU (I tried other GPUs as well) and I still get the same error when I try to train my model.

li-henan commented 2 months ago

Hi friend, I'm running into the same error. Have you solved it? I'd sincerely appreciate any help.

A-2-H commented 2 months ago

> Hi friend, I'm running into the same error. Have you solved it? I'd sincerely appreciate any help.

It's not fixed yet, but I found a workaround. After you set up the environment and clone the Fish-Diffusion repo into your Colab, you have to change a config file, because the GPUs Google Colab provides don't support bfloat16, so the precision has to be changed (as a quick solution for now). Here is the file path: /content/fish-diffusion/configs/base/trainers/base.py

In this file, change line 18 from `precision="bf16-mixed",` to `precision="16-mixed",` and save it. It should work now.
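If you'd rather not edit the config by hand every time, you could pick the precision string at runtime from the GPU's compute capability: bfloat16 needs Ampere-class hardware (compute capability 8.x or newer), and the Colab T4 is Turing (7.5), which is why it fails. This is just a sketch, not part of fish-diffusion; the helper name is mine, and in a real run you'd feed it the tuple from `torch.cuda.get_device_capability()`:

```python
def choose_amp_precision(major: int, minor: int = 0) -> str:
    """Return a PyTorch Lightning precision string for a CUDA device
    with the given compute capability (major, minor).

    bfloat16 autocast is only supported on Ampere (8.x) and newer,
    so older cards such as the Colab T4 (7.5) fall back to fp16.
    """
    return "bf16-mixed" if major >= 8 else "16-mixed"


# In practice (assuming torch is installed and a CUDA device is visible):
#   import torch
#   precision = choose_amp_precision(*torch.cuda.get_device_capability())
# and then pass `precision` to the Lightning Trainer instead of the
# hard-coded "bf16-mixed" in configs/base/trainers/base.py.
```

With this, a T4 (7, 5) yields `"16-mixed"` while an A100 (8, 0) yields `"bf16-mixed"`, so the same config runs on both.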

li-henan commented 2 months ago

Thank you very much for your help, it works now. I look forward to discussing the model's results with you. Sincerely yours!

li-henan commented 2 months ago

Dear friend, this code can fine-tune the text-encoder projection layer plus the diffusion model, or fine-tune HiFi-GAN, but have you fine-tuned ContentVec with it? For example, using a different transformer layer from ContentVec to reduce the model size.

Thank you sincerely for your help.