capturable=False error

MarkIzhao commented 2 years ago

Win 11, GPU: 3060 laptop
Python 3.9.13

+----------------+------------+---------------+------------------+
| Steps with r=2 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
|   101k Steps   |     16     |     3e-06     |        2         |
+----------------+------------+---------------+------------------+

Could not load symbol cublasGetSmCountTarget from cublas64_11.dll. Error code 127
Traceback (most recent call last):
  File "G:\AIvioce\MockingBird\synthesizer_train.py", line 37, in <module>
    train(**vars(args))
  File "G:\AIvioce\MockingBird\synthesizer\train.py", line 216, in train
    optimizer.step()
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\optim\optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\optim\adam.py", line 157, in step
    adam(params_with_grad,
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\optim\adam.py", line 213, in adam
    func(params,
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\optim\adam.py", line 255, in _single_tensor_adam
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.

MarkIzhao commented 2 years ago

G:\Vioce\MockingBird>python synthesizer_train.py tjc G:\Vioce\tjc001\SV2TTS\synthesizer
Arguments:
    run_id: tjc
    syn_dir: G:\Vioce\tjc001\SV2TTS\synthesizer
    models_dir: synthesizer/saved_models/
    save_every: 1000
    backup_every: 25000
    log_every: 200
    force_restart: False
    hparams:

Checkpoint path: synthesizer\saved_models\tjc\tjc.pt
Loading training data from: G:\Vioce\tjc001\SV2TTS\synthesizer\train.txt
Using model: Tacotron
Using device: cuda

Initialising Tacotron Model...

\Loading the json with %s
{'sample_rate': 16000, 'n_fft': 800, 'num_mels': 80, 'hop_size': 200, 'win_size': 800, 'fmin': 55, 'min_level_db': -100, 'ref_level_db': 20, 'max_abs_value': 4.0, 'preemphasis': 0.97, 'preemphasize': True, 'tts_embed_dims': 512, 'tts_encoder_dims': 256, 'tts_decoder_dims': 128, 'tts_postnet_dims': 512, 'tts_encoder_K': 5, 'tts_lstm_dims': 1024, 'tts_postnet_K': 5, 'tts_num_highways': 4, 'tts_dropout': 0.5, 'tts_cleaner_names': ['basic_cleaners'], 'tts_stop_threshold': -3.4, 'tts_schedule': [[2, 0.001, 10000, 12], [2, 0.0005, 15000, 12], [2, 0.0002, 20000, 12], [2, 0.0001, 30000, 12], [2, 5e-05, 40000, 12], [2, 1e-05, 60000, 12], [2, 5e-06, 160000, 12], [2, 3e-06, 320000, 12], [2, 1e-06, 640000, 12]], 'tts_clip_grad_norm': 1.0, 'tts_eval_interval': 500, 'tts_eval_num_samples': 1, 'tts_finetune_layers': [], 'max_mel_frames': 900, 'rescale': True, 'rescaling_max': 0.9, 'synthesis_batch_size': 16, 'signal_normalization': True, 'power': 1.5, 'griffin_lim_iters': 60, 'fmax': 7600, 'allow_clipping_in_normalization': True, 'clip_mels_length': True, 'use_lws': False, 'symmetric_mels': True, 'trim_silence': True, 'speaker_embedding_size': 256, 'silence_min_duration_split': 0.4, 'utterance_min_duration': 1.6, 'use_gst': True, 'use_ser_for_gst': True}
Trainable Parameters: 32.869M

Loading weights at synthesizer\saved_models\tjc\tjc.pt
Tacotron weights loaded from step 219000
Using inputs from:
    G:\Vioce\tjc001\SV2TTS\synthesizer\train.txt
    G:\Vioce\tjc001\SV2TTS\synthesizer\mels
    G:\Vioce\tjc001\SV2TTS\synthesizer\embeds
Found 872 samples
+----------------+------------+---------------+------------------+
| Steps with r=2 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
|   101k Steps   |     12     |     3e-06     |        2         |
+----------------+------------+---------------+------------------+

Traceback (most recent call last):
  File "G:\Vioce\MockingBird\synthesizer_train.py", line 37, in <module>
    train(**vars(args))
  File "G:\Vioce\MockingBird\synthesizer\train.py", line 215, in train
    optimizer.step()
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\optim\optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\optim\adam.py", line 157, in step
    adam(params_with_grad,
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\optim\adam.py", line 213, in adam
    func(params,
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\optim\adam.py", line 255, in _single_tensor_adam
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.

itaybu commented 2 years ago

Hi, I ran into the same problem:

assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.

Did you manage to solve it?

MarkIzhao commented 2 years ago

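Uninstall PyTorch:
pip uninstall torch
Then install the CUDA 11.6 build of PyTorch. That fixes "Could not load symbol cublasGetSmCountTarget from cublas64_11.dll. Error code 127". However, if I stop training after about five minutes and then restart it, I still get "AssertionError: If capturable=False, state_steps should not be CUDA tensors." I have not found a way to avoid that yet.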

MarkIzhao commented 2 years ago

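CPU: AMD R7 5800H, GPU: RTX 3060 laptop, Win 11. Ending the process manually with Ctrl+C corrupts the model file, which leads to the error "AssertionError: If capturable=False, state_steps should not be CUDA tensors." This is not an isolated case; I am still looking for a fix.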

lawrence124 commented 2 years ago

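Start training: python synthesizer_train.py mandarin /SV2TTS/synthesizer
Stop training with CTRL+C or CTRL+fn+B.
Start training again: python synthesizer_train.py mandarin /SV2TTS/synthesizer
Error: AssertionError: If capturable=False, state_steps should not be CUDA tensors

Does this project not support pausing and resuming training across multiple sessions, the way Deepfacelab does? I have tried contacting the author by email, Zhihu, and Bilibili without success. I hope the author sees this soon and can answer.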

win 11, pytorch 1.9.0, cuda 11.1

Stopped training using CTRL+C and resumed without any problem.

MarkIzhao commented 2 years ago

My PyTorch is 1.12.
Which Python version are you using? I will try switching environments.

lawrence124 commented 2 years ago

3.7.9. I remember some people mentioned that they use a newer version of Python. I guess you can just try to downgrade PyTorch, as the readme.md mentions:

PyTorch worked for pytorch, tested in version of 1.9.0(latest in August 2021), with GPU Tesla T4 and GTX 2060

One more thing: if you end up choosing 1.9.0, I suggest using CUDA 11.1 instead of 10.2. I had errors and crashes during training, and they went away after changing CUDA to 11.1.
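To compare a local setup against the versions mentioned above, something like the following could be run inside the training environment. It is only a sketch: `describe_environment` is a made-up helper name, and the torch import is wrapped in a try/except so that the report still prints when PyTorch is not installed.

```python
import platform


def describe_environment():
    """Return the Python, PyTorch, and CUDA versions of the current environment."""
    info = {
        "python": platform.python_version(),
        "torch": None,
        "cuda": None,
        "gpu": None,
    }
    try:
        import torch  # imported lazily so the report works without PyTorch
    except ImportError:
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None on CPU-only builds
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Posting this output along with the traceback makes it easier for others to tell which PyTorch/CUDA combination is involved.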

MarkIzhao commented 2 years ago

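Thank you very much. With Python 3.9.13, PyTorch 1.9, and CUDA 11.1, training can indeed be resumed without errors after stopping it with Ctrl+C, but the speed dropped from 1.1 steps/s to 0.8 steps/s.
I will keep trying other versions.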

MarkIzhao commented 2 years ago

Win 11, CPU: R7 5800H, GPU: 3060 laptop, Python 3.9.13, torch-1.10.2+cu113-cp39-cp39-win_amd64: training resumes normally with no errors, at 1 step/s.
This issue is confirmed resolved.
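The results reported so far in this thread can be collected into a small lookup, for example to print a warning before resuming from a checkpoint. This is a rough sketch: the table only records what posters here observed, keyed by the major.minor part of `torch.__version__`, and the names `REPORTED_RESUME_RESULTS`, `major_minor`, and `reported_resume_result` are invented for illustration. Patch releases may behave differently from what is listed.

```python
# Resume-after-stop results as reported by posters in this thread.
# Keys are the "major.minor" part of torch.__version__.
REPORTED_RESUME_RESULTS = {
    "1.9": "ok",      # 1.9.0 with CUDA 11.1 (slower: 0.8 steps/s instead of 1.1)
    "1.10": "ok",     # torch-1.10.2+cu113
    "1.12": "fails",  # AssertionError about capturable=False on resume
}


def major_minor(version):
    """Reduce a version string such as '1.10.2+cu113' to '1.10'."""
    release = version.split("+")[0]  # drop the local build tag, e.g. '+cu113'
    return ".".join(release.split(".")[:2])


def reported_resume_result(version):
    """Look up what this thread reported for the given PyTorch version."""
    return REPORTED_RESUME_RESULTS.get(major_minor(version), "unknown")


if __name__ == "__main__":
    for v in ("1.9.0", "1.10.2+cu113", "1.12.0"):
        print(v, "->", reported_resume_result(v))
```

In a training script, the argument would typically be `torch.__version__`.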

jaried commented 2 years ago

Here is what someone else wrote about this:

Hi, I am also facing the same issue when I try to load the checkpoint and resume model training on the latest pytorch (1.12).

It seems to be related with a newly introduced parameter (capturable) for the Adam and AdamW optimizers. Currently two workarounds:

1. Forcing capturable = True after loading the checkpoint (as suggested above): optim.param_groups[0]['capturable'] = True. This seems to slow down the model training by approx. 10% (YMMV depending on the setup).

2. Reverting pytorch back to previous versions (I have been using 1.11.0).

I'm wondering whether enforcing capturable = True may incur unwanted side effects.
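The first workaround quoted above could be wrapped in a small helper that covers every parameter group instead of only `param_groups[0]`. This is a minimal sketch, not the project's own code: `force_capturable` is a hypothetical name, and the helper assumes only that the optimizer exposes `param_groups` as a list of dicts, as `torch.optim.Adam` does. It leaves groups without a `capturable` key untouched, so it does nothing on PyTorch versions that predate the flag.

```python
def force_capturable(optimizer, value=True):
    """Set the `capturable` flag on every parameter group that has one.

    Returns the number of groups that were changed.
    """
    changed = 0
    for group in optimizer.param_groups:
        # Older PyTorch versions have no `capturable` key; skip those groups
        # rather than adding a key the optimizer does not know about.
        if "capturable" in group:
            group["capturable"] = value
            changed += 1
    return changed
```

It would be called once, right after `optimizer.load_state_dict(...)` and before the first `optimizer.step()`. The roughly 10% slowdown mentioned in the quote would presumably still apply.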

I am also worried that capturable=True might bring unwanted side effects, so I plan to roll back to torch 1.11 as well.

The original issue is here: https://github.com/pytorch/pytorch/issues/80809

babysor / MockingBird

capturable=False error #631