848150331 opened this issue 2 years ago (status: Open)
Does it reproduce every time? Try turning the batch size down.
Yes, it reproduces every time; sometimes it already shows up at around step 400. This is on a 3090 with batch size 96; I've now set it to 12 to test.
96 sounds a bit dicey; try the smaller batch size first and see.
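The batch-size suggestion is essentially a memory hypothesis: CUBLAS_STATUS_EXECUTION_FAILED can stand in for cuBLAS running out of memory mid-call. A rough way to check whether the card is actually near its limit at bs 96 is to log allocator stats each step while lowering the batch size; this is a generic PyTorch sketch, not MockingBird's own logging.

```python
import torch

def log_gpu_memory(step):
    # Rough telemetry to call once per training step: if allocated/reserved
    # memory sits close to the card's total, the CUBLAS failure is plausibly
    # a memory problem and a smaller batch size should help.
    alloc = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    total = torch.cuda.get_device_properties(0).total_memory / 2**30
    print(f"step {step}: {alloc:.1f} GiB allocated, "
          f"{reserved:.1f} GiB reserved, {total:.1f} GiB total")
```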
{| Epoch: 234/13750 (4/4) | Loss: 0.2344 | 0.43 steps/s | Step: 105k | }
{| Epoch: 235/13750 (4/4) | Loss: 0.2343 | 0.43 steps/s | Step: 105k | }
{| Epoch: 236/13750 (4/4) | Loss: 0.2342 | 0.43 steps/s | Step: 105k | }
{| Epoch: 237/13750 (4/4) | Loss: 0.2342 | 0.43 steps/s | Step: 105k | }
Traceback (most recent call last):
  File "synthesizer_train.py", line 37, in <module>
    train(**vars(args))
  File "C:\MachineLearning_Data\MockingBird-main\synthesizer\train.py", line 209, in train
    loss.backward()
  File "d:\app\Anaconda3\envs\faceswap\lib\site-packages\torch\_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "d:\app\Anaconda3\envs\faceswap\lib\site-packages\torch\autograd\__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
I also hit this error after training for a while. bs was 96; changing it to 32 makes no difference. 22 GB of VRAM.
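Worth noting that CUDA errors like this are often reported asynchronously, so the loss.backward() frame in the traceback is not necessarily the op that actually failed. One way to narrow it down (a generic PyTorch debugging sketch, not part of this repo) is to force synchronous kernel launches and confirm that a plain float32 matmul, which goes through the same cublasSgemm path named in the error, still works on the card at all:

```python
import os

# Force synchronous kernel launches so the Python traceback points at the op
# that really failed (asynchronous reporting can blame the wrong line).
# Must be set before the first CUDA call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Quick health check for cuBLAS itself; requires a CUDA-capable GPU.
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b
torch.cuda.synchronize()
print("cublasSgemm OK:", tuple(c.shape))
```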
I've run into that error before too; it turned out to be caused by a mismatch between the preprocessing and the dataset. Check your preprocessing and dataset, or re-run the preprocessing, and see whether the error comes back. I no longer get that error; now I'm getting the error from the original post instead.
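If a preprocessing/dataset mismatch really is the trigger, it usually shows up as out-of-range text token ids or NaN/inf mel or embedding values being fed to the model. A hedged sanity-check sketch, assuming the (texts, mels, embeds, idx) batch layout visible in the synthesizer/train.py loop quoted in the second traceback below; n_symbols (the size of the text-embedding table) is an assumption you would read from your own hparams:

```python
import torch

def check_batches(data_loader, n_symbols):
    """Hypothetical sanity check over the synthesizer DataLoader.

    Assumes each batch unpacks as (texts, mels, embeds, idx); n_symbols is the
    size of the text-embedding table (an assumption -- take it from hparams).
    """
    for i, (texts, mels, embeds, idx) in enumerate(data_loader, 1):
        if texts.min().item() < 0 or texts.max().item() >= n_symbols:
            print(f"batch {i}: text token id out of range -> items {idx}")
        if not torch.isfinite(mels).all():
            print(f"batch {i}: NaN/inf in mel spectrograms -> items {idx}")
        if not torch.isfinite(embeds).all():
            print(f"batch {i}: NaN/inf in speaker embeddings -> items {idx}")
```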
Damn, right after I said that I got the exact same error as you. You must be cursed, and the contagious kind at that -.-
I went to 天河城 the day before yesterday, took a nucleic acid test yesterday, and it came back negative today, so nothing to worry about.
Training often errors out at around step 1200 with:
Traceback (most recent call last):
  File "synthesizer_train.py", line 37, in <module>
    train(**vars(args))
  File "E:\MockingBird-main\synthesizer\train.py", line 180, in train
    for i, (texts, mels, embeds, idx) in enumerate(data_loader, 1):
  File "E:\python3.7\lib\site-packages\torch\utils\data\dataloader.py", line 352, in __iter__
    return self._get_iterator()
  File "E:\python3.7\lib\site-packages\torch\utils\data\dataloader.py", line 294, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "E:\python3.7\lib\site-packages\torch\utils\data\dataloader.py", line 801, in __init__
    w.start()
  File "E:\python3.7\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "E:\python3.7\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "E:\python3.7\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "E:\python3.7\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "E:\python3.7\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "E:\python3.7\lib\multiprocessing\synchronize.py", line 104, in __getstate__
    h = context.get_spawning_popen().duplicate_for_child(sl.handle)
  File "E:\python3.7\lib\multiprocessing\popen_spawn_win32.py", line 95, in duplicate_for_child
    return reduction.duplicate(handle, self.sentinel)
  File "E:\python3.7\lib\multiprocessing\reduction.py", line 77, in duplicate
    0, inheritable, _winapi.DUPLICATE_SAME_ACCESS)
PermissionError: [WinError 5] 拒绝访问。 (Access is denied.)
PS E:\MockingBird-main>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "E:\python3.7\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "E:\python3.7\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
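This second traceback is the classic Windows spawn-mode DataLoader failure: each worker process re-imports the script and re-pickles the dataset, and here duplicating a synchronization handle for the child is denied, so the child then dies with EOFError: Ran out of input. The usual workarounds are to run the loader with num_workers=0, or to make sure training only starts under an if __name__ == "__main__": guard. A minimal generic sketch (placeholder dataset, not MockingBird's actual loader):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(dataset, batch_size):
    # num_workers=0 keeps loading in the main process, which sidesteps the
    # handle-duplication step that fails in the traceback above.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=0, pin_memory=True)

if __name__ == "__main__":
    # On Windows, spawn-based workers re-import this file, so any training code
    # must live under this guard if num_workers > 0 is ever used.
    data = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
    for xb, yb in make_loader(data, batch_size=8):
        pass
```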