ttjensen opened this issue 6 months ago
It looks like I may have been able to get a little more information by sending a KeyboardInterrupt:
Keyboard interruption in main thread... closing server.
[Training] [2024-05-08T15:48:46.635992] Disabled distributed training.
[Training] [2024-05-08T15:48:46.635992] Path already exists. Rename it to [./training\threeDog\finetune_archived_240508-154055]
[Training] [2024-05-08T15:48:46.636971] Loading from ./models/tortoise/dvae.pth
[Training] [2024-05-08T15:48:46.636971] Traceback (most recent call last):
[Training] [2024-05-08T15:48:46.636971] File "D:\Projects\threeDog\ai-voice-cloning\src\train.py", line 72, in <module>
[Training] [2024-05-08T15:48:46.636971] train(config_path, args.launcher)
[Training] [2024-05-08T15:48:46.636971] File "D:\Projects\threeDog\ai-voice-cloning\src\train.py", line 39, in train
[Training] [2024-05-08T15:48:46.637968] trainer.do_training()
[Training] [2024-05-08T15:48:46.637968] File "D:\Projects\threeDog\ai-voice-cloning\modules\dlas\dlas\train.py", line 408, in do_training
[Training] [2024-05-08T15:48:46.638975] metric = self.do_step(train_data)
[Training] [2024-05-08T15:48:46.638975] ^^^^^^^^^^^^^^^^^^^^^^^^
[Training] [2024-05-08T15:48:46.638975] File "D:\Projects\threeDog\ai-voice-cloning\modules\dlas\dlas\train.py", line 271, in do_step
[Training] [2024-05-08T15:48:46.639963] gradient_norms_dict = self.model.optimize_parameters(
[Training] [2024-05-08T15:48:46.639963] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Training] [2024-05-08T15:48:46.639963] File "D:\Projects\threeDog\ai-voice-cloning\modules\dlas\dlas\trainer\ExtensibleTrainer.py", line 321, in optimize_parameters
[Training] [2024-05-08T15:48:46.639963] ns = step.do_forward_backward(
[Training] [2024-05-08T15:48:46.640960] ^^^^^^^^^^^^^^^^^^^^^^^^^
[Training] [2024-05-08T15:48:46.640960] File "D:\Projects\threeDog\ai-voice-cloning\modules\dlas\dlas\trainer\steps.py", line 322, in do_forward_backward
[Training] [2024-05-08T15:48:46.640960] self.scaler.scale(total_loss).backward()
[Training] [2024-05-08T15:48:46.640960] File "D:\Projects\threeDog\ai-voice-cloning\venv\Lib\site-packages\torch\_tensor.py", line 525, in backward
[Training] [2024-05-08T15:48:46.641962] torch.autograd.backward(
[Training] [2024-05-08T15:48:46.641962] File "D:\Projects\threeDog\ai-voice-cloning\venv\Lib\site-packages\torch\autograd\__init__.py", line 267, in backward
[Training] [2024-05-08T15:48:46.641962] _engine_run_backward(
[Training] [2024-05-08T15:48:46.642968] File "D:\Projects\threeDog\ai-voice-cloning\venv\Lib\site-packages\torch\autograd\graph.py", line 744, in _engine_run_backward
[Training] [2024-05-08T15:48:46.642968] return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[Training] [2024-05-08T15:48:46.642968] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Training] [2024-05-08T15:48:46.642968] KeyboardInterrupt
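If it helps anyone else debugging this, the interrupt at least shows the process was sitting inside self.scaler.scale(total_loss).backward(), i.e. the backward pass, when it hung. A minimal sketch of getting the same kind of stack dump without killing the run, assuming you can add a couple of lines near the top of src/train.py (placement is my own guess, not part of the repo), is Python's built-in faulthandler:

# Hypothetical addition near the top of src/train.py: periodically dump the
# stack of every thread to stderr, so a hang shows where training is stuck
# without having to send a KeyboardInterrupt.
import faulthandler

# Print all thread tracebacks every 300 seconds; repeat=True keeps dumping
# until the process exits, exit=False means it only reports and never kills.
faulthandler.dump_traceback_later(300, repeat=True, exit=False)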
For me there is a significant pause between this warnings.warn() and the training showing up in the UI. It eventually does appear, but I sometimes have to wait up to 5 minutes...
I've been able to successfully prepare my dataset and configuration, but no matter which configuration settings I change, I always get stuck right near the beginning of training. There is no activity in the UI once I reach this point, and no further activity in the console. I also don't believe I'm getting the full error message here; it simply ends with
[Training] [2024-05-08T15:41:17.382663] warnings.warn(
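In case it helps with diagnosing this: since the output cuts off at warnings.warn(, one thing that can surface the full warning is to make Python treat warnings as errors, so the complete message and traceback get printed. A minimal sketch, assuming you can either set PYTHONWARNINGS=error before launching or edit src/train.py (the placement is my assumption):

# Turn every warning into an exception so the full message and a complete
# traceback are printed instead of the clipped warnings.warn( line.
import warnings
warnings.simplefilter("error")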
For anyone who encountered a similar issue, were you able to get past it?