Draegon366 opened this issue 1 year ago
Describe the bug
I've been trying to train YourTTS on a Google Compute Engine instance, but it doesn't work with trainer.distribute. When I run it, it gets to the same point in initialization every time, then one of the training workers crashes and the others freeze. I am running largely unchanged code from the provided recipe; I have only reduced the worker count to fit the cloud instance and added my own dataset. Without distributed training it trains fine until it runs out of VRAM, and training locally on a 3090 works fine, if slowly.
TTS is also installed at the latest version; I'm not sure why collect_env_info.py didn't pick it up.
To Reproduce
- Run CUDA_VISIBLE_DEVICES="0,1,2,3" python -m trainer.distribute --script train_yourtts.py on a Google Compute Engine instance
- Wait several seconds
- Observe the error
Expected behavior
The training script runs with the workload split across the GPUs.
Logs
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1666, in fit
    self._fit()
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1618, in _fit
    self.train_epoch()
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1350, in train_epoch
    for cur_step, batch in enumerate(self.train_loader):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.10/site-packages/TTS/tts/models/vits.py", line 263, in __getitem__
    item = self.samples[idx]
TypeError: list indices must be integers or slices, not list
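For context on the final TypeError: the DataLoader's fetcher treats each element yielded by its sampler as a single index, so if the sampler actually yields lists of indices (as a batch sampler does), vits.py ends up calling self.samples[idx] with a list and raises exactly this message. Below is a minimal, self-contained sketch of that failure mode; ToyDataset is only a stand-in for illustration, not the actual VitsDataset or the Trainer's sampler setup.

```python
import torch
from torch.utils.data import DataLoader, Dataset, BatchSampler, SequentialSampler

class ToyDataset(Dataset):
    """Stand-in dataset that indexes a plain Python list, like vits.py's __getitem__."""
    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Raises "list indices must be integers or slices, not list"
        # if `idx` arrives as a list of indices instead of a single int.
        return self.samples[idx]

samples = [{"id": i} for i in range(8)]
dataset = ToyDataset(samples)

# A batch sampler yields *lists* of indices ...
batch_sampler = BatchSampler(SequentialSampler(dataset), batch_size=2, drop_last=False)

# ... so handing it to DataLoader as `sampler=` (instead of `batch_sampler=`)
# makes each "index" a list and reproduces the TypeError from the log above.
broken_loader = DataLoader(dataset, sampler=batch_sampler, batch_size=2)
try:
    next(iter(broken_loader))
except TypeError as exc:
    print(exc)  # list indices must be integers or slices, not list

# Passing it as `batch_sampler=` keeps the per-item indices scalar and works.
ok_loader = DataLoader(dataset, batch_sampler=batch_sampler)
print(next(iter(ok_loader)))
```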
Environment
{ "CUDA": { "GPU": [ "Tesla T4", "Tesla T4", "Tesla T4", "Tesla T4" ], "available": true, "version": "11.7" }, "Packages": { "PyTorch_debug": false, "PyTorch_version": "2.0.1+cu117", "Trainer": "v0.0.27", "numpy": "1.23.5" }, "System": { "OS": "Linux", "architecture": [ "64bit", "" ], "processor": "", "python": "3.10.10", "version": "#1 SMP Debian 5.10.179-1 (2023-05-12)" } }
Additional context
No response
Hello, did you find a way to deal with this? I'm facing the same problem. For some reason they decided not to use spawn in torch DDP; maybe that is the issue?
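For reference, the "spawn" approach mentioned above would look roughly like the sketch below in plain PyTorch: one process per GPU started with torch.multiprocessing.spawn, each initializing a process group and wrapping the model in DistributedDataParallel. This is only an illustrative sketch under that assumption (the worker function and the toy Linear model are placeholders), not how the Coqui Trainer launches distributed training, which as I understand it starts a separate process per GPU via trainer.distribute instead.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Rendezvous settings for a single-node run; assumes the NCCL backend and CUDA GPUs.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 10).to(rank)  # placeholder model
    ddp_model = DDP(model, device_ids=[rank])

    # ... build dataset, DistributedSampler, DataLoader, and train here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # spawn calls worker(rank, world_size) once per GPU in a fresh process.
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```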