Closed: mya2152 closed this issue 1 year ago
I inserted that exception in the training code mostly to prevent myself from accidentally training without a pre-trained model; I haven't yet run into a situation where it is raised for any other reason. I would suggest adding a traceback.print_exc() call above the line where the exception is raised to see what the actual problem is.
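For anyone following along, the suggestion amounts to something like the block below around line 120 of train.py. The checkpoint-loading calls inside the try block vary between forks, so treat the two load lines as an approximation of whatever your copy does there; the only real change is the traceback import and the print_exc() call.

import traceback  # add near the other imports at the top of train.py

try:
    # approximate shape of the existing checkpoint-loading block
    _, _, _, epoch_str = utils.load_checkpoint(
        utils.latest_checkpoint_path(hps.model_dir, "G_*.pth"), net_g, optim_g)
    _, _, _, epoch_str = utils.load_checkpoint(
        utils.latest_checkpoint_path(hps.model_dir, "D_*.pth"), net_d, optim_d)
except Exception:
    traceback.print_exc()  # print the underlying error before re-raising
    raise Exception("No pretrained model found")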
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 2.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in
Strange, because I did already run the HuBERT and f0 generation preprocessing step.
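A quick way to double-check that output (a sketch; it assumes the usual dataset/44k/<speaker>/*.wav layout and that preprocessing writes a companion file named <name>.wav.f0.npy next to each wav) is to list any wavs that are missing it:

from pathlib import Path

dataset_root = Path("dataset/44k")  # adjust if your layout differs
missing = [wav for wav in sorted(dataset_root.rglob("*.wav"))
           if not Path(str(wav) + ".f0.npy").exists()]
print(f"{len(missing)} wav file(s) missing a companion .f0.npy")
for wav in missing[:20]:
    print(wav)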
If you're doing this through a service like Colab, is it possible that the dataset folder might not be mounted correctly?
It turns out the HuBERT and f0 preprocessing step isn't actually generating the ".f0.npy" files in the dataset folder anymore, for some reason.
Edit: never mind, I got it to generate them successfully, but now I'm getting the problem below.
When I commented out the raise Exception line (line 120 in train.py), it reported that loading the checkpoint failed, but then training started, though only for a minute or so:
INFO:44k:Saving model and optimizer state at iteration 1 to ./logs/44k/G_0.pth
INFO:44k:Saving model and optimizer state at iteration 1 to ./logs/44k/D_0.pth
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
Traceback (most recent call last):
File "/notebooks/so-vits-svc/train.py", line 328, in
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/notebooks/so-vits-svc/train.py", line 137, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler,
File "/notebooks/so-vits-svc/train.py", line 159, in train_and_evaluate
for batch_idx, items in enumerate(train_loader):
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 634, in next
data = self._next_data()
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 1326, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
data.reraise()
File "/usr/local/lib/python3.9/dist-packages/torch/_utils.py", line 644, in reraise
raise exception
UnboundLocalError: Caught UnboundLocalError in DataLoader worker process 3.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in
I figured that hopefully having only the "configs.json" and the "G_2100.pth" would be enough to resume training, but am I missing something, or would it just not work because I'm missing some sort of checkpoint file that I couldn't save earlier?
You need the D_* during training too
So it seems the D file has to be the exact one that was produced alongside the G file; I was under the impression you could take any D file and couple it with a G file to continue the pretraining.
It's a GAN; the discriminator and generator (D and G) are trained at the same time.
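To make that concrete, here is a generic sketch of how a GAN pair is usually checkpointed and resumed (illustrative only, not this repo's exact utils or dict keys). As far as I understand, each G_*.pth / D_*.pth pair carries the weights and optimizer state from the same iteration, so resuming with a D from a different run or step leaves the adversarial game unbalanced.

import torch

def save_gan_pair(net_g, net_d, optim_g, optim_d, iteration, g_path, d_path):
    # the generator and discriminator are written as a matched pair
    torch.save({"model": net_g.state_dict(),
                "optimizer": optim_g.state_dict(),
                "iteration": iteration}, g_path)
    torch.save({"model": net_d.state_dict(),
                "optimizer": optim_d.state_dict(),
                "iteration": iteration}, d_path)

def load_gan_pair(net_g, net_d, optim_g, optim_d, g_path, d_path):
    g_ckpt = torch.load(g_path, map_location="cpu")
    d_ckpt = torch.load(d_path, map_location="cpu")
    net_g.load_state_dict(g_ckpt["model"])
    net_d.load_state_dict(d_ckpt["model"])
    optim_g.load_state_dict(g_ckpt["optimizer"])
    optim_d.load_state_dict(d_ckpt["optimizer"])
    return g_ckpt["iteration"]  # both sides resume from the same step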
I currently have only the G_2100.pth file and can produce vocals through inference with it and the config.json. However, when I want to train the model further, I place the file into /logs/44k/, but on running the train.py script I keep getting an error telling me:
File "/notebooks/so-vits-svc/train.py", line 120, in run raise Exception("No pretrained model found") Exception: No pretrained model found
I also tried changing the name to "G_0.pth", similar to the original HF pretrained model, but I still get the message. Unfortunately the instance was shut down and I only got the generator "G" file saved, along with the configs.json; is it possible to continue training this?
Thanks in advance
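For anyone who later hits the same "No pretrained model found" message: as far as I can tell the loading step just looks for G_*.pth and D_*.pth in the log directory, so a quick check of what it would actually find (a sketch, assuming the default ./logs/44k path) is:

import glob, os

log_dir = "./logs/44k"  # model_dir from the default config
for pattern in ("G_*.pth", "D_*.pth"):
    matches = glob.glob(os.path.join(log_dir, pattern))
    print(pattern, "->", sorted(matches) if matches else "NOT FOUND")

With only G_2100.pth present, the D_*.pth pattern comes back empty, which is enough to make the loading step fail and trigger the exception.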