b04901014 / UUVC

Official implementation for the paper: A Unified One-Shot Prosody and Speaker Conversion System with Self-Supervised Discrete Speech Units.
MIT License

RuntimeError: stack expects a non-empty TensorList #2

Closed: dillfrescott closed this issue 1 year ago

dillfrescott commented 1 year ago
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Total 23 examples, average length 28.4123410326087 seconds.
Total 2 examples, average length 22.654843749999998 seconds.
Some weights of the model checkpoint at facebook/wav2vec2-base were not used when initializing Wav2Vec2Model: ['quantizer.weight_proj.weight', 'project_q.bias', 'quantizer.codevectors', 'project_hid.bias', 'project_q.weight', 'quantizer.weight_proj.bias', 'project_hid.weight']
- This IS expected if you are initializing Wav2Vec2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Removing weight norm...
Missing logger folder: ./logs/RV
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

   | Name            | Type              | Params
-------------------------------------------------------
0  | RVEncoder       | Speech2Vector     | 40.6 M
1  | dp              | DurationPredictor | 230 K 
2  | f0p             | F0Predictor       | 1.3 M 
3  | vp              | F0Predictor       | 230 K 
4  | Ep              | EPredictor        | 643 K 
5  | u2m             | Unit2Mel          | 38.9 M
6  | embedding       | Embedding         | 102 K 
7  | D               | Discriminator     | 2.4 M 
8  | duration_length | LengthRegulator   | 0     
9  | gan_loss        | GANLoss           | 0     
10 | bce_loss        | BCEWithLogitsLoss | 0     
11 | vocoder         | Vocoder           | 13.9 M
-------------------------------------------------------
80.2 M    Trainable params
18.1 M    Non-trainable params
98.3 M    Total params
393.221   Total estimated model params size (MB)
Epoch 4: : 22it [00:09,  2.26it/s, v_num=0, train/dur=0.195, train/mel=0.944, train/f0=0.130, train/voiced=0.229, train/E_loss=0.296, train/G=0.211, train/D=0.529]
Validation: 0it [00:00, ?it/s]
Validation:   0% 0/1 [00:00<?, ?it/s]
Validation DataLoader 0:   0% 0/1 [00:00<?, ?it/s]
Epoch 4: : 23it [00:12,  1.80it/s, v_num=0, train/dur=0.195, train/mel=0.944, train/f0=0.130, train/voiced=0.229, train/E_loss=0.296, train/G=0.211, train/D=0.529]
tcmalloc: large alloc 1336705024 bytes == 0x55b3bfd52000 @  0x7fc336fdd615 0x55b2ea1afad7 0x55b2ea2193db 0x55b2ea219293 0x55b2ea1839cf 0x55b2ea196292 0x7fc315c785fe 0x7fc2edee3ee5 0x7fc2edede447 0x7fc2edee5569 0x7fc315c8b0bb 0x7fc31588b6af 0x55b2ea193f8e 0x55b2ea17c651 0x55b2ea193d0d 0x55b2ea177fec 0x55b2ea172b66 0x55b2ea1843fc 0x55b2ea173a9b 0x55b2ea172451 0x55b2ea1843fc 0x55b2ea177fec 0x55b2ea184366 0x55b2ea173a9b 0x55b2ea172451 0x55b2ea19396c 0x55b2ea1747fa 0x55b2ea172451 0x55b2ea19396c 0x55b2ea1747fa 0x55b2ea172451
Epoch 8: : 4it [00:03,  1.11it/s, v_num=0, train/dur=0.142, train/mel=1.060, train/f0=0.095, train/voiced=0.548, train/E_loss=0.247, train/G=0.276, train/D=0.493] Traceback (most recent call last):
  File "train.py", line 123, in <module>
    wrapper.fit(model)
  File "/usr/local/envs/diff/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/usr/local/envs/diff/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/envs/diff/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/envs/diff/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/usr/local/envs/diff/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/usr/local/envs/diff/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/usr/local/envs/diff/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/usr/local/envs/diff/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/envs/diff/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/usr/local/envs/diff/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 174, in advance
    batch = next(data_fetcher)
  File "/usr/local/envs/diff/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/usr/local/envs/diff/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 256, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/usr/local/envs/diff/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 277, in _fetch_next_batch
    batch = next(iterator)
  File "/usr/local/envs/diff/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 557, in __next__
    return self.request_next_batch(self.loader_iters)
  File "/usr/local/envs/diff/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 569, in request_next_batch
    return apply_to_collection(loader_iters, Iterator, next)
  File "/usr/local/envs/diff/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/usr/local/envs/diff/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/usr/local/envs/diff/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1313, in _next_data
    return self._process_data(data)
  File "/usr/local/envs/diff/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/usr/local/envs/diff/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 5.
Original Traceback (most recent call last):
  File "/usr/local/envs/diff/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/envs/diff/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/content/drive/MyDrive/UUVC-master/data/dataset.py", line 157, in seqCollate
    output[k] = torch.stack(output[k])
RuntimeError: stack expects a non-empty TensorList
dillfrescott commented 1 year ago

Happened twice now at different spots

b04901014 commented 1 year ago

Hi, I'm not sure how to help you debug this as you are working on a different dataset and configuration. It looks like you have only 23 training samples with about 30 seconds per utterance.

The error message suggests that something goes wrong while processing the input data. It may also be a bug that only shows up when you have a small amount of data.

I suspect the sampler is the problem. Since you have only 23 samples, bucketing no longer helps and may itself cause the error. You can try replacing https://github.com/b04901014/UUVC/blob/master/trainer.py#L33 with something like:

def train_dataloader(self):
    dataset = data.DataLoader(self.traindata,
                              num_workers=self.hp.nworkers,
                              batch_size=1,
                              collate_fn=self.traindata.seqCollate)
    return dataset

to disable bucketing, and see if that resolves the error.
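
For readers hitting the same message: the exception itself comes from torch.stack being called on an empty list inside seqCollate. A minimal sketch of the failure mode (not the repo's code, just the bare PyTorch call):

import torch

# torch.stack requires at least one tensor; an empty list reproduces the error.
tensors = []                  # e.g. a batch key for which no tensors were collected
torch.stack(tensors)          # RuntimeError: stack expects a non-empty TensorList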

dillfrescott commented 1 year ago

Thank you so much! One other question:

How would I resume training from a ckpt?

b04901014 commented 1 year ago

You can do so by passing the --resume_checkpoint or --pretrained_path argument. --pretrained_path only loads the model weights and trains from scratch, while --resume_checkpoint restores all the training state (e.g., optimizer parameters, number of epochs).
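
For example (the checkpoint path below is hypothetical, and any other arguments your run already uses stay the same):

# restores optimizer state, epoch counter, etc. and continues training
python train.py --resume_checkpoint logs/RV/checkpoints/last.ckpt

# loads only the model weights and starts training from scratch
python train.py --pretrained_path logs/RV/checkpoints/last.ckpt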

dillfrescott commented 1 year ago

Thank you so much again! I'm gonna close this issue because the code you provided seems to have fixed the issue (I'm at 15 epochs so far and it hasn't errored out)