coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] ValueError: Cannot load file containing pickled data when allow_pickle=False #1929

Closed · kin0303 closed this issue 2 years ago

kin0303 commented 2 years ago

Describe the bug

I had been training Tacotron 2 for a while and now I want to add sample audio for one speaker. When I run it using

CUDA_VISIBLE_DEVICES=0 python train.py --continue_path /media/DATA-2/TTS/TTS_Coqui/TTS/running-July-28-2022_09+54AM-68cef28a

I got an error like this:

 > Number of output frames: 2

 > EPOCH: 0/1000
 --> /media/DATA-2/TTS/TTS_Coqui/TTS-July-28-2022_09+54AM-68cef28a

> DataLoader initialization
| > Tokenizer:
    | > add_blank: False
    | > use_eos_bos: False
    | > use_phonemes: True
    | > phonemizer:
        | > phoneme language: en-us
        | > phoneme backend: gruut
| > Number of instances : 23359
 | > Preprocessing samples
 | > Max text length: 239
 | > Min text length: 4
 | > Avg text length: 86.08806027655294
 | 
 | > Max audio length: 1145718.0
 | > Min audio length: 11868.0
 | > Avg audio length: 519904.13767712656
 | > Num. instances discarded samples: 0
 | > Batch group size: 0.

 > TRAINING (2022-09-01 11:28:31) 
/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/models/tacotron2.py:333: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  ) // self.decoder.r
/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/models/tacotron2.py:335: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  alignment_lengths = mel_lengths // self.decoder.r

   --> STEP: 9/5840 -- GLOBAL_STEP: 1690010
     | > decoder_loss: 1.35190  (2.06165)
     | > postnet_loss: 1.23185  (1.89519)
     | > stopnet_loss: 0.45206  (0.54466)
     | > decoder_coarse_loss: 1.96557  (2.80050)
     | > decoder_ddc_loss: 0.05431  (0.06398)
     | > ga_loss: 0.00554  (0.01036)
     | > decoder_diff_spec_loss: 0.46238  (0.58947)
     | > postnet_diff_spec_loss: 0.40906  (0.52605)
     | > decoder_ssim_loss: 0.48877  (0.48201)
     | > postnet_ssim_loss: 0.45778  (0.45322)
     | > loss: 2.08516  (2.81450)
     | > align_error: 0.38218  (0.36455)
     | > grad_norm: 11.03733  (13.36171)
     | > current_lr: 0.00000 
     | > step_time: 0.16360  (0.17053)
     | > loader_time: 0.00130  (0.00129)

   --> STEP: 19/5840 -- GLOBAL_STEP: 1690020
     | > decoder_loss: 1.26435  (2.00329)
     | > postnet_loss: 1.14596  (1.83944)
     | > stopnet_loss: 0.15051  (0.49044)
     | > decoder_coarse_loss: 1.96471  (2.79364)
     | > decoder_ddc_loss: 0.03852  (0.05443)
     | > ga_loss: 0.00158  (0.00696)
     | > decoder_diff_spec_loss: 0.44740  (0.57787)
     | > postnet_diff_spec_loss: 0.39480  (0.51306)
     | > decoder_ssim_loss: 0.43631  (0.47875)
     | > postnet_ssim_loss: 0.40454  (0.44884)
     | > loss: 1.68256  (2.70255)
     | > align_error: 0.32000  (0.36616)
     | > grad_norm: 6.11971  (12.52853)
     | > current_lr: 0.00000 
     | > step_time: 0.22500  (0.19586)
     | > loader_time: 0.00150  (0.00125)

 ! Run is kept in /media/DATA-2/TTS/TTS_Coqui/TTS-July-28-2022_09+54AM-68cef28a
Traceback (most recent call last):
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/trainer/trainer.py", line 1492, in fit
    self._fit()
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/trainer/trainer.py", line 1476, in _fit
    self.train_epoch()
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/trainer/trainer.py", line 1254, in train_epoch
    for cur_step, batch in enumerate(self.train_loader):
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
ValueError: Caught ValueError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/datasets/dataset.py", line 180, in __getitem__
    return self.load_data(idx)
  File "/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/datasets/dataset.py", line 230, in load_data
    token_ids = self.get_token_ids(idx, item["text"])
  File "/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/datasets/dataset.py", line 213, in get_token_ids
    token_ids = self.get_phonemes(idx, text)["token_ids"]
  File "/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/datasets/dataset.py", line 196, in get_phonemes
    out_dict = self.phoneme_dataset[idx]
  File "/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/datasets/dataset.py", line 563, in __getitem__
    ids = self.compute_or_load(item["audio_file"], item["text"])
  File "/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/datasets/dataset.py", line 579, in compute_or_load
    ids = np.load(cache_path)
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/numpy/lib/npyio.py", line 445, in load
    raise ValueError("Cannot load file containing pickled data "
ValueError: Cannot load file containing pickled data when allow_pickle=False

Environment

{
"CUDA": {
"GPU": [
"NVIDIA GeForce GTX 1660 Ti"
],
"available": true,
"version": "10.2"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "1.11.0+cu102",
"TTS": "0.6.1",
"numpy": "1.19.5"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
"ELF"
],
"processor": "x86_64",
"python": "3.8.0",
"version": "#118~18.04.1-Ubuntu SMP Thu Mar 3 13:53:15 UTC 2022"
}
}
p0p4k commented 2 years ago

I can suggest 2 fixes that you might try:

  1. Move the cache folder temporarily to a different location and let it rebuild.
  2. Add allow_pickle=True to the np.load(cache_path) call, i.e. np.load(cache_path, allow_pickle=True), at /media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/datasets/dataset.py line 579 (see the sketch below).
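
For reference, here is a rough sketch of what fix 2 looks like. The method body is paraphrased from the traceback above, not copied from the repo, and the helper names (get_cache_path, tokenizer.text_to_ids) are placeholders for whatever PhonemeDataset actually uses; the only real change is the allow_pickle=True argument.

```python
import numpy as np

def compute_or_load(self, wav_file, text):
    """Load cached phoneme token ids, or compute and cache them (sketch only)."""
    cache_path = self.get_cache_path(wav_file)        # placeholder helper name
    try:
        # was: ids = np.load(cache_path)
        ids = np.load(cache_path, allow_pickle=True)  # accept the pickled cache entry
    except FileNotFoundError:
        ids = self.tokenizer.text_to_ids(text)        # placeholder recompute path
        np.save(cache_path, ids)                      # np.save pickles object arrays by default
    return ids
```
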
Edresson commented 2 years ago

@blackmamba1122 Looks like some of the cached phonemes are corrupted. You need to delete the cache directory or change the phoneme cache directory (the "phoneme_cache_path" parameter in the config), forcing TTS to recompute them.
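
If you are resuming with --continue_path, the simplest way is to edit the config.json inside the run folder. A minimal sketch, assuming the Trainer's usual layout where config.json sits in the run directory (the new cache path below is just a placeholder):

```python
import json

# Run folder from the command above
run_dir = "/media/DATA-2/TTS/TTS_Coqui/TTS/running-July-28-2022_09+54AM-68cef28a"
cfg_path = f"{run_dir}/config.json"

with open(cfg_path) as f:
    cfg = json.load(f)

# Point to a fresh, empty directory so phonemes get recomputed on the next run.
cfg["phoneme_cache_path"] = "/media/DATA-2/TTS/TTS_Coqui/phoneme_cache_new"

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=4)
```
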

kin0303 commented 2 years ago

I can suggest 2 fixes that you might try:

  1. Move the cache folder temporarily to a different location and let it rebuild.
  2. Add allow_pickle=True to the np.load(cache_path) call, i.e. np.load(cache_path, allow_pickle=True), at /media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/datasets/dataset.py line 579.

I've tried number 2, but I got this error:

   --> STEP: 1209/5840 -- GLOBAL_STEP: 1691210
     | > decoder_loss: 0.56349  (0.80207)
     | > postnet_loss: 0.49640  (0.72210)
     | > stopnet_loss: 0.85311  (0.30274)
     | > decoder_coarse_loss: 0.88406  (1.23891)
     | > decoder_ddc_loss: 0.00170  (0.00865)
     | > ga_loss: 0.00004  (0.00041)
     | > decoder_diff_spec_loss: 0.36727  (0.39819)
     | > postnet_diff_spec_loss: 0.32913  (0.35275)
     | > decoder_ssim_loss: 0.12858  (0.25808)
     | > postnet_ssim_loss: 0.11792  (0.23859)
     | > loss: 1.57546  (1.30963)
     | > align_error: 0.60392  (0.42270)
     | > grad_norm: 1.56272  (3.97671)
     | > current_lr: 0.00000 
     | > step_time: 2.24570  (1.13703)
     | > loader_time: 0.00220  (0.00179)

 ! Run is kept in /media/DATA-2/TTS/TTS_Coqui/TTS-July-28-2022_09+54AM-68cef28a
Traceback (most recent call last):
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/trainer/trainer.py", line 1492, in fit
    self._fit()
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/trainer/trainer.py", line 1476, in _fit
    self.train_epoch()
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/trainer/trainer.py", line 1254, in train_epoch
    for cur_step, batch in enumerate(self.train_loader):
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/datasets/dataset.py", line 180, in __getitem__
    return self.load_data(idx)
  File "/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/datasets/dataset.py", line 230, in load_data
    token_ids = self.get_token_ids(idx, item["text"])
  File "/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/datasets/dataset.py", line 213, in get_token_ids
    token_ids = self.get_phonemes(idx, text)["token_ids"]
  File "/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/datasets/dataset.py", line 198, in get_phonemes
    assert len(out_dict["token_ids"]) > 0
AssertionError
kin0303 commented 2 years ago

@blackmamba1122 Looks like some of the cached phonemes are corrupted. You need to delete the cache directory or change the phoneme cache directory (the "phoneme_cache_path" parameter in the config), forcing TTS to recompute them.

I'll try this one and report back.

kin0303 commented 2 years ago

@blackmamba1122 Looks like some of the cached phonemes are corrupted. You need to delete the cache directory or change the phoneme cache directory (the "phoneme_cache_path" parameter in the config), forcing TTS to recompute them.

I'll try this one and report back.

Still getting the error:

   --> STEP: 1209/5840 -- GLOBAL_STEP: 1691210
     | > decoder_loss: 0.58440  (0.80216)
     | > postnet_loss: 0.51266  (0.72178)
     | > stopnet_loss: 0.84992  (0.29996)
     | > decoder_coarse_loss: 0.89247  (1.24166)
     | > decoder_ddc_loss: 0.00162  (0.00863)
     | > ga_loss: 0.00004  (0.00034)
     | > decoder_diff_spec_loss: 0.37415  (0.39893)
     | > postnet_diff_spec_loss: 0.33291  (0.35312)
     | > decoder_ssim_loss: 0.12760  (0.25786)
     | > postnet_ssim_loss: 0.11682  (0.23833)
     | > loss: 1.58579  (1.30728)
     | > align_error: 0.60102  (0.41433)
     | > grad_norm: 4.16512  (4.11554)
     | > current_lr: 0.00000 
     | > step_time: 2.44260  (1.16248)
     | > loader_time: 0.00260  (0.00190)

 ! Run is kept in /media/DATA-2/TTS/TTS_Coqui/TTS-July-28-2022_09+54AM-68cef28a
Traceback (most recent call last):
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/trainer/trainer.py", line 1492, in fit
    self._fit()
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/trainer/trainer.py", line 1476, in _fit
    self.train_epoch()
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/trainer/trainer.py", line 1254, in train_epoch
    for cur_step, batch in enumerate(self.train_loader):
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1204, in _next_data
    return self._process_data(data)
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/media/DATA-2/TTS/TTS_Coqui/coqui_env/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/datasets/dataset.py", line 180, in __getitem__
    return self.load_data(idx)
  File "/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/datasets/dataset.py", line 230, in load_data
    token_ids = self.get_token_ids(idx, item["text"])
  File "/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/datasets/dataset.py", line 213, in get_token_ids
    token_ids = self.get_phonemes(idx, text)["token_ids"]
  File "/media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/datasets/dataset.py", line 198, in get_phonemes
    assert len(out_dict["token_ids"]) > 0
AssertionError
kin0303 commented 2 years ago

I am done with this problem. If you run into it, you can try the following:

  1. Move the cache folder temporarily to a different location and let it rebuild (see the cache-check sketch below).
  2. Add allow_pickle=True to np.load(cache_path), i.e. np.load(cache_path, allow_pickle=True), at /media/DATA-2/TTS/TTS_Coqui/TTS/TTS/tts/datasets/dataset.py line 579.
  3. Or read this issue: https://github.com/coqui-ai/TTS/issues/1624
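
For the follow-up AssertionError (assert len(out_dict["token_ids"]) > 0), a quick way to find the offending entries is to scan the cache and delete anything that cannot be loaded or that decodes to zero tokens, so they are recomputed on the next run. This is only a sketch: the cache directory is a placeholder and the .npy layout is an assumption.

```python
from pathlib import Path
import numpy as np

cache_dir = Path("/media/DATA-2/TTS/TTS_Coqui/phoneme_cache")  # placeholder path

for npy_file in cache_dir.glob("*.npy"):
    try:
        ids = np.load(npy_file, allow_pickle=True)
    except Exception as err:                      # unreadable / corrupted entry
        print(f"deleting unreadable cache entry: {npy_file} ({err})")
        npy_file.unlink()
        continue
    if np.atleast_1d(ids).size == 0:              # would trip the assertion above
        print(f"deleting empty cache entry: {npy_file}")
        npy_file.unlink()
```
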
dveni commented 1 year ago

Hi there,

I have to test it further, but in a multi-GPU setting some of the workers fail with the error ValueError: Cannot load file containing pickled data when allow_pickle=False. I don't know exactly why, but passing allow_pickle=True to the np.load call in the compute_or_load method of the PhonemeDataset class seems to fix the issue.

I think this may be because np.save allows pickling by default, while the load function doesn't. I'm not sure why this is only a problem in the multi-GPU setting.
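
For what it's worth, the asymmetry is easy to reproduce outside of TTS. One plausible way to end up with a pickled .npy is an object-dtype (ragged) array, which np.save stores via pickle by default while np.load refuses by default:

```python
import numpy as np

ids = np.array([[1, 2, 3], [4, 5]], dtype=object)   # ragged, so stored via pickle
np.save("cache_entry.npy", ids)                      # np.save defaults to allow_pickle=True

try:
    np.load("cache_entry.npy")                       # np.load defaults to allow_pickle=False
except ValueError as err:
    print(err)  # Cannot load file containing pickled data when allow_pickle=False

print(np.load("cache_entry.npy", allow_pickle=True))  # loads fine with the proposed flag
```
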

I'll post updates here, but I'd propose passing allow_pickle=True to the load function, since the phoneme cache is created by the library itself and there isn't a big security risk. What do you think?