bene-ges / nemo_compatible

useful things that work with NVIDIA NeMo library
Apache License 2.0

Training leads to errors. Ubuntu #20

Closed Vubni closed 8 months ago

Vubni commented 8 months ago

I may be making a stupid mistake. The solutions I found for some of these errors all suggested changing the script, but I doubt that the NeMo script is wrong, so I assume the mistake is mine. I am trying to train a model on my own 50 audio files (I understand that this is very little data). Here is roughly what my marks.txt looks like:

voice/00001.wav|Начинаю диагностику системы
voice/00002.wav|К сожалению его невозможно синтезировать
voice/00003.wav|Да, сэр
voice/00004.wav|К вашим услугам, сэр
voice/00005.wav|Да сэр
...
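For readers following along, the `path|text` layout above can be turned into manifest entries roughly as follows. This is a minimal sketch, not the repository's own script: the helper name `marks_to_manifest_lines` is hypothetical, and the field names `audio_filepath` and `text` are assumed from the usual NeMo manifest convention (a real manifest normally also carries a `duration` field, which is omitted here).

```python
import json


def marks_to_manifest_lines(marks_text):
    """Convert 'path|text' lines into JSON-lines manifest entries (hypothetical helper)."""
    entries = []
    for raw in marks_text.splitlines():
        line = raw.strip()
        if not line or "|" not in line:
            continue  # skip blank lines and placeholders such as "..."
        path, text = line.split("|", 1)
        entry = {"audio_filepath": path.strip(), "text": text.strip()}
        # ensure_ascii=False keeps the Cyrillic text readable in the output
        entries.append(json.dumps(entry, ensure_ascii=False))
    return entries


sample = "voice/00001.wav|Начинаю диагностику системы\n...\nvoice/00003.wav|Да, сэр"
manifest_lines = marks_to_manifest_lines(sample)
```

Splitting on the first `|` only means that a transcript containing a `|` character would still be kept whole.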

And here is the output I get:

./train1.sh
./train1.sh: line 1: #!/bin/bash: No such file or directory
./train1.sh: line 2: conda: command not found
fatal: destination path 'ru_g2p_ipa_bert_large' already exists and is not an empty directory.
[NeMo W 2024-03-20 18:55:16 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

2024-03-20 18:55:16.268871: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-20 18:55:16.815464: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[NeMo W 2024-03-20 18:55:17 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

[NeMo W 2024-03-20 18:55:18 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
[NeMo I 2024-03-20 18:55:18 helpers:60] Restoring pretrained itn model from ru_g2p_ipa_bert_large/ru_g2p.nemo
[NeMo I 2024-03-20 18:55:19 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: DeepPavlov/rubert-base-cased, vocab_file: /tmp/tmpe08dl86p/c09b2638681e4862bdffa78433689e48_vocab.txt, merges_files: None, special_tokens_dict: {}, and use_fast: False
Using eos_token, but it is not set yet.
Using bos_token, but it is not set yet.
[NeMo W 2024-03-20 18:55:19 modelPT:251] You tried to register an artifact under config key=tokenizer.vocab_file but an artifact for it has already been registered.
[NeMo W 2024-03-20 18:55:19 nlp_overrides:454] Apex was not found. Please see the NeMo README for installation instructions: https://github.com/NVIDIA/apex
    Megatron-based models require Apex to function correctly.
[NeMo W 2024-03-20 18:55:19 lm_utils:91] DeepPavlov/rubert-base-cased is not in get_pretrained_lm_models_list(include_external=False), will be using AutoModel from HuggingFace.
Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[NeMo W 2024-03-20 18:55:22 modelPT:251] You tried to register an artifact under config key=language_model.config_file but an artifact for it has already been registered.
[NeMo I 2024-03-20 18:55:23 save_restore_connector:249] Model ThutmoseTaggerModel was successfully restored from /home/egor/synthesys/ru_g2p_ipa_bert_large/ru_g2p.nemo.
[NeMo I 2024-03-20 18:55:23 helpers:81] Model itn -- Device cuda:0
[NeMo I 2024-03-20 18:55:23 normalization_as_tagging_infer:59] Running inference on all_words.txt...
[NeMo I 2024-03-20 18:55:24 normalization_as_tagging_infer:87] Predictions saved to all_words.g2p.txt.
cannot read line:   7 4 <DELETE> <DELETE>   <DELETE> <DELETE>   PLAIN PLAIN

[NeMo W 2024-03-20 18:55:30 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(

[NeMo W 2024-03-20 18:55:31 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

2024-03-20 18:55:31.574381: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-20 18:55:32.170724: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[NeMo W 2024-03-20 18:55:32 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

[NeMo I 2024-03-20 18:55:33 dataset:228] Loading dataset from manifest.json.
0it [00:00, ?it/s][NeMo W 2024-03-20 18:55:33 tts_tokenizers:158] Text: [skɐnʲ`irəvənʲɪje mɐkʲ`etə stark `ɛkspə 74 zəvʲɪrʂɨn`o, sɛr] contains unknown char: [7]. Symbol will be skipped.
[NeMo W 2024-03-20 18:55:33 tts_tokenizers:158] Text: [skɐnʲ`irəvənʲɪje mɐkʲ`etə stark `ɛkspə 74 zəvʲɪrʂɨn`o, sɛr] contains unknown char: [4]. Symbol will be skipped.
49it [00:00, 14375.11it/s]
[NeMo I 2024-03-20 18:55:33 dataset:266] Loaded dataset with 49 files.
[NeMo I 2024-03-20 18:55:33 dataset:268] Dataset contains 0.03 hours.
[NeMo I 2024-03-20 18:55:33 dataset:376] Pruned 0 files. Final dataset contains 49 files
[NeMo I 2024-03-20 18:55:33 dataset:378] Pruned 0.00 hours. Final dataset contains 0.03 hours.
Processing manifest.json:
  0%|                                                                                                                                                                                       | 0/49 [00:00<?, ?it/s]
Error executing job with overrides: ['manifest_filepath=manifest.json', 'sup_data_path=sup_data', '++dataloader_params.num_workers=1']
Traceback (most recent call last):
  File "NeMo/scripts/dataset_processing/tts/extract_sup_data.py", line 79, in main
    CFG_NAME2FUNC[cfg.name](dataloader)
  File "NeMo/scripts/dataset_processing/tts/extract_sup_data.py", line 33, in preprocess_ds_for_fastpitch_align
    for batch in tqdm(dataloader, total=len(dataloader)):
  File "/home/egor/.local/lib/python3.8/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/home/egor/.local/lib/python3.8/site-packages/torch/_utils.py", line 722, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/egor/.local/lib/python3.8/site-packages/nemo/collections/tts/data/dataset.py", line 623, in __getitem__
    mel_len = self.get_log_mel(audio).shape[2]
  File "/home/egor/.local/lib/python3.8/site-packages/nemo/collections/tts/data/dataset.py", line 504, in get_log_mel
    spec = self.get_spec(audio)
  File "/home/egor/.local/lib/python3.8/site-packages/nemo/collections/tts/data/dataset.py", line 496, in get_spec
    spec = self.stft(audio)
  File "/home/egor/.local/lib/python3.8/site-packages/nemo/collections/tts/data/dataset.py", line 309, in <lambda>
    self.stft = lambda x: torch.stft(
  File "/home/egor/.local/lib/python3.8/site-packages/torch/functional.py", line 658, in stft
    input = F.pad(input.view(extended_shape), [pad, pad], pad_mode)
  File "/home/egor/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 4495, in pad
    return torch._C._nn.pad(input, pad, mode, value)
RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (512, 512) at dimension 2 of input [1, 38068, 2]

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[NeMo W 2024-03-20 18:55:40 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

2024-03-20 18:55:40.270552: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-20 18:55:40.854651: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[NeMo W 2024-03-20 18:55:41 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

[NeMo W 2024-03-20 18:55:42 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(

[NeMo W 2024-03-20 18:55:42 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/lightning_fabric/connector.py:554: UserWarning: 16 is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
      rank_zero_warn(

Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2024-03-20 18:55:42 exp_manager:386] Experiments will be logged at experiments/FastPitch/2024-03-20_18-55-42
[NeMo I 2024-03-20 18:55:42 exp_manager:825] TensorboardLogger has been set up
[NeMo I 2024-03-20 18:55:42 dataset:228] Loading dataset from train_manifest.json.
0it [00:00, ?it/s][NeMo W 2024-03-20 18:55:42 tts_tokenizers:158] Text: [skɐnʲ`irəvənʲɪje mɐkʲ`etə stark `ɛkspə 74 zəvʲɪrʂɨn`o, sɛr] contains unknown char: [7]. Symbol will be skipped.
[NeMo W 2024-03-20 18:55:42 tts_tokenizers:158] Text: [skɐnʲ`irəvənʲɪje mɐkʲ`etə stark `ɛkspə 74 zəvʲɪrʂɨn`o, sɛr] contains unknown char: [4]. Symbol will be skipped.
49it [00:00, 14667.49it/s]
[NeMo I 2024-03-20 18:55:42 dataset:266] Loaded dataset with 49 files.
[NeMo I 2024-03-20 18:55:42 dataset:268] Dataset contains 0.03 hours.
[NeMo I 2024-03-20 18:55:42 dataset:376] Pruned 0 files. Final dataset contains 49 files
[NeMo I 2024-03-20 18:55:42 dataset:378] Pruned 0.00 hours. Final dataset contains 0.03 hours.
[NeMo I 2024-03-20 18:55:42 dataset:228] Loading dataset from val_manifest.json.
0it [00:00, ?it/s][NeMo W 2024-03-20 18:55:42 tts_tokenizers:158] Text: [skɐnʲ`irəvənʲɪje mɐkʲ`etə stark `ɛkspə 74 zəvʲɪrʂɨn`o, sɛr] contains unknown char: [7]. Symbol will be skipped.
[NeMo W 2024-03-20 18:55:42 tts_tokenizers:158] Text: [skɐnʲ`irəvənʲɪje mɐkʲ`etə stark `ɛkspə 74 zəvʲɪrʂɨn`o, sɛr] contains unknown char: [4]. Symbol will be skipped.
49it [00:00, 14612.22it/s]
[NeMo I 2024-03-20 18:55:42 dataset:266] Loaded dataset with 49 files.
[NeMo I 2024-03-20 18:55:42 dataset:268] Dataset contains 0.03 hours.
[NeMo I 2024-03-20 18:55:42 dataset:376] Pruned 0 files. Final dataset contains 49 files
[NeMo I 2024-03-20 18:55:42 dataset:378] Pruned 0.00 hours. Final dataset contains 0.03 hours.
[NeMo I 2024-03-20 18:55:42 features:289] PADDING: 1
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

You are using a CUDA device ('NVIDIA GeForce RTX 3060') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Error executing job with overrides: ['model.train_ds.dataloader_params.batch_size=16', 'model.validation_ds.dataloader_params.batch_size=16', 'train_dataset=train_manifest.json', 'validation_datasets=val_manifest.json', 'sup_data_path=sup_data', 'exp_manager.exp_dir=experiments', 'trainer.devices=1', 'trainer.max_epochs=2000', 'trainer.check_val_every_n_epoch=50', 'pitch_mean=120.88', 'pitch_std=44.0', 'exp_manager.resume_if_exists=false']
Traceback (most recent call last):
  File "NeMo/examples/tts/fastpitch.py", line 31, in main
    trainer.fit(model)
  File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in _run
    self.__setup_profiler()
  File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1069, in __setup_profiler
    self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1192, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 292, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/home/egor/.local/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/home/egor/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2411, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/egor/.local/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/home/egor/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error - please report this issue to the NCCL developers, NCCL version 2.19.3
ncclInternalError: Internal check failed.
Last error:
Attribute busid of node nic not found

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[NeMo W 2024-03-20 18:55:48 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

2024-03-20 18:55:49.010691: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-20 18:55:49.602382: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[NeMo W 2024-03-20 18:55:50 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

usage: generate_mels.py [-h] --fastpitch-model-ckpt FASTPITCH_MODEL_CKPT --input-json-manifests INPUT_JSON_MANIFESTS [INPUT_JSON_MANIFESTS ...] --output-json-manifest-root OUTPUT_JSON_MANIFEST_ROOT
                        [--num-workers NUM_WORKERS] [--cpu]
generate_mels.py: error: argument --fastpitch-model-ckpt: expected one argument
[NeMo W 2024-03-20 18:55:56 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

2024-03-20 18:55:56.874678: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-20 18:55:57.454459: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[NeMo W 2024-03-20 18:55:57 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

Primary config directory not found.
Check that the config directory '/home/egor/synthesys/NeMo/examples/tts/NeMo/examples/tts/conf/hifigan' exists and readable

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

For training I use a single RTX 3060 video card, so I changed trainer.devices=8 to trainer.devices=1 in train.sh. CUDA: 11.8, Ubuntu: 22.04.4, Python: 3.8. The audio files are definitely at the specified paths, so that cannot be the problem.
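A side note on the very first line of the output: a message of the form "line 1: #!/bin/bash: No such file or directory" commonly appears when the script starts with a UTF-8 byte-order mark, so the shell reads the shebang as a command name. That is an assumption about this particular file, not something the log confirms. A minimal sketch for checking it, with hypothetical helper names:

```python
BOM = b"\xef\xbb\xbf"  # UTF-8 byte-order mark


def has_bom(data):
    """Return True if the raw script bytes start with a UTF-8 BOM."""
    return data.startswith(BOM)


def strip_bom(data):
    """Return the script bytes without a leading BOM (unchanged if none)."""
    return data[len(BOM):] if data.startswith(BOM) else data


# Stand-in for the bytes of a script saved with a BOM by some editor.
script_bytes = BOM + b"#!/bin/bash\necho hello\n"
cleaned = strip_bom(script_bytes)
```

To apply it to a real file, one would read the script with `open(path, "rb")`, pass the bytes through `strip_bom`, and write them back.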

bene-ges commented 8 months ago

I think the main error is this one: RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (512, 512) at dimension 2 of input [1, 38068, 2]
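The shape in that message, [1, 38068, 2], looks like 38068 samples with 2 channels in the last dimension. Reflect padding requires the pad width to be smaller than the dimension being padded, and 512 (half of an assumed FFT size of 1024) is not smaller than 2. The pure-Python sketch below only illustrates that constraint; `reflect_pad` is a hypothetical stand-in, not the PyTorch implementation.

```python
def reflect_pad(samples, pad):
    """Reflect-pad a 1-D list on both sides, mirroring around the edge samples."""
    if pad >= len(samples):
        raise ValueError(
            f"padding {pad} must be less than the input length {len(samples)}"
        )
    left = samples[1:pad + 1][::-1]      # mirror of the samples after the first one
    right = samples[-pad - 1:-1][::-1]   # mirror of the samples before the last one
    return left + samples + right


# Mono audio: the padded dimension has 38068 samples, so a pad of 512 is fine.
mono = list(range(38068))
padded = reflect_pad(mono, 512)

# Stereo audio laid out as [samples, channels]: the last dimension has size 2.
stereo_frame = [0.1, -0.1]
try:
    reflect_pad(stereo_frame, 512)
    stereo_ok = True
except ValueError:
    stereo_ok = False
```

With mono input the padded dimension is the sample axis, which is why converting the files to one channel removes the error.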

Can you check that all your WAV files have one channel and not more? If they are not mono, you can re-encode them.
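One way to do that check and conversion with only the Python standard library is sketched below. It assumes 16-bit PCM WAV files; the function names are hypothetical, and a tool such as sox or ffmpeg would do the same job.

```python
import array
import sys
import wave


def count_channels(path):
    """Return the number of channels stored in a WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels()


def convert_to_mono(src_path, dst_path):
    """Average all channels of a 16-bit PCM WAV file into a single channel."""
    with wave.open(src_path, "rb") as src:
        channels = src.getnchannels()
        width = src.getsampwidth()
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())
    if width != 2:
        raise ValueError("this sketch only handles 16-bit PCM")

    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    # Samples are interleaved (L, R, L, R, ...); average each frame.
    mono = array.array(
        "h",
        (sum(samples[i:i + channels]) // channels
         for i in range(0, len(samples), channels)),
    )
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(width)
        dst.setframerate(rate)
        dst.writeframes(mono.tobytes())
```

Calling `count_channels` on every file listed in marks.txt shows which ones need `convert_to_mono`.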

Vubni commented 8 months ago

@bene-ges This helped me get rid of that error, but some errors remain. Following advice I found online, I changed the value of 'num_workers', but it didn't help.

./train1.sh: line 1: #!/bin/bash: No such file or directory
./train1.sh: line 2: conda: command not found
fatal: destination path 'ru_g2p_ipa_bert_large' already exists and is not an empty directory.
[NeMo W 2024-03-21 12:53:09 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

2024-03-21 12:53:09.983990: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-21 12:53:10.555393: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[NeMo W 2024-03-21 12:53:11 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

[NeMo W 2024-03-21 12:53:12 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
[NeMo I 2024-03-21 12:53:12 helpers:60] Restoring pretrained itn model from ru_g2p_ipa_bert_large/ru_g2p.nemo
[NeMo I 2024-03-21 12:53:13 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: DeepPavlov/rubert-base-cased, vocab_file: /tmp/tmpkb0i_ffd/c09b2638681e4862bdffa78433689e48_vocab.txt, merges_files: None, special_tokens_dict: {}, and use_fast: False
Using eos_token, but it is not set yet.
Using bos_token, but it is not set yet.
[NeMo W 2024-03-21 12:53:14 modelPT:251] You tried to register an artifact under config key=tokenizer.vocab_file but an artifact for it has already been registered.
[NeMo W 2024-03-21 12:53:14 nlp_overrides:454] Apex was not found. Please see the NeMo README for installation instructions: https://github.com/NVIDIA/apex
    Megatron-based models require Apex to function correctly.
[NeMo W 2024-03-21 12:53:14 lm_utils:91] DeepPavlov/rubert-base-cased is not in get_pretrained_lm_models_list(include_external=False), will be using AutoModel from HuggingFace.
Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[NeMo W 2024-03-21 12:53:17 modelPT:251] You tried to register an artifact under config key=language_model.config_file but an artifact for it has already been registered.
[NeMo I 2024-03-21 12:53:17 save_restore_connector:249] Model ThutmoseTaggerModel was successfully restored from /home/egor/synthesys/ru_g2p_ipa_bert_large/ru_g2p.nemo.
[NeMo I 2024-03-21 12:53:17 helpers:81] Model itn -- Device cuda:0
[NeMo I 2024-03-21 12:53:17 normalization_as_tagging_infer:59] Running inference on all_words.txt...
[NeMo I 2024-03-21 12:53:18 normalization_as_tagging_infer:87] Predictions saved to all_words.g2p.txt.
[NeMo W 2024-03-21 12:53:24 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(

[NeMo W 2024-03-21 12:53:26 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

2024-03-21 12:53:26.162592: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-21 12:53:26.737414: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[NeMo W 2024-03-21 12:53:27 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

[NeMo I 2024-03-21 12:53:27 dataset:228] Loading dataset from manifest.json.
49it [00:00, 15659.93it/s]
[NeMo I 2024-03-21 12:53:27 dataset:266] Loaded dataset with 49 files.
[NeMo I 2024-03-21 12:53:27 dataset:268] Dataset contains 0.03 hours.
[NeMo I 2024-03-21 12:53:27 dataset:376] Pruned 0 files. Final dataset contains 49 files
[NeMo I 2024-03-21 12:53:27 dataset:378] Pruned 0.00 hours. Final dataset contains 0.03 hours.
Processing manifest.json:
  0%|                                                    | 0/49 [00:00<?, ?it/s]ERROR: Unexpected segmentation fault encountered in worker.
  0%|                                                    | 0/49 [00:01<?, ?it/s]
Error executing job with overrides: ['manifest_filepath=manifest.json', 'sup_data_path=sup_data', '++dataloader_params.num_workers=1']
Traceback (most recent call last):
  File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 5383) is killed by signal: Segmentation fault. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "NeMo/scripts/dataset_processing/tts/extract_sup_data.py", line 79, in main
    CFG_NAME2FUNC[cfg.name](dataloader)
  File "NeMo/scripts/dataset_processing/tts/extract_sup_data.py", line 33, in preprocess_ds_for_fastpitch_align
    for batch in tqdm(dataloader, total=len(dataloader)):
  File "/home/egor/.local/lib/python3.8/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
    success, data = self._try_get_data()
  File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1146, in _try_get_data
    raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 5383) exited unexpectedly

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[NeMo W 2024-03-21 12:53:34 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

2024-03-21 12:53:34.859368: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-21 12:53:35.435106: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[NeMo W 2024-03-21 12:53:35 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

[NeMo W 2024-03-21 12:53:36 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(

[NeMo W 2024-03-21 12:53:36 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/lightning_fabric/connector.py:554: UserWarning: 16 is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
      rank_zero_warn(

Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2024-03-21 12:53:36 exp_manager:386] Experiments will be logged at experiments/FastPitch/2024-03-21_12-53-36
[NeMo I 2024-03-21 12:53:36 exp_manager:825] TensorboardLogger has been set up
[NeMo I 2024-03-21 12:53:36 dataset:228] Loading dataset from train_manifest.json.
49it [00:00, 15262.21it/s]
[NeMo I 2024-03-21 12:53:36 dataset:266] Loaded dataset with 49 files.
[NeMo I 2024-03-21 12:53:36 dataset:268] Dataset contains 0.03 hours.
[NeMo I 2024-03-21 12:53:36 dataset:376] Pruned 0 files. Final dataset contains 49 files
[NeMo I 2024-03-21 12:53:36 dataset:378] Pruned 0.00 hours. Final dataset contains 0.03 hours.
[NeMo I 2024-03-21 12:53:37 dataset:228] Loading dataset from val_manifest.json.
49it [00:00, 15091.86it/s]
[NeMo I 2024-03-21 12:53:37 dataset:266] Loaded dataset with 49 files.
[NeMo I 2024-03-21 12:53:37 dataset:268] Dataset contains 0.03 hours.
[NeMo I 2024-03-21 12:53:37 dataset:376] Pruned 0 files. Final dataset contains 49 files
[NeMo I 2024-03-21 12:53:37 dataset:378] Pruned 0.00 hours. Final dataset contains 0.03 hours.
[NeMo I 2024-03-21 12:53:37 features:289] PADDING: 1
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

You are using a CUDA device ('NVIDIA GeForce RTX 3060') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Error executing job with overrides: ['model.train_ds.dataloader_params.batch_size=16', 'model.validation_ds.dataloader_params.batch_size=16', 'train_dataset=train_manifest.json', 'validation_datasets=val_manifest.json', 'sup_data_path=sup_data', 'exp_manager.exp_dir=experiments', 'trainer.devices=1', 'trainer.max_epochs=2000', 'trainer.check_val_every_n_epoch=50', 'pitch_mean=120.88', 'pitch_std=44.0', 'exp_manager.resume_if_exists=false']
Traceback (most recent call last):
  File "NeMo/examples/tts/fastpitch.py", line 31, in main
    trainer.fit(model)
  File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in _run
    self.__setup_profiler()
  File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1069, in __setup_profiler
    self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1192, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 292, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/home/egor/.local/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/home/egor/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2411, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/egor/.local/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/home/egor/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error - please report this issue to the NCCL developers, NCCL version 2.19.3
ncclInternalError: Internal check failed.
Last error:
Attribute busid of node nic not found

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[NeMo W 2024-03-21 12:53:43 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

2024-03-21 12:53:43.511446: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-21 12:53:44.075721: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[NeMo W 2024-03-21 12:53:44 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

usage: generate_mels.py [-h] --fastpitch-model-ckpt FASTPITCH_MODEL_CKPT
                        --input-json-manifests INPUT_JSON_MANIFESTS
                        [INPUT_JSON_MANIFESTS ...] --output-json-manifest-root
                        OUTPUT_JSON_MANIFEST_ROOT [--num-workers NUM_WORKERS]
                        [--cpu]
generate_mels.py: error: argument --fastpitch-model-ckpt: expected one argument
[NeMo W 2024-03-21 12:53:51 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

2024-03-21 12:53:51.174198: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-21 12:53:51.737699: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[NeMo W 2024-03-21 12:53:52 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
      torch.utils._pytree._register_pytree_node(

Primary config directory not found.
Check that the config directory '/home/egor/synthesys/NeMo/examples/tts/NeMo/examples/tts/conf/hifigan' exists and readable

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
bene-ges commented 8 months ago

@Vubni I once had a similar bug; it was connected with librosa version 0.10.1 (I don't know the reason). Try librosa==0.10.0.

You can also comment out the parts of the script that have already worked. Right now the error is at the extract_sup_data.py step.
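
For anyone hitting the same thing, here is a minimal sketch for checking which librosa version is installed before rerunning the script. The helper names (`librosa_advice`, `installed_librosa_version`) are made up for illustration and are not part of NeMo or librosa. The only facts it encodes are the two version numbers mentioned in this thread; why 0.10.1 misbehaves is unknown.

```python
from importlib import metadata

# Versions taken from this thread; the underlying cause is not known.
REPORTED_BAD = "0.10.1"
SUGGESTED = "0.10.0"


def librosa_advice(installed):
    """Return a short hint for a librosa version string (None = not installed).

    Hypothetical helper for illustration only.
    """
    if installed is None:
        return f"librosa is not installed; try: pip install librosa=={SUGGESTED}"
    if installed == REPORTED_BAD:
        return f"librosa {installed} found; try: pip install librosa=={SUGGESTED}"
    return f"librosa {installed} found; no change suggested by this thread"


def installed_librosa_version():
    """Look up the installed librosa version without importing librosa itself."""
    try:
        return metadata.version("librosa")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(librosa_advice(installed_librosa_version()))
```

Run it with the same Python interpreter that the training script uses (the logs above show packages under `~/.local/lib/python3.8`), otherwise it may report a different environment's version.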

Vubni commented 8 months ago

Everything is fine, thanks!