Closed Vubni closed 8 months ago
I think the main error is this one:
RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (512, 512) at dimension 2 of input [1, 38068, 2]
Can you check that all your WAV files have exactly one channel? If they are not mono, you can re-encode them.
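A minimal sketch of that check, using only the Python standard library. The helper names `channel_count` and `downmix_to_mono` are made up for illustration and are not part of NeMo, and the downmix assumes 16-bit PCM WAV files. The trailing 2 in the reported input shape [1, 38068, 2] is consistent with a two-channel file.

```python
import array
import sys
import wave


def channel_count(path):
    """Return the number of channels stored in a WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels()


def downmix_to_mono(src_path, dst_path):
    """Average all channels of a 16-bit PCM WAV into a single mono channel."""
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    if sample_width != 2:
        # Assumption of this sketch: other sample widths are not handled.
        raise ValueError("this sketch only handles 16-bit PCM")

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Samples are interleaved (L, R, L, R, ...), so average each frame.
    mono = array.array(
        "h",
        (
            sum(samples[i:i + n_channels]) // n_channels
            for i in range(0, len(samples), n_channels)
        ),
    )
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(sample_width)
        dst.setframerate(frame_rate)
        dst.writeframes(mono.tobytes())
```

Running `channel_count` over every file in the manifest shows which ones need converting. Tools such as ffmpeg or sox can do the same re-encoding, if you prefer them.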
@bene-ges This helped me get rid of that error, but some others remain. Following advice I found online, I changed the value of 'num_workers', but it didn't help.
/train1.sh: line 1: #!/bin/bash: No such file or directory
./train1.sh: line 2: conda: command not found
fatal: destination path 'ru_g2p_ipa_bert_large' already exists and is not an empty directory.
[NeMo W 2024-03-21 12:53:09 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
2024-03-21 12:53:09.983990: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-21 12:53:10.555393: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[NeMo W 2024-03-21 12:53:11 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
[NeMo W 2024-03-21 12:53:12 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
[NeMo I 2024-03-21 12:53:12 helpers:60] Restoring pretrained itn model from ru_g2p_ipa_bert_large/ru_g2p.nemo
[NeMo I 2024-03-21 12:53:13 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: DeepPavlov/rubert-base-cased, vocab_file: /tmp/tmpkb0i_ffd/c09b2638681e4862bdffa78433689e48_vocab.txt, merges_files: None, special_tokens_dict: {}, and use_fast: False
Using eos_token, but it is not set yet.
Using bos_token, but it is not set yet.
[NeMo W 2024-03-21 12:53:14 modelPT:251] You tried to register an artifact under config key=tokenizer.vocab_file but an artifact for it has already been registered.
[NeMo W 2024-03-21 12:53:14 nlp_overrides:454] Apex was not found. Please see the NeMo README for installation instructions: https://github.com/NVIDIA/apex
Megatron-based models require Apex to function correctly.
[NeMo W 2024-03-21 12:53:14 lm_utils:91] DeepPavlov/rubert-base-cased is not in get_pretrained_lm_models_list(include_external=False), will be using AutoModel from HuggingFace.
Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[NeMo W 2024-03-21 12:53:17 modelPT:251] You tried to register an artifact under config key=language_model.config_file but an artifact for it has already been registered.
[NeMo I 2024-03-21 12:53:17 save_restore_connector:249] Model ThutmoseTaggerModel was successfully restored from /home/egor/synthesys/ru_g2p_ipa_bert_large/ru_g2p.nemo.
[NeMo I 2024-03-21 12:53:17 helpers:81] Model itn -- Device cuda:0
[NeMo I 2024-03-21 12:53:17 normalization_as_tagging_infer:59] Running inference on all_words.txt...
[NeMo I 2024-03-21 12:53:18 normalization_as_tagging_infer:87] Predictions saved to all_words.g2p.txt.
[NeMo W 2024-03-21 12:53:24 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
[NeMo W 2024-03-21 12:53:26 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
2024-03-21 12:53:26.162592: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-21 12:53:26.737414: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[NeMo W 2024-03-21 12:53:27 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
[NeMo I 2024-03-21 12:53:27 dataset:228] Loading dataset from manifest.json.
49it [00:00, 15659.93it/s]
[NeMo I 2024-03-21 12:53:27 dataset:266] Loaded dataset with 49 files.
[NeMo I 2024-03-21 12:53:27 dataset:268] Dataset contains 0.03 hours.
[NeMo I 2024-03-21 12:53:27 dataset:376] Pruned 0 files. Final dataset contains 49 files
[NeMo I 2024-03-21 12:53:27 dataset:378] Pruned 0.00 hours. Final dataset contains 0.03 hours.
Processing manifest.json:
0%| | 0/49 [00:00<?, ?it/s]ERROR: Unexpected segmentation fault encountered in worker.
0%| | 0/49 [00:01<?, ?it/s]
Error executing job with overrides: ['manifest_filepath=manifest.json', 'sup_data_path=sup_data', '++dataloader_params.num_workers=1']
Traceback (most recent call last):
File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
if not self._poll(timeout):
File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
r = wait([self], timeout)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/usr/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 5383) is killed by signal: Segmentation fault.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "NeMo/scripts/dataset_processing/tts/extract_sup_data.py", line 79, in main
CFG_NAME2FUNC[cfg.name](dataloader)
File "NeMo/scripts/dataset_processing/tts/extract_sup_data.py", line 33, in preprocess_ds_for_fastpitch_align
for batch in tqdm(dataloader, total=len(dataloader)):
File "/home/egor/.local/lib/python3.8/site-packages/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
data = self._next_data()
File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
idx, data = self._get_data()
File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
success, data = self._try_get_data()
File "/home/egor/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1146, in _try_get_data
raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 5383) exited unexpectedly
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[NeMo W 2024-03-21 12:53:34 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
2024-03-21 12:53:34.859368: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-21 12:53:35.435106: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[NeMo W 2024-03-21 12:53:35 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
[NeMo W 2024-03-21 12:53:36 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
[NeMo W 2024-03-21 12:53:36 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/lightning_fabric/connector.py:554: UserWarning: 16 is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
rank_zero_warn(
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2024-03-21 12:53:36 exp_manager:386] Experiments will be logged at experiments/FastPitch/2024-03-21_12-53-36
[NeMo I 2024-03-21 12:53:36 exp_manager:825] TensorboardLogger has been set up
[NeMo I 2024-03-21 12:53:36 dataset:228] Loading dataset from train_manifest.json.
49it [00:00, 15262.21it/s]
[NeMo I 2024-03-21 12:53:36 dataset:266] Loaded dataset with 49 files.
[NeMo I 2024-03-21 12:53:36 dataset:268] Dataset contains 0.03 hours.
[NeMo I 2024-03-21 12:53:36 dataset:376] Pruned 0 files. Final dataset contains 49 files
[NeMo I 2024-03-21 12:53:36 dataset:378] Pruned 0.00 hours. Final dataset contains 0.03 hours.
[NeMo I 2024-03-21 12:53:37 dataset:228] Loading dataset from val_manifest.json.
49it [00:00, 15091.86it/s]
[NeMo I 2024-03-21 12:53:37 dataset:266] Loaded dataset with 49 files.
[NeMo I 2024-03-21 12:53:37 dataset:268] Dataset contains 0.03 hours.
[NeMo I 2024-03-21 12:53:37 dataset:376] Pruned 0 files. Final dataset contains 49 files
[NeMo I 2024-03-21 12:53:37 dataset:378] Pruned 0.00 hours. Final dataset contains 0.03 hours.
[NeMo I 2024-03-21 12:53:37 features:289] PADDING: 1
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
You are using a CUDA device ('NVIDIA GeForce RTX 3060') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Error executing job with overrides: ['model.train_ds.dataloader_params.batch_size=16', 'model.validation_ds.dataloader_params.batch_size=16', 'train_dataset=train_manifest.json', 'validation_datasets=val_manifest.json', 'sup_data_path=sup_data', 'exp_manager.exp_dir=experiments', 'trainer.devices=1', 'trainer.max_epochs=2000', 'trainer.check_val_every_n_epoch=50', 'pitch_mean=120.88', 'pitch_std=44.0', 'exp_manager.resume_if_exists=false']
Traceback (most recent call last):
File "NeMo/examples/tts/fastpitch.py", line 31, in main
trainer.fit(model)
File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
call._call_and_handle_interrupt(
File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in _run
self.__setup_profiler()
File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1069, in __setup_profiler
self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1192, in log_dir
dirpath = self.strategy.broadcast(dirpath)
File "/home/egor/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 292, in broadcast
torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
File "/home/egor/.local/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
return func(*args, **kwargs)
File "/home/egor/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2411, in broadcast_object_list
broadcast(object_sizes_tensor, src=src, group=group)
File "/home/egor/.local/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
return func(*args, **kwargs)
File "/home/egor/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error - please report this issue to the NCCL developers, NCCL version 2.19.3
ncclInternalError: Internal check failed.
Last error:
Attribute busid of node nic not found
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[NeMo W 2024-03-21 12:53:43 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
2024-03-21 12:53:43.511446: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-21 12:53:44.075721: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[NeMo W 2024-03-21 12:53:44 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
usage: generate_mels.py [-h] --fastpitch-model-ckpt FASTPITCH_MODEL_CKPT
--input-json-manifests INPUT_JSON_MANIFESTS
[INPUT_JSON_MANIFESTS ...] --output-json-manifest-root
OUTPUT_JSON_MANIFEST_ROOT [--num-workers NUM_WORKERS]
[--cpu]
generate_mels.py: error: argument --fastpitch-model-ckpt: expected one argument
[NeMo W 2024-03-21 12:53:51 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
2024-03-21 12:53:51.174198: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-21 12:53:51.737699: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[NeMo W 2024-03-21 12:53:52 nemo_logging:349] /home/egor/.local/lib/python3.8/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
Primary config directory not found.
Check that the config directory '/home/egor/synthesys/NeMo/examples/tts/NeMo/examples/tts/conf/hifigan' exists and readable
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
@Vubni I once had a similar bug. It was connected to librosa version 0.10.1 (I don't know why). Try librosa==0.10.0.
You can also comment out the parts of the script that have already worked. Right now the error is at the extract_sup_data.py step.
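To see which librosa version is actually installed before pinning it, something like the sketch below may help. It uses `importlib.metadata` from the standard library (available in Python 3.8). The helper names are hypothetical, and treating 0.10.1 as the problematic release is based only on the observation above, not on a confirmed root cause.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_suspect_librosa(version):
    """True for the release that was associated with the crash in this thread."""
    return version == "0.10.1"


if __name__ == "__main__":
    version = installed_version("librosa")
    if version is None:
        print("librosa is not installed")
    elif is_suspect_librosa(version):
        print("librosa 0.10.1 found; consider: pip install librosa==0.10.0")
    else:
        print(f"librosa {version} found")
```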
Everything is fine, thanks!
Maybe I'm just making silly mistakes. I searched for solutions to some of the errors in the output, and they all suggested changing the script, but I doubt the NeMo script is wrong, so I think the mistake is mine. I am trying to train a model on my 50 audio files (I understand that this is very little). Here is roughly what my marks.txt looks like:
And here is the output I end up with:
For training I use a single RTX 3060, so I changed trainer.devices=8 to trainer.devices=1 in train.sh. CUDA: 11.8, Ubuntu: 22.04.04, Python: 3.8. The audio files are definitely at the specified paths, so that can't be the problem.
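One way to double-check the paths is a small script like the sketch below. It assumes the manifest is a JSON-lines file whose entries carry an `audio_filepath` key, as NeMo manifests usually do. The function name `missing_audio_files` is made up for illustration. Relative paths are resolved against the current working directory, so run it from the directory where training is launched.

```python
import json
import os


def missing_audio_files(manifest_path):
    """Return audio paths listed in the manifest that do not exist on disk."""
    missing = []
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            entry = json.loads(line)
            audio_path = entry["audio_filepath"]  # assumed key name
            if not os.path.isfile(audio_path):
                missing.append(audio_path)
    return missing
```

An empty list from `missing_audio_files("manifest.json")` would mean every listed file was found.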