Closed: lucgeo closed this issue 1 month ago
Hi @lucgeo
Thanks for trying out the code. The NaNs are caught by the loss function because it is the first operation that tries to use them, but they typically originate in the forward or backward pass. Knowing exactly where requires deeper debugging. In the past, while developing the code, I noticed NaNs in certain extreme cases, such as too few or very many latents.
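One way to narrow down where the NaNs first appear is to check each stage of the pipeline in turn. A minimal sketch of the idea (hypothetical helper, not part of DiaPer; numpy stands in for torch, and with PyTorch `torch.autograd.set_detect_anomaly(True)` serves a similar purpose):

```python
import numpy as np

def first_nan_stage(stages, x):
    """Run x through a list of (name, fn) stages and return the name of
    the first stage whose output contains NaN, or None if none does."""
    for name, fn in stages:
        x = fn(x)
        if np.isnan(x).any():
            return name
    return None
```

Wrapping the model's intermediate steps this way points at the first offending operation instead of waiting for the loss to crash.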
I assume you took this configuration file and replaced only the fields that were not defined, so those parameters should not be a problem. Did you change any other parameters?
Also, what kind of data are you trying to use? How many hours, how many files, how many speakers, and at what sampling rate? And which model are you trying to fine-tune?
I cannot guarantee that the answers to those questions will tell us exactly what is going on, but they might help. Also, if you look at the tensorboard plots, does it fail immediately or only after a few epochs or updates? For debugging purposes, I recommend setting `log_report_batches_num: 1`.
It will be slower, but it will save data for every update.
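For reference, this is what that would look like in the config (assuming the field sits at the top level, as in the example YAML files):

```yaml
# train.yaml excerpt -- log statistics after every batch
# (slower, but each update is recorded)
log_report_batches_num: 1
```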
Hi @fnlandini ,
Thanks for your reply! I want to fine-tune the SC_LibriSpeech_2spk_adapted1-10_finetuneAliMeetingFarmix model, so I am using this config file, where I replaced only the undefined fields without changing the rest of the parameters. My data are 16 kHz mono WAV files, totaling approximately 15 hours, with at most 4-5 speakers per file. My train_data_dir and valid_data_dir contain the following files in Kaldi format: wav.scp, rttm, segments, spk2utt, utt2spk, and reco2dur (I consulted this script to create them).
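A quick sanity check on such a data directory is to confirm that the recording IDs in wav.scp match those in the RTTM. A minimal sketch (hypothetical helper, assuming standard Kaldi formats: wav.scp lines are `<reco-id> <wav-path>` and RTTM lines are `SPEAKER <reco-id> <chan> <start> <dur> ...`):

```python
def check_reco_ids(wav_scp_lines, rttm_lines):
    """Return (IDs with audio but no labels, IDs with labels but no audio)."""
    wav_ids = {ln.split()[0] for ln in wav_scp_lines if ln.strip()}
    rttm_ids = {ln.split()[1] for ln in rttm_lines if ln.strip()}
    return sorted(wav_ids - rttm_ids), sorted(rttm_ids - wav_ids)
```

Mismatched IDs are a common source of empty or mislabeled batches, which can surface later as NaNs in training.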
The process fails immediately, and I only get this file (checkpoint_0.tar) in the output directory.
Hi @lucgeo, the characteristics you mention seem correct. Perhaps you already tried this, but to rule out an error in the data, you could try fine-tuning from the checkpoints in https://github.com/BUTSpeechFIT/DiaPer/tree/main/models/10attractors/SC_LibriSpeech_2spk_adapted1-10/models using the same train.yaml you have, only changing the path to the models. If that fails as well, there is probably some issue with the data directory. Also, did you try running inference on your data with some of the models in the repository? That could also help confirm that the data directory is correct.

Besides this, if it fails immediately, debugging should be easier. You can hook the debugger right before the forward call and then check where it fails. This will take longer but will probably lead to a better understanding of what is going on.
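The "hook right before the forward call" idea can be sketched as a wrapper that stops as soon as NaNs enter or leave a function (hypothetical helper; numpy stands in for torch here, and in the actual training loop one could instead check `torch.isnan` on the batch or enable `torch.autograd.set_detect_anomaly(True)`):

```python
import numpy as np

def nan_guard(forward_fn):
    """Wrap forward_fn so NaN inputs or outputs raise immediately,
    which is a natural spot to drop into the debugger."""
    def wrapped(*arrays):
        for i, a in enumerate(arrays):
            if np.isnan(a).any():
                raise FloatingPointError(f"NaN in input {i} to {forward_fn.__name__}")
        out = forward_fn(*arrays)
        if np.isnan(out).any():
            raise FloatingPointError(f"NaN in output of {forward_fn.__name__}")
        return out
    return wrapped
```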
Closing due to inactivity. Feel free to reopen if you see fit.
Hi,
I encountered an error while trying to fine-tune a given model using my own data. The error appears to be related to the linear sum assignment in the pit_loss_multispk function. I observed that during the first epoch, at a certain point, the cost_mx matrix starts to contain NaN instead of numeric values. Below are the steps I followed and the complete error traceback. My Python version is 3.7.12; I created a conda environment and installed the dependencies as described in the readme.md.
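The symptom is consistent with how scipy behaves: `scipy.optimize.linear_sum_assignment` rejects cost matrices containing NaN. A minimal sketch of a guard placed before the call (hypothetical, assuming cost_mx is what eventually reaches scipy), which would surface the first bad batch instead of an opaque crash:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def safe_assignment(cost_mx):
    """Fail loudly (and debuggably) if the PIT cost matrix contains NaN."""
    if np.isnan(cost_mx).any():
        raise ValueError("cost_mx contains NaN -- inspect the batch here")
    return linear_sum_assignment(cost_mx)
```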
Executed command:
python3 diaper/train.py -c examples/finetune_adaptedmorespeakers_myowndata.yaml
Output:
Any guidance or suggestions to resolve this issue would be greatly appreciated. Thank you!