ValueError: Matrix Contains Invalid Numeric Entries During Model Finetuning

lucgeo commented 3 months ago

Hi,

I encountered an error while trying to finetune a given model using my own data. The error appears to be related to the linear sum assignment in the pit_loss_multispk function. I observed that during the first epoch, at a certain point, the cost_mx matrix starts to contain "nan" instead of numeric values.

Below are the steps I followed and the complete error traceback. I am using Python 3.7.12 in a conda environment, with the dependencies installed as described in readme.md.

Executed command: python3 diaper/train.py -c examples/finetune_adaptedmorespeakers_myowndata.yaml

Output:

python3 diaper/train.py -c examples/finetune_adaptedmorespeakers_myowndata.yaml
pre_crossattention 66048
latent_attractors 16384
encoder_attractors 1788672
latents2attractors 1280
counter 129
frame_encoder 2465920
Total trainable parameters:             4338433
/home/user/apps/DiaPer/diaper/common_utils/features.py:173: FutureWarning: Pass sr=16000, n_fft=512, n_mels=40 as keyword args. From version 0.10 passing these as positional arguments will result in an error
  mel_basis = librosa.filters.mel(sampling_rate, n_fft, feature_dim)
Traceback (most recent call last):
  File "diaper/train.py", line 562, in <module>
    spkids, acum_train_metrics, args)
  File "diaper/train.py", line 99, in compute_loss_and_metrics
    args
  File "/home/user/apps/DiaPer/diaper/backend/losses.py", line 283, in get_loss
    logits_padded, ts_padded, attractors_logits, n_speakers, args)
  File "/home/user/apps/DiaPer/diaper/backend/losses.py", line 126, in pit_loss_multispk
    pred_alig, ref_alig = linear_sum_assignment(cost_mx.to("cpu"))
  File "/home/user/miniconda3/envs/DiaPer-conda/lib/python3.7/site-packages/scipy/optimize/_lsap.py", line 100, in linear_sum_assignment
    return _lsap_module.calculate_assignment(cost_matrix)
ValueError: matrix contains invalid numeric entries

Any guidance or suggestions to resolve this issue would be greatly appreciated. Thank you!
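
For reference, the SciPy error itself can be reproduced outside of DiaPer. The snippet below is only a minimal sketch: `checked_assignment` is a hypothetical helper, not a function from the repository, and it is deliberately stricter than SciPy because it also rejects `+inf`. It reports where the non-finite entries are before calling `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def checked_assignment(cost):
    """Hypothetical wrapper: fail early and report where NaN/inf entries are."""
    cost = np.asarray(cost, dtype=float)
    bad = np.argwhere(~np.isfinite(cost))
    if bad.size:
        raise ValueError(f"non-finite cost entries at (row, col): {bad.tolist()}")
    return linear_sum_assignment(cost)


# A finite cost matrix is solved normally.
rows, cols = checked_assignment([[1.0, 2.0], [2.0, 1.0]])

# A single NaN is enough for SciPy to raise a ValueError like the one above.
scipy_message = None
try:
    linear_sum_assignment(np.array([[1.0, np.nan], [2.0, 1.0]]))
except ValueError as err:
    scipy_message = str(err)
```

A check like this only makes the failure more descriptive. It does not explain why the cost matrix became NaN in the first place.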

fnlandini commented 3 months ago

Hi @lucgeo, thanks for trying out the code. The NaNs are caught by the loss function because it is the first operation that fails on them, but they normally originate earlier, in the forward or backward pass. Finding out exactly where requires deeper debugging.

In the past, when developing the code, I noticed that I would get NaNs in certain extreme cases, such as too few or very many latents. I assume that you took this configuration file and replaced the fields that were not defined, so those parameters should not be a problem. Did you change any other parameters?

Also, what kind of data are you trying to use? Hours, number of files, number of speakers, sampling rate? And which model are you trying to fine-tune? I cannot guarantee that the answers to those questions will tell us exactly what is going on, but they might help.

Finally, if you look at the TensorBoard plots, does it fail immediately or after a few epochs or updates? For debugging, I recommend setting log_report_batches_num: 1. It will be slower, but it will save data for each update.
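
One way to narrow down whether the NaNs are already in the weights or only show up in the gradients is to scan the parameters after each update. This is a sketch only: `nonfinite_report` is a hypothetical helper, and the `nn.Linear` model is a stand-in for the real network, not DiaPer code.

```python
import torch
from torch import nn


def nonfinite_report(model):
    """Return names of parameters whose values or gradients contain NaN/inf."""
    bad = []
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            bad.append(f"{name} (value)")
        if p.grad is not None and not torch.isfinite(p.grad).all():
            bad.append(f"{name} (grad)")
    return bad


# Stand-in model and loss, used only to exercise the helper.
model = nn.Linear(4, 2)
loss = model(torch.randn(3, 4)).pow(2).mean()
loss.backward()
clean = nonfinite_report(model)  # nothing to report on a healthy model

# Corrupt one weight on purpose to show what a report looks like.
with torch.no_grad():
    model.weight[0, 0] = float("nan")
broken = nonfinite_report(model)
```

PyTorch also provides `torch.autograd.set_detect_anomaly(True)`, which makes the backward pass raise at the operation that produced a NaN, at the cost of slower training.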

lucgeo commented 3 months ago

Hi @fnlandini ,

Thanks for your reply! I want to fine-tune the SC_LibriSpeech_2spk_adapted1-10_finetuneAliMeetingFarmix model, so I'm using this config file where I replaced only the undefined fields, without changing the rest of the parameters. I'm using 16 kHz mono WAV files, totaling approximately 15 hours, with a maximum of 4-5 speakers per file. My train_data_dir and valid_data_dir contain the following files in Kaldi format: wav.scp, rttm, segments, spk2utt, utt2spk, and reco2dur (I consulted this script to create them).
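
A basic consistency check on such a data directory could look like the sketch below. It is not the validation DiaPer performs, and whether these particular inconsistencies can lead to NaNs is an assumption. `check_data_dir` is a hypothetical helper that relies only on the standard RTTM layout (`SPEAKER <recording> <channel> <onset> <duration> ...`) and the Kaldi `reco2dur` layout (`<recording> <duration>`).

```python
import os
import tempfile

REQUIRED = ["wav.scp", "rttm", "segments", "spk2utt", "utt2spk", "reco2dur"]


def check_data_dir(data_dir):
    """Return a list of human-readable problems found in a Kaldi-style data dir."""
    problems = [f"missing file: {name}" for name in REQUIRED
                if not os.path.isfile(os.path.join(data_dir, name))]
    if problems:
        return problems

    # reco2dur lines: <recording-id> <duration-in-seconds>
    durations = {}
    with open(os.path.join(data_dir, "reco2dur")) as f:
        for line in f:
            if line.strip():
                reco, dur = line.split()[:2]
                durations[reco] = float(dur)

    # RTTM lines: SPEAKER <recording-id> <channel> <onset> <duration> ...
    with open(os.path.join(data_dir, "rttm")) as f:
        for n, line in enumerate(f, 1):
            fields = line.split()
            if not fields or fields[0] != "SPEAKER":
                continue
            reco, onset, dur = fields[1], float(fields[3]), float(fields[4])
            if dur <= 0:
                problems.append(f"rttm:{n}: non-positive duration {dur}")
            if reco not in durations:
                problems.append(f"rttm:{n}: {reco} not in reco2dur")
            elif onset + dur > durations[reco] + 0.01:  # small tolerance
                problems.append(f"rttm:{n}: segment ends after recording end")
    return problems


# Toy directory with one RTTM segment that runs past the recording end.
with tempfile.TemporaryDirectory() as d:
    contents = {name: "" for name in REQUIRED}
    contents["reco2dur"] = "rec1 10.0\n"
    contents["rttm"] = (
        "SPEAKER rec1 1 0.00 2.50 <NA> <NA> spk1 <NA> <NA>\n"
        "SPEAKER rec1 1 9.00 3.00 <NA> <NA> spk2 <NA> <NA>\n"
    )
    for name, text in contents.items():
        with open(os.path.join(d, name), "w") as f:
            f.write(text)
    demo_problems = check_data_dir(d)
```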

The process fails immediately, and I only get this file (checkpoint_0.tar) in the output directory.

fnlandini commented 3 months ago

Hi @lucgeo, the characteristics you mention seem correct. Perhaps you already tried this, but to validate that there is no error with the data, maybe you can try to fine-tune from the checkpoints in https://github.com/BUTSpeechFIT/DiaPer/tree/main/models/10attractors/SC_LibriSpeech_2spk_adapted1-10/models using the same train.yaml you have, changing only the path to the models. If it fails as well, perhaps there is some issue with the data directory.

Also, did you try running inference on your data with some of the models in the repository? That could also help to check whether the data directory is correct.

Besides this, the fact that it fails immediately can make the debugging easier. You can try to hook the debugger right before the forward call and then check where it fails. This will take more time, but it will probably lead to a better understanding of what is going on.
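
As an alternative to stepping through the forward call by hand, forward hooks can report the first submodule whose output is not finite. The sketch below assumes a generic `torch.nn.Module`. `find_first_nonfinite`, `BadOp`, and the toy model are hypothetical and do not come from the DiaPer code.

```python
import torch
from torch import nn


def find_first_nonfinite(model, *inputs):
    """Run one forward pass; return the name of the first submodule
    whose output contains NaN/inf, or None if all outputs are finite."""
    first_bad = []
    handles = []

    def make_hook(name):
        def hook(module, args, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all() and not first_bad:
                    first_bad.append(name)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module itself
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        for h in handles:
            h.remove()  # always detach the hooks again
    return first_bad[0] if first_bad else None


class BadOp(nn.Module):
    """Toy layer that always yields NaN when fed non-negative inputs."""
    def forward(self, x):
        return torch.sqrt(-x - 1.0)


toy = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), BadOp(), nn.Linear(4, 2))
culprit = find_first_nonfinite(toy, torch.randn(3, 4))

healthy = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
no_culprit = find_first_nonfinite(healthy, torch.randn(3, 4))
```

In the toy model the reported name is the index of `BadOp` inside the `nn.Sequential`. On a real model the name would be the dotted path of the submodule, which tells you where to set a breakpoint.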

fnlandini commented 1 month ago

Closing due to inactivity. Feel free to reopen if you see fit.
