BUTSpeechFIT / DiaPer

training problem #8

Open leo67867 opened 2 weeks ago

leo67867 commented 2 weeks ago

Hello, I am processing and training the AISHELL-4 dataset using the command:

```
python diaper/train.py -c DiaPer/models/10attractors/SC_LibriSpeech_2spk_adapted1-10_finetuneAISHELL4mix/train.yaml
```

I modified `init_epochs: 0` and `init_model_path: ''`, and split the AISHELL-4 dataset into a training set and a validation set with an 8:2 ratio. After training, I tested my model on the test set, but the results were different from yours: for L_R003S01C02 I got a DER of 72.57 whereas yours was 47.37, and for M_R003S01C01 I got 65.53 whereas yours was 34.28. Also, I got my best results around the 10th epoch, whereas you got yours around the 190th-200th epoch.

Could you please share the specific details of your data splitting or processing methods? Do you have any suggestions on what might be going wrong with my approach and how I can improve it? Thank you.
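For illustration, one way such an 8:2 recording-level split could be produced is sketched below; the list file names, the format (one recording ID per line), and the use of shuffling are all assumptions, not a description of what was actually done.

```python
# Hypothetical sketch of an 8:2 recording-level split of AISHELL-4.
# "all_recordings.list" (one recording ID per line) is an assumed format.
import random

with open("all_recordings.list") as f:
    recs = [line.strip() for line in f if line.strip()]

random.seed(0)          # fixed seed so the split is reproducible
random.shuffle(recs)
cut = int(0.8 * len(recs))

for name, subset in [("train.list", recs[:cut]), ("valid.list", recs[cut:])]:
    with open(name, "w") as f:
        f.write("\n".join(subset) + "\n")
```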

fnlandini commented 2 weeks ago

Hi @leo67867 I guess you are choosing the epoch based on the loss on the development set, correct? I am attaching the tensorboard.zip logs. You can see that in my case the dev loss also reached its lowest point in just a few epochs. However, if you look at the dev_DER, you will see that it can still improve further. I observed this in a few of the fine-tuning steps with different sets: the attractor existence loss eventually grows, but the BCE activation loss keeps improving on the dev set.

Have you tried evaluating your model at epoch 200? I expect you will get results more similar to mine. You can also try using the whole train set (instead of 80% of it) and run for 200 epochs as I did. Since I used the test set as validation, I did not choose the number of epochs very carefully and just picked 200 as a reasonable guess.

My other question is whether you are mixing the channels of AISHELL-4 to obtain the waveforms. That is what I did; I did not try using a single channel, for example, so I am not sure whether using a single channel might cause a large degradation.
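For reference, mixing the channels down to one waveform can be as simple as the sketch below; `soundfile` and averaging (rather than summing or beamforming) are assumptions here, not necessarily the exact procedure used for the published numbers.

```python
# Minimal sketch: mix an 8-channel AISHELL-4 recording down to mono
# by averaging the channels. Paths are placeholders.
import soundfile as sf

def mix_to_mono(in_wav, out_wav):
    data, sr = sf.read(in_wav)    # data has shape (num_samples, num_channels)
    if data.ndim > 1:
        data = data.mean(axis=1)  # average across channels
    sf.write(out_wav, data, sr)

mix_to_mono("20200616_M_R001S01C01.wav", "20200616_M_R001S01C01_mono.wav")
```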

I hope this helps. Federico

leo67867 commented 3 days ago

Hello, thank you for your response. I followed your advice and used the entire training set, running for 200 epochs and using the test set for validation. The results were slightly better than when I used 80% of the training set, but the performance is still significantly worse than yours. As you suggested, I converted the AISHELL-4 audio to mono-channel input, because I encountered an error when trying to use the original 8-channel audio that prevented the code from running. The error details are as follows:

```
$ python DiaPer/diaper/train.py -c DiaPer/models/10attractors/SC_LibriSpeech_2spk_adapted1-10_finetuneAISHELL4mix/train.yaml
pre_crossattention 66048
latent_attractors 16384
encoder_attractors 1788672
latents2attractors 1280
counter 129
frame_encoder 2465920
Total trainable parameters: 4338433
miniconda3/envs/DiaPer/lib/python3.7/site-packages/librosa/util/decorators.py:88: UserWarning: n_fft=512 is too small for input signal of length=8
  return f(*args, **kwargs)
Warning: ('20200616_M_R001S01C01', 0, 1500) is empty: (0, 257, 240000)
Traceback (most recent call last):
  File "DiaPer/diaper/train.py", line 521, in <module>
    train_loader, dev_loader = get_training_dataloaders(args)
  File "DiaPer/diaper/train.py", line 284, in get_training_dataloaders
    Ytrain, _, _, _, _, _ = train_set.__getitem__(0)
  File "DiaPer/diaper/common_utils/diarization_dataset.py", line 131, in __getitem__
    raise ValueError(f"Encountered an empty sequence at index {i}, and no saved sequence is available.")
ValueError: Encountered an empty sequence at index 0, and no saved sequence is available.
```
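As a quick sanity check before training, I can verify that each wav is the mono, non-trivially-short signal the loader expects (a hypothetical helper, not part of DiaPer; the librosa warning above suggests an almost-empty signal reached the STFT):

```python
# Hypothetical sanity check: confirm each training wav is mono and not
# degenerately short before feature extraction. Path is a placeholder.
import soundfile as sf

def check_wav(path, min_seconds=1.0):
    info = sf.info(path)
    assert info.channels == 1, f"{path}: expected mono, got {info.channels} channels"
    assert info.frames / info.samplerate >= min_seconds, f"{path}: signal too short"

check_wav("wavs/20200616_M_R001S01C01.wav")
```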

Is the code able to handle multi-channel audio data? I am not sure where the issue might be or how I should proceed, and I would greatly appreciate any guidance or suggestions. Thank you very much for your assistance!

fnlandini commented 3 days ago

Hi @leo67867 Unfortunately, the code does not support multi-channel input. If you mixed the channels to obtain a mono file, then that is the same as what I did.

My suggestion is to compare the tensorboard I shared with the one from your training; perhaps that will give a hint of what could be different. Otherwise, I am sorry, but I am not sure what else could differ. In case it is useful, I am also attaching the data folders for train and test so you can check for any difference.
AISHELL4_data.tar.gz
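If it helps with the comparison, the scalar curves in the shared event logs can also be read programmatically; the sketch below assumes the tag is named `dev_DER` and uses placeholder log directories.

```python
# Sketch: read a scalar curve (e.g., dev_DER) from two TensorBoard event
# directories to compare trainings step by step. Paths are placeholders.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def load_scalar(logdir, tag="dev_DER"):
    acc = EventAccumulator(logdir)
    acc.Reload()  # parse the event files in logdir
    return {e.step: e.value for e in acc.Scalars(tag)}

mine = load_scalar("runs/my_finetune")
ref = load_scalar("runs/shared_tensorboard")
for step in sorted(mine.keys() & ref.keys()):
    print(f"step {step}: mine {mine[step]:.2f} vs shared {ref[step]:.2f}")
```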