Hello, I am processing and training on the AISHELL-4 dataset using the command:
python diaper/train.py -c DiaPer/models/10attractors/SC_LibriSpeech_2spk_adapted1-10_finetuneAISHELL4mix/train.yaml
where I modified init_epochs: 0 and init_model_path: ''. I split the AISHELL-4 data into training and validation sets at an 8:2 ratio (a rough sketch of the split is at the end of this message). After training, I tested my model on the test set, but the results were quite different from yours. For example, for L_R003S01C02 I got a DER of 72.57 whereas yours was 47.37, and for M_R003S01C01 my result was 65.53 versus your 34.28. Also, I got my best results around the 10th epoch, whereas you got yours around the 190th-200th epoch.
Could you please share the specific details of your data splitting or processing methods? Do you have any suggestions on what might be going wrong with my approach and how I can improve it? Thank you.
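For reference, this is roughly how the 8:2 split was done, at the recording level (a sketch only; the recording-ID file name and the seed are placeholders, not my exact setup):

import random

# Shuffle the training recording IDs and hold out 20% for validation.
with open("train_recording_ids.txt") as f:  # placeholder list of AISHELL-4 train IDs
    recordings = sorted(f.read().split())
random.seed(0)  # placeholder seed
random.shuffle(recordings)
cut = int(0.8 * len(recordings))
train_ids, dev_ids = recordings[:cut], recordings[cut:]
print(f"{len(train_ids)} train / {len(dev_ids)} validation recordings")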
Hi @leo67867
I guess you are choosing the epoch based on the loss on the development set, correct? I am attaching the tensorboard.zip logs. You can see that in my case the dev loss also reached its lowest point in just a few epochs. However, if you look at dev_DER, you will see that it can still improve further. I observed this in a few of the fine-tuning runs with different sets: the attractor existence loss eventually grows, but the BCE activation loss keeps improving on the dev set. Have you tried evaluating your model at epoch 200? I expect you will get results more similar to mine.
You can also try using the whole train set (instead of 80% of it) and run for 200 epochs as I did. In that case, since I used the test set as validation, I did not choose the number of epochs very carefully and just picked 200 as a reasonable guess.
My other question is whether you are mixing the channels of AISHELL-4 to obtain the waveforms. That is what I did; I did not try using a single channel, for example, so I am not sure whether a single channel might cause a large degradation.
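In case it helps, this is a minimal sketch of what I mean by picking the epoch by dev_DER rather than by loss (it assumes the scalar tag is named dev_DER as in my logs; the log directory path is a placeholder):

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Find the training step with the lowest dev_DER in the TensorBoard logs.
acc = EventAccumulator("exp/tensorboard")  # placeholder log directory
acc.Reload()
events = acc.Scalars("dev_DER")  # assumes this is the scalar tag name
best = min(events, key=lambda e: e.value)
print(f"lowest dev_DER {best.value:.2f} at step {best.step}")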
I hope this helps. Federico
Hello, thank you for your response. I followed your advice: I used the entire training set, ran for 200 epochs, and used the test set for validation. The results were slightly better than with 80% of the training set, but performance is still significantly worse than yours. As you suggested, I converted the AISHELL-4 audio to mono as input data, because when I tried to use the original 8-channel AISHELL-4 audio I got an error that prevented the code from running. The error details are as follows:
python DiaPer/diaper/train.py -c DiaPer/models/10attractors/SC_LibriSpeech_2spk_adapted1-10_finetuneAISHELL4mix/train.yaml
pre_crossattention 66048
latent_attractors 16384
encoder_attractors 1788672
latents2attractors 1280
counter 129
frame_encoder 2465920
Total trainable parameters: 4338433
miniconda3/envs/DiaPer/lib/python3.7/site-packages/librosa/util/decorators.py:88: UserWarning: n_fft=512 is too small for input signal of length=8
return f(*args, **kwargs)
Warning: ('20200616_M_R001S01C01', 0, 1500) is empty: (0, 257, 240000)
Traceback (most recent call last):
File "DiaPer/diaper/train.py", line 521, in
Is the code able to handle multi-channel audio data? I'm not sure where the issue is or how I should proceed. I would greatly appreciate any guidance or suggestions you could provide. Thank you very much for your assistance!
Hi @leo67867
Unfortunately, the code does not support multi-channel input. If you mixed the channels to obtain a mono file, then that is the same as what I did.
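Incidentally, the librosa warning about n_fft=512 being too small for a signal of length=8 might indicate that the 8-channel file was read as an array of shape (num_samples, 8) and the STFT was applied along the channel axis, so each "signal" had only 8 samples. Mixing to mono avoids this. A minimal sketch of one way to do the mixing (a simple channel average with soundfile; file names are placeholders, and this is not necessarily the exact procedure I used):

import soundfile as sf

# Average the 8 channels of an AISHELL-4 recording into one mono waveform.
audio, sr = sf.read("20200616_M_R001S01C01.wav")  # shape: (num_samples, 8)
mono = audio.mean(axis=1)  # soundfile returns a NumPy array, so .mean works directly
sf.write("20200616_M_R001S01C01_mix.wav", mono, sr)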
My suggestion is that you compare the tensorboard logs I shared with those of your training. Perhaps that will give a hint of what could be different; otherwise, I am sorry but I am not sure what else it could be. In case it is useful, I am also attaching the data folders for train and test so you can check for any differences.
AISHELL4_data.tar.gz
Closing due to inactivity