BUTSpeechFIT / EEND

Train on AMI #7

Closed by DTDwind 11 months ago

DTDwind commented 1 year ago

Hi,

Thank you for sharing this great project. I have been looking for a project for EEND-EDA, and your project is very helpful.

I tried to train my model on AMI, but I encountered some problems.

When I run it directly, I get the error:

File "/mnt/HDD/HDD2/DTDwind/EEND/EEND_EDA/eend/common_utils/features.py", line 72, in get_labeledSTFT
T[rel_start:rel_end, speaker_index] = 1
IndexError: index 2 is out of bounds for axis 1 with size 2

I suspect the number of speakers is set incorrectly. But I don't understand why EEND-EDA needs the number of speakers to be set in advance; shouldn't it be obtained dynamically?

Therefore, I commented out the num_speakers in train.yaml, but I still got the error during training:

File "/mnt/HDD/HDD2/DTDwind/EEND/EEND_EDA/eend/common_utils/metrics.py", line 37, in calculate_metrics
    t_seq = torch.reshape(
RuntimeError: shape '[-1, 5]' is invalid for input of size 1324

I also tried to set the number of speakers to the maximum number of speakers in the AMI dataset, which is 5, but I still got the error:

 File "/mnt/HDD/HDD2/DTDwind/EEND/EEND_EDA/eend/common_utils/features.py", line 166, in transform
    Y = np.dot(Y ** 2, mel_basis.T)
  File "<__array_function__ internals>", line 200, in dot
ValueError: shapes (0, 129, 400000) and (129, 23) not aligned: 400000 (dim 2) != 129 (dim 0)

Has your team run any experiments on AMI? If so, could you help me solve these problems?

Thank you.

DTDwind commented 1 year ago

I found that my ES2010d.Mix-Headset.wav has two channels.

I'm not sure if this is a personal issue on my end.

When I set the number of speakers to 5 and the channels to 1, training can proceed normally.

However, I still don't understand why the EDA architecture requires setting the number of speakers in advance.

fnlandini commented 1 year ago

Hi, I'm glad you found the problem. Yes, there is one file that has two channels. You can take one of them or merge them (for example using sox).
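For anyone else hitting the same ValueError, here is a minimal downmix sketch in Python (assuming the soundfile package is installed; sox from the command line works just as well):

```python
# Minimal sketch: downmix a stereo AMI file to mono before feature extraction.
# The file name is just the one mentioned above; adjust paths as needed.
import soundfile as sf

audio, sr = sf.read("ES2010d.Mix-Headset.wav")  # stereo gives shape (n_samples, 2)
if audio.ndim == 2:
    audio = audio.mean(axis=1)  # average the two channels into one
sf.write("ES2010d.Mix-Headset.mono.wav", audio, sr)
```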

As for your question, the architecture itself does not need to set the number of speakers. But there are a couple of places where it can become handy to set a maximum number.

One is when loading the data for training. When creating the labels, it is convenient to use enough dimensions but not more, mainly to avoid using more memory than needed. You could always create labels for 20 speakers, but you'd be wasting a lot of memory. The maximum limits this to only the amount needed. It could be set automatically, but you'd need to read the data first and only then generate the matrices, which takes extra time.
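Roughly, the relevant piece looks like this (a simplified sketch in the spirit of features.py, not the exact code; names are illustrative):

```python
# The label matrix has one column per speaker, so its width must be
# decided before it is filled in.
import numpy as np

def build_labels(n_frames, segments, max_speakers):
    # segments: list of (rel_start, rel_end, speaker_index) in frames
    T = np.zeros((n_frames, max_speakers), dtype=np.int32)
    for rel_start, rel_end, speaker_index in segments:
        # With max_speakers=2 and an AMI meeting containing speaker_index=2,
        # this assignment raises exactly the IndexError from the first message.
        T[rel_start:rel_end, speaker_index] = 1
    return T
```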

The other is when running inference. You need to pass an input of a certain size to the LSTM, so setting a maximum is needed. By default it is 15: the model decodes 15 attractors and then decides which ones are valid.
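As a rough sketch of how that plays out (assuming a batch-first LSTM decoder and a linear existence classifier; the names are illustrative, not the actual ones in the repo):

```python
# Simplified EEND-EDA-style attractor selection: unroll the decoder a fixed
# number of times, then keep only the attractors whose existence probability
# passes a threshold.
import torch

def select_attractors(decoder_lstm, exist_linear, h0c0,
                      max_attractors=15, threshold=0.5):
    # Zero input sequence of length max_attractors (batch_first=True assumed).
    zeros = torch.zeros(1, max_attractors, decoder_lstm.input_size)
    attractors, _ = decoder_lstm(zeros, h0c0)          # (1, max_attractors, D)
    probs = torch.sigmoid(exist_linear(attractors))    # (1, max_attractors, 1)
    keep = probs.squeeze(-1).squeeze(0) > threshold    # which attractors "exist"
    return attractors[0, keep], probs[0, keep]
```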

DTDwind commented 1 year ago

Hi, I attempted training the model on the AMI dataset for both 100 and 500 epochs, resulting in DERs of 89.93% and 91.87% respectively. This suggests that increasing the training duration does not significantly improve performance on the AMI dataset. However, the paper at https://arxiv.org/pdf/2106.10654.pdf reports an AMI DER of 15.8%, implying that the EEND-EDA framework has the potential to achieve excellent results on the AMI dataset.

Based on the paper, I understand that the correct training procedure is to first train a general model on simulated datasets and then adapt it for 500 epochs. Is my interpretation accurate?

Nevertheless, I currently lack the requisite synthetic data, and training a general model can be time-intensive. Could you kindly provide the necessary checkpoint for this purpose?

fnlandini commented 1 year ago

Hi @DTDwind, sorry for the delay. Yes, you are right: one needs to train on synthetic data first and only then fine-tune on AMI. I have shared the checkpoints of a model trained on 2-speaker simulated conversations generated from LibriSpeech recordings at https://github.com/BUTSpeechFIT/EEND_dataprep/tree/main/v2/LibriSpeech/models. That corresponds to one of the models used in "Multi-Speaker and Wide-Band Simulated Conversations as Training Data for End-to-End Neural Diarization". However, I do not have such a model trained for more than 2 speakers. You can still try to fine-tune it on AMI, but the performance will be worse than with an adaptation step to more speakers in between. Anyway, I hope this helps.

DTDwind commented 1 year ago

Hi @fnlandini Thank you very much for your response. I will first try to fine-tune on AMI with this checkpoint, and then I may try synthesizing data with more speakers from LibriSpeech for further experiments. Your assistance has saved me a lot of training time.

DTDwind commented 1 year ago

Hi @fnlandini I tried using your model and encountered this error. Could you please help me take a look?

RuntimeError: Error(s) in loading state_dict for TransformerEDADiarization:
        size mismatch for enc.linear_in.weight: copying a param with shape torch.Size([256, 600]) from checkpoint, the shape in current model is torch.Size([256, 345]).

The paper says that 15 consecutive 23-dimensional log-scaled Mel-filterbanks (computed over 25 ms every 10 ms) are stacked to produce 345-dimensional features every 100 ms. So why does the checkpoint show torch.Size([256, 600])? I performed adaptation directly using adapt.yaml. Perhaps the training parameters of this model differ from those in the paper. Could you share the relevant training parameters?

fnlandini commented 1 year ago

Are you using parameters like those in lines 7-9 here? https://github.com/BUTSpeechFIT/EEND_dataprep/blob/main/v2/LibriSpeech/models/infer.yaml This model is 16 kHz, so the feature-related parameters are different from those of the 8 kHz models.
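For intuition, the input dimension of enc.linear_in is just the number of stacked frames times the number of Mel filterbanks. The values below are illustrative only (the 40-bin figure for the 16 kHz model is an assumption; the actual values are in the infer.yaml linked above):

```python
# input_dim = (2 * context_size + 1) * n_mels
def input_dim(context_size, n_mels):
    return (2 * context_size + 1) * n_mels

print(input_dim(7, 23))  # 345 -> the 8 kHz setup described in the paper
print(input_dim(7, 40))  # 600 -> consistent with the 16 kHz checkpoint, assuming 40 Mel bins
```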

DTDwind commented 1 year ago

Hi @fnlandini

Thank you for your detailed explanation. I have successfully resolved the previous issue, but I have encountered a new problem.

File "/mnt/HDD/HDD2/DTDwind/EEND/EEND_EDA/eend/train.py", line 270, in <module>
    epoch, model, optimizer, _ = load_checkpoint(args, latest)
  File "/mnt/HDD/HDD2/DTDwind/EEND/EEND_EDA/eend/backend/models.py", line 470, in load_checkpoint
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
  File "/home/DTDwind/.conda/envs/eend/lib/python3.9/site-packages/torch/optim/optimizer.py", line 138, in load_state_dict
    saved_groups = state_dict['param_groups']

I tried checking my checkpoint['optimizer_state_dict'] and found that the keys in your pretrained model do not match those in my checkpoint_0.tar. Specifically, they are dict_keys(['_step', 'warmup', 'model_size', '_rate']) and dict_keys(['state', 'param_groups']), respectively.

I apologize, I'm a beginner at this. At first I suspected it might be a torch version problem, but I couldn't find an appropriate version; it appears torch has always used param_groups. Therefore, I'm reaching out to you again for guidance. Thank you.

fnlandini commented 1 year ago

Hi @DTDwind It looks like the missing parameters are from the noam optimizer (see updater.py). Perhaps you are trying to train with --optimizer set to adam instead of noam?
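For context, a Noam-style scheduler wraps the base optimizer and tracks its own step and learning rate, which is why its state_dict has keys like '_step', 'warmup', 'model_size' and '_rate' instead of Adam's 'state' and 'param_groups'. A minimal sketch in that spirit (illustrative, not the exact updater.py code):

```python
class NoamScheduler:
    """Scales the learning rate following the Noam schedule from "Attention Is All You Need"."""

    def __init__(self, optimizer, model_size, warmup, factor=1.0):
        self.optimizer = optimizer
        self.model_size = model_size
        self.warmup = warmup
        self.factor = factor
        self._step = 0
        self._rate = 0.0

    def step(self):
        # Increase the step counter, recompute the rate and apply it to all
        # parameter groups before stepping the wrapped optimizer.
        self._step += 1
        self._rate = self.factor * self.model_size ** -0.5 * min(
            self._step ** -0.5, self._step * self.warmup ** -1.5)
        for group in self.optimizer.param_groups:
            group["lr"] = self._rate
        self.optimizer.step()

    def state_dict(self):
        # These are the scheduler-specific keys seen in the pretrained checkpoint.
        return {"_step": self._step, "warmup": self.warmup,
                "model_size": self.model_size, "_rate": self._rate}
```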

DTDwind commented 1 year ago

Thank you for the detailed explanation; it's running now. By the way, when inferring directly on the complete AMI dataset with the SC-LibriSpeech model, I get a DER of 53.64%.

fnlandini commented 11 months ago

Closing due to inactivity. Feel free to reopen if you see fit.