Evaluation script - GitHub Issues

cromz22 commented 2 months ago

Thank you for open-sourcing this amazing work!

Do you have any plans for releasing the evaluation scripts?

I would like to reproduce the results reported in the tables in the paper, but it seems that not enough details are provided. For example:

How can we obtain discrete units from a pretrained DinoSR model? I believe the argmax call in this line would produce them, but because this forward function doesn't seem to be intended for evaluation, I'm not sure whether the arguments passed to the function can be used as they are. https://github.com/Alexander-H-Liu/dinosr/blob/5a38d5e3c11b9cd3379741e7fe7a0b68feb25036/models/dinosr.py#L630
The definition of the 5th layer seems unclear. Is it the 5th of the 12 Transformer layers, or the 5th of the top 8 layers that were used for DinoSR?

"We focused on the fifth layer of DinoSR"
What kind of forced alignment method was used? If the Montreal Forced Aligner was used, which acoustic and dictionary models were used?

"To compute these metrics, forced alignment is used to acquire the ground truth phone of each feature frame on LibriSpeech dev-clean and dev-other sets"

cantabile-kwok commented 2 weeks ago

@cromz22 Hi Shuichiro, I am also using this repo and facing the same problem. Have you managed to work out a way to obtain discrete units from the pretrained DinoSR model since posting this issue? I would be very grateful for any help : )

cromz22 commented 2 weeks ago

Hi, I have made no progress on this since posting this report. As stated above, I believe the argmax values mentioned above are the discrete units, but I can't be sure.

cantabile-kwok commented 2 weeks ago

I read through the code carefully, and I believe you are right. The discrete units should be the argmax values of negative distances between layer outputs and codebooks.
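
For anyone landing here, the reading above can be sketched as follows. This is only a toy, dependency-free illustration and not the repo's code: `squared_distance` and `assign_units` are hypothetical names, the use of squared Euclidean distance is an assumption, and in practice the frames would be the outputs of the chosen DinoSR layer and the codebook would come from the pretrained checkpoint.

```python
def squared_distance(x, c):
    # Squared Euclidean distance between two equal-length vectors
    # (the distance type is an assumption for this sketch).
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def assign_units(frames, codebook):
    """Return, for each frame, the index of the closest codeword.

    Taking the argmax of the negative distances is the same as
    picking the nearest codebook entry.
    """
    units = []
    for frame in frames:
        scores = [-squared_distance(frame, code) for code in codebook]
        units.append(max(range(len(scores)), key=scores.__getitem__))
    return units


# Toy data: 3 codewords and 4 frames, all 2-dimensional.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.4, 0.4]]

units = assign_units(frames, codebook)
print(units)
```

With real model outputs the same idea would normally be written with batched tensor operations; whether any normalization is applied before the distance computation would need to be checked against the linked forward function.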

Alexander-H-Liu / dinosr

Evaluation script #3