Open cromz22 opened 2 months ago
@cromz22 Hi Shuichiro, I am also using this repo and facing the same problem. Have you managed to work out a way to obtain discrete units from the pretrained DinoSR model since posting this issue? I would be very grateful for any help : )
Hi, I have made no progress on this since the report. As stated above, I believe the argmax values are the discrete units, but I can't be sure.
I read through the code carefully, and I believe you are right. The discrete units should be the argmax values of negative distances between layer outputs and codebooks.
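To make that reading concrete, here is a minimal sketch of the assignment step. It is not the repository's code: it uses NumPy rather than PyTorch, the name assign_units is hypothetical, and it assumes the distance is squared Euclidean and that a single codebook is applied to a single layer's frame-level outputs.

```python
import numpy as np


def assign_units(features, codebook):
    """Map each frame to the index of its nearest codeword.

    features: (T, D) array of frame-level layer outputs (assumed shape).
    codebook: (K, D) array of codewords (assumed shape).
    Returns a (T,) integer array of discrete units.
    """
    # Pairwise squared Euclidean distances, shape (T, K).
    diff = features[:, None, :] - codebook[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    # The argmax of the negative distance is the nearest codeword.
    return np.argmax(-dist, axis=-1)


# Toy data: four 2-D codewords and three frames lying close to some of them.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
features = np.array([[0.9, 0.1], [0.1, 0.8], [0.95, 1.05]])
units = assign_units(features, codebook)
print(units)  # [1 2 3]
```

Taking the argmax of the negative distance is the same as taking the argmin of the distance, so each frame ends up with the index of its closest codeword. Whether the pretrained checkpoint should be queried this way at inference time is still something only the authors can confirm.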
Thank you for open-sourcing this amazing work!
Do you have any plans to release the evaluation scripts?
I would like to reproduce the results in the tables in the paper, but some details seem to be missing. For example,
How can we obtain discrete units from the pretrained DinoSR model? I believe the argmax call in this line would produce them, but because this forward function doesn't seem to be intended for evaluation, I'm not sure whether the arguments given to the function are OK as they are. https://github.com/Alexander-H-Liu/dinosr/blob/5a38d5e3c11b9cd3379741e7fe7a0b68feb25036/models/dinosr.py#L630
The definition of the 5th layer seems unclear. Is it the 5th of the 12 Transformer layers, or the 5th of the top 8 layers that were used for DinoSR?
What kind of forced alignment method was used? If Montreal Forced Aligner was used, which acoustic/dictionary models were used?