bytedance / Make-An-Audio-2

A text-conditional diffusion probabilistic model capable of generating high-fidelity audio.
https://make-an-audio-2.github.io
MIT License

Reproduce results #3

Open MoayedHajiAli opened 3 months ago

MoayedHajiAli commented 3 months ago

Hello, I have tried generating the AudioCaps test set using the provided script and the default options, then evaluated it with audioldm_eval. I am getting an FD of 15.34, an IS of 9.58, and an FAD of 1.27, which are significantly different from the numbers reported in the paper.
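For reference, this is roughly how I am computing the metrics (a minimal sketch assuming audioldm_eval's `EvaluationHelper` interface; the directory paths are placeholders):

```python
import torch
from audioldm_eval import EvaluationHelper

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 16 kHz evaluation, matching the AudioCaps setup
evaluator = EvaluationHelper(16000, device)

# Placeholder paths: my generated clips vs. the AudioCaps test references
metrics = evaluator.main(
    "./generated_audiocaps",  # directory of generated .wav files
    "./audiocaps_test_gt",    # directory of ground-truth .wav files
)
print(metrics)  # reports FD, IS, FAD, KL, ...
```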

I have noticed that you generate 4500 audios instead of the 900 in the test set (i.e., one audio for each of the five provided ground-truth captions). Do I have to use anything else as the target dataset, or are the default 900 audios enough?

On another note, I noticed that the results reported in the paper differ from those reported in AudioLDM-2 (which its authors claim were provided by you). I am wondering what the difference between the two evaluation protocols is, as it was not mentioned in the paper.

Thank you for your help!

Darius-H commented 3 months ago


The results reported in the paper are from a model that did not use caption dropping to improve classifier-free guidance performance. Later, we found that dropping the caption with probability 0.1 during training significantly improves the results (this is the currently published checkpoint), but we have not updated the paper. The results reported in AudioLDM-2 are also from the model before training with caption dropping.
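For concreteness, the caption drop amounts to something like the following in the training data pipeline (a minimal sketch, not the exact code in this repository):

```python
import random

CAPTION_DROP_PROB = 0.1  # probability used for the published checkpoint

def maybe_drop_caption(caption: str) -> str:
    """With probability 0.1, replace the caption with the empty string,
    so the model also learns the unconditional distribution that
    classifier-free guidance interpolates against at sampling time."""
    if random.random() < CAPTION_DROP_PROB:
        return ""
    return caption
```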

There is a slight difference between evaluating 900 generated audios and 4500. In our experiments, the effect on FAD is within 0.2, and we followed the Make-An-Audio protocol of evaluating on 4500 audios.
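In other words, the 4500 clips come from pairing each of the 900 test clips with all five of its ground-truth captions (an illustrative sketch; the data structure is hypothetical):

```python
# Illustrative structure: each of the 900 AudioCaps test clips has
# five ground-truth captions.
test_set = {
    "clip_0001": ["caption 1", "caption 2", "caption 3", "caption 4", "caption 5"],
    # ... 899 more clips
}

# One generation per (clip, caption) pair -> 900 * 5 = 4500 prompts.
prompts = [
    (clip_id, caption)
    for clip_id, captions in test_set.items()
    for caption in captions
]
```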

MoayedHajiAli commented 2 months ago

Hello @Darius-H, thank you very much for your replies to all of my issues. I really appreciate your support.

1- I missed what exactly the difference is between the results reported in your paper (before the caption drop) and those in the AudioLDM-2 paper (also before the caption drop): your paper reports FAD 1.80, while the AudioLDM-2 paper reports FAD 2.05.

2- Thank you for explaining the differences between the reported results and the released checkpoint; it is very interesting. With the available checkpoint, I got the following numbers on 900 audios: FD 15.34, IS 9.58, FAD 1.27, which we included in our recent preprint. It seems the checkpoint trained with caption drop has a much better FAD but worse FD and IS. I will mention these differences in the next update of our paper, but if you think I made a mistake when computing the numbers, please let me know.

Thank you again for your great help!