Aafiya-H opened this issue 10 months ago
Hi,
I would recommend using the GitHub implementation to evaluate the model. Also, for the Clotho dataset, each audio has 5 text captions, so the metric calculation is a bit different. Please refer to our evaluation implementation here: https://github.com/LAION-AI/CLAP/blob/main/src/laion_clap/training/train.py#L577
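To illustrate why the multi-caption setup changes the metric: for audio-to-text retrieval, a query counts as a hit if *any* of that audio's 5 ground-truth captions appears in the top k. The sketch below assumes captions for audio `i` occupy a contiguous block of columns; the linked `train.py` is the authoritative implementation.

```python
import numpy as np

def a2t_recall_at_k(sim, k, captions_per_audio=5):
    """Audio-to-text recall@k when each audio has several ground-truth captions.

    sim: (n_audio, n_audio * captions_per_audio) similarity matrix, where the
    captions for audio i are assumed to occupy columns
    [i * captions_per_audio, (i + 1) * captions_per_audio).
    Illustrative sketch only -- see the linked train.py for the actual code.
    """
    n_audio = sim.shape[0]
    hits = 0
    for i in range(n_audio):
        topk = np.argsort(-sim[i])[:k]  # indices of the k most similar texts
        gt = set(range(i * captions_per_audio, (i + 1) * captions_per_audio))
        if gt & set(topk):              # any ground-truth caption retrieved
            hits += 1
    return hits / n_audio

# Toy example: 2 audios, 5 captions each (10 candidate texts total)
sim = np.zeros((2, 10))
sim[0, 3] = 1.0   # one of audio 0's captions (cols 0-4) ranks first
sim[1, 7] = 1.0   # same for audio 1 (cols 5-9)
print(a2t_recall_at_k(sim, k=1))  # -> 1.0
```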
@Aafiya-H were you able to reproduce the results with the GitHub repo?
Hi, I tried replicating the audio-to-text retrieval results using both the PyPI library and the Hugging Face implementation, but the numbers I obtain do not match those reported in the paper.

For the Hugging Face implementation I use `ClapTextModelWithProjection` and `ClapAudioModelWithProjection`. I obtain the similarity score by computing cosine similarity and sort the retrieved texts by that score. For the PyPI library I use `get_audio_embedding_from_data` and `get_text_embedding` and follow the same procedure. The model is initialized as follows:

I am using the Clotho version 2.1 evaluation split from here and the AudioCaps val split from the Google Drive link in the repository. Could you please help me understand what the issue could be?
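For reference, the cosine-similarity ranking procedure described above can be sketched on dummy embeddings. In the real pipeline these arrays would come from `get_text_embedding` / `get_audio_embedding_from_data` (PyPI) or the `ClapTextModelWithProjection` / `ClapAudioModelWithProjection` heads (Hugging Face); the shapes and the 512-d embedding size here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(10, 512))    # 10 candidate captions (dummy)
audio_emb = rng.normal(size=(1, 512))    # 1 query audio clip (dummy)

# L2-normalise so the dot product equals cosine similarity
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
audio_emb /= np.linalg.norm(audio_emb, axis=1, keepdims=True)

sim = audio_emb @ text_emb.T             # (1, 10) cosine similarities
ranking = np.argsort(-sim[0])            # caption indices, most similar first
print(ranking[:5])                       # top-5 retrieved captions
```

One thing worth double-checking with this procedure is that both embedding sets are L2-normalised before the dot product; unnormalised embeddings give a similarity that is not cosine similarity and can reorder the ranking.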