LAION-AI / CLAP

Contrastive Language-Audio Pretraining
https://arxiv.org/abs/2211.06687
Creative Commons Zero v1.0 Universal

Issue reproducing retrieval results using API and Hugging Face implementation #140

Open Aafiya-H opened 10 months ago

Aafiya-H commented 10 months ago

Hi, I tried replicating the audio-to-text retrieval results using the PyPI library and the Hugging Face implementation; however, the numbers I obtain do not match those reported in the paper. For the Hugging Face implementation, I use ClapTextModelWithProjection and ClapAudioModelWithProjection, compute cosine similarity between the audio and text embeddings, and sort the retrieved texts by similarity score. Similarly, for the PyPI library I use get_audio_embedding_from_data and get_text_embedding and follow the same procedure. The model is initialized as follows:

```python
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=enable_fusion)
model.load_ckpt()  # loads the default pretrained checkpoint
```
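A fuller sketch of the PyPI-based retrieval loop along these lines, assuming 48 kHz mono loading and cosine similarity over L2-normalized embeddings (the file paths and captions are placeholders, not the actual evaluation data):

```python
import numpy as np
import librosa
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # downloads and loads the default pretrained checkpoint

captions = ["a dog barks in the distance", "rain falls on a tin roof"]
audio_files = ["audio_0.wav", "audio_1.wav"]  # placeholder paths

# Embed each clip separately (clips can differ in length); (1, samples) per clip
audio_embeds = np.vstack([
    model.get_audio_embedding_from_data(
        x=librosa.load(f, sr=48000)[0].reshape(1, -1), use_tensor=False)
    for f in audio_files
])
text_embeds = model.get_text_embedding(captions)

# Cosine similarity = dot product of L2-normalized embeddings
audio_embeds /= np.linalg.norm(audio_embeds, axis=1, keepdims=True)
text_embeds /= np.linalg.norm(text_embeds, axis=1, keepdims=True)
sims = audio_embeds @ text_embeds.T  # (n_audio, n_text)

# Audio-to-text retrieval: rank captions for each audio by similarity
ranking = np.argsort(-sims, axis=1)
```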

I am using the Clotho version 2.1 evaluation split from here and the AudioCaps val split from the Google Drive link in the repository. Could you please help me understand what could be the issue?
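For concreteness, a minimal sketch of the Hugging Face procedure described above (the checkpoint name, the 48 kHz sampling rate, and the placeholder audio are illustrative assumptions, not the exact setup):

```python
import numpy as np
import torch
import torch.nn.functional as F
from transformers import (ClapProcessor, ClapTextModelWithProjection,
                          ClapAudioModelWithProjection)

ckpt = "laion/clap-htsat-unfused"  # assumed checkpoint
processor = ClapProcessor.from_pretrained(ckpt)
text_model = ClapTextModelWithProjection.from_pretrained(ckpt).eval()
audio_model = ClapAudioModelWithProjection.from_pretrained(ckpt).eval()

captions = ["a dog barks in the distance", "rain falls on a tin roof"]
# Placeholder audio: two 5-second clips at 48 kHz
audios = [np.random.randn(48000 * 5).astype(np.float32) for _ in captions]

with torch.no_grad():
    t = processor(text=captions, padding=True, return_tensors="pt")
    a = processor(audios=audios, sampling_rate=48000, return_tensors="pt")
    text_embeds = F.normalize(text_model(**t).text_embeds, dim=-1)
    audio_embeds = F.normalize(audio_model(**a).audio_embeds, dim=-1)

# Audio-to-text retrieval: rank captions for each audio by cosine similarity
sims = audio_embeds @ text_embeds.T  # (n_audio, n_text)
ranking = sims.argsort(dim=-1, descending=True)
```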

lukewys commented 7 months ago

Hi,

I would recommend using this GitHub implementation to evaluate the model. Also, note that in the Clotho dataset each audio has five text captions, so the metric calculation is slightly different. Please refer to our evaluation implementation here: https://github.com/LAION-AI/CLAP/blob/main/src/laion_clap/training/train.py#L577
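To illustrate the difference, here is a hedged sketch (not the repository's exact code) of recall@k when each audio has multiple reference captions, as in Clotho: a query counts as a hit if any of its ground-truth captions lands in the top k retrieved texts.

```python
import numpy as np

def recall_at_k(sims, gt_caption_ids, k):
    """sims: (n_audio, n_text) similarity matrix.
    gt_caption_ids: entry i holds the text indices (e.g. 5 for Clotho)
    that are correct for audio i."""
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [len(set(topk[i]) & set(gt_caption_ids[i])) > 0
            for i in range(sims.shape[0])]
    return float(np.mean(hits))

# Toy usage: 2 audios, 10 captions, 5 ground-truth captions per audio
rng = np.random.default_rng(0)
sims = rng.standard_normal((2, 10))
gt = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
print(recall_at_k(sims, gt, k=5))
```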

carankt commented 1 month ago

@Aafiya-H were you able to reproduce the results with the GitHub repo?