LAION-AI / CLAP

Contrastive Language-Audio Pretraining
https://arxiv.org/abs/2211.06687
Creative Commons Zero v1.0 Universal

Reproducing AudioCaps and Clotho results #59

Closed assafmu closed 1 year ago

assafmu commented 1 year ago

Hey! I'm trying to reproduce the results for the HTSAT-RoBERTa setup, with AudioCaps and Clotho as the training sets. Results on Clotho-valid are good and similar to the paper, but much worse on AudioCaps-valid: T2A R@k of (7, 40, 54) instead of (36, 70, 83).

I'm using HTSAT-tiny, checkpoint downloaded from the official repo.

I've also tried using the published CLAP checkpoints and adapting the eval_retrieval_freesound.sh script to run on AudioCaps and Clotho by passing the correct arguments. The script reported AudioCaps-valid T2A R@k of (8, 38, 57) instead of the (32, 68, 81) reported for this checkpoint. Results for Clotho-valid are similar to those in the paper.

I'm not sure what could be the problem, since I'm using the existing checkpoint and the AudioCaps dataset preparation code from the LAION repo. Thanks in advance!

lukewys commented 1 year ago

Hi,

For training from scratch: could you please let us know which script you are using and how you built the dataset? The problem could be in the dataset creation step (e.g. an incorrectly prepared AudioCaps dataset), or in a hyperparameter mismatch; batch size is one likely candidate, as the sketch below illustrates.
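
For intuition on the batch-size point: in contrastive training, every other item in the batch serves as a negative, so the batch size directly controls how many negatives each pair sees. Here is a minimal sketch of a CLIP-style symmetric loss (illustrative only, not our exact implementation; the real code is linked below):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, logit_scale):
    # Normalize embeddings so logits are scaled cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * audio_emb @ text_emb.t()  # (N, N) similarity matrix
    labels = torch.arange(logits.shape[0], device=logits.device)
    # A batch of N gives N - 1 in-batch negatives per positive pair,
    # which is why batch size affects retrieval quality.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```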

For evaluating the published checkpoint: the problem most likely lies in the evaluation itself.

Evaluating Clotho and AudioCaps can be tricky, as they have five captions for each audio. TL;DR: for A2T, an audio query counts as correct at R@k if the best-ranked of its five captions lands in the top k; for T2A, each caption is a separate query and you average over all of them.

We adopted the evaluation code in https://github.com/XinhaoMei/audio-text_retrieval/blob/main/tools/utils.py#L74 . Our evaluation code can be found at: https://github.com/LAION-AI/CLAP/blob/main/src/training/train.py#L578 .
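
As a minimal NumPy sketch of that protocol, assuming a precomputed similarity matrix in which the five captions of audio i sit at text indices 5*i..5*i+4 (the layout and names here are illustrative, not taken from the repo):

```python
import numpy as np

def retrieval_recall(sim, ks=(1, 5, 10)):
    """sim: (n_audio, 5 * n_audio) audio-text similarity matrix."""
    n_audio = sim.shape[0]
    # A2T: for each audio, keep the best rank among its five captions.
    a2t_ranks = []
    for i in range(n_audio):
        order = np.argsort(-sim[i])  # texts sorted by descending similarity
        ranks = [np.where(order == 5 * i + j)[0][0] for j in range(5)]
        a2t_ranks.append(min(ranks))
    # T2A: every caption is a separate query with one correct audio.
    t2a_ranks = []
    for i in range(n_audio):
        for j in range(5):
            order = np.argsort(-sim[:, 5 * i + j])
            t2a_ranks.append(np.where(order == i)[0][0])
    a2t = {f"R@{k}": float(np.mean(np.array(a2t_ranks) < k)) for k in ks}
    t2a = {f"R@{k}": float(np.mean(np.array(t2a_ranks) < k)) for k in ks}
    return a2t, t2a
```

Taking the minimum rank for A2T is what makes its numbers higher than a naive single-caption evaluation would give.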

I am sorry for the confusion. To address both issues, I am planning to upload webdataset versions of Clotho and AudioCaps, and to add a detailed description of the evaluation in the paper revision.

Please let me know if you have further questions.

Best regards, Yusong

assafmu commented 1 year ago

Thank you for the quick answer!

Evaluating existing checkpoints: I'm using the retrieval evaluation script (https://github.com/LAION-AI/CLAP/blob/main/experiment_scripts/eval_retrieval_freesound.sh), except passing --datasetnames "audiocaps" "Clotho" and replacing --remotedata with appropriate local values. This script uses your evaluation code, which has special treatment for these datasets.

I've used this script (https://github.com/LAION-AI/audio-dataset/blob/main/download_script/download_and_preprocess.sh) to download and preprocess Clotho and AudioCaps.

Since the problem appears in only one dataset, and happens both with the existing checkpoints and with the model I trained locally, I think it's clear the problem is in the dataset preparation; access to the webdataset versions of the data should resolve it. As a side note, I've also asked on the Discord server about webdataset versions of the datasets mentioned in the paper, but was informed that you (LAION) are not redistributing them. This makes reproducing and extending your work more challenging.

Training from scratch: I'm using the htsat-roberta script (https://github.com/LAION-AI/CLAP/blob/main/experiment_scripts/train-htsat-roberta.sh). I've replaced --remotedata with local values and otherwise did not change the script. I've tried both smaller and larger batch sizes than the 96 (x8 GPUs) you specified; smaller batch sizes did hurt performance, but the difference was small, and larger batch sizes did not close the gap described above.

Thanks in advance, Assaf

lukewys commented 1 year ago

Good to see all the results! We are cautious about data distribution because we don't want to violate any copyright claims. I think I will upload those two academic datasets for reproduction purposes.

Thanks for the message.

lukewys commented 1 year ago

Hi @assafmu, we uploaded the webdataset-format dataset for Clotho, please check: https://github.com/LAION-AI/CLAP/blob/main/README.md#reproducibility
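
For reference, reading a shard should look roughly like this with the webdataset library (the shard name and sample keys here are assumptions; check the uploaded tars for the actual naming and file extensions):

```python
import io
import json

import soundfile as sf
import webdataset as wds

# Hypothetical shard name; substitute an actual tar from the README link.
ds = wds.WebDataset("clotho_test_0.tar")
for sample in ds:
    # Samples are dicts keyed by file extension inside the tar.
    audio, sr = sf.read(io.BytesIO(sample["flac"]))
    meta = json.loads(sample["json"])  # caption(s) in the paired json
    print(sr, meta)
    break
```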

assafmu commented 1 year ago

Hi, thank you for the response! I tried the provided Clotho dataset with the HTSAT checkpoint, and it produced similar results on all metrics measured by the script.

However, results on Clotho were already similar to those reported in the paper; the gap I saw was on AudioCaps. I'm no expert on this, but maybe making the AudioCaps dataset (or others) available only on request to other academics could solve the copyright issue? Thanks in advance, Assaf

lukewys commented 1 year ago

Hi,

Unfortunately, as far as I know, hosting AudioCaps is extremely complicated. Because the data comes from YouTube, in theory, once a video is removed by its uploader we need to remove it from the dataset as well, and we simply do not have the resources to build such an update mechanism. As you can imagine, other datasets have their own copyright restrictions.

Also, if you would like to access the data (or arrange request-based access for other academics), you could always try contacting LAION, since our internal data is hosted there. However, I doubt that would be a straightforward process, as distributing the data might raise copyright issues.

Best, Yusong

joemzhao commented 1 year ago

Hey @assafmu, I wonder if you've found a solution? I ran into a similar problem when working on AudioCaps, and got similar results: T2A R@k of (8, 38, 57).

assafmu commented 1 year ago

I did not find a solution. AudioCaps is a difficult dataset, and since a processed version is not publicly available (due to copyright issues), it is very unlikely that those results can be reproduced, even with the same code. Trying to recreate the AudioCaps dataset and reproduce its processing was also not practical.

bdevnani3 commented 7 months ago

@lukewys Would you be able to release the sizes of the eval/test sets for the version of AudioCaps reported in the paper? It would be helpful for putting the retrieval metrics in perspective.