LAION-AI / CLAP

Contrastive Language-Audio Pretraining
https://arxiv.org/abs/2211.06687
Creative Commons Zero v1.0 Universal

Reproducing FSD50K SV result #156

Open anithselva opened 4 months ago

anithselva commented 4 months ago

Hello,

I'm trying to reproduce the fine-tuning result on FSD50K.

I've tried multiple checkpoints but haven't been able to reach the 0.649 mAP reported in Table 4 of the paper.

Here are the results I've been able to obtain:

Checkpoint music_audioset_epoch_15_esc_90.14.pt: fine-tuned mAP 0.499
Checkpoint music_speech_audioset_epoch_15_esc_89.98.pt: fine-tuned mAP 0.503

I've also tried the latest checkpoints that use the HTSAT-tiny audio model, with similar results.
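For reference, this is how I'm computing the mAP numbers above: macro-averaged AP over the 200 FSD50K classes. A minimal sketch; y_true / y_score are placeholder names for the multi-hot ground-truth labels and the sigmoid outputs of the probe over the eval set, and the .npy paths are hypothetical dumps from my eval run, not files from the repo:

import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical dumps from my eval run, both of shape (num_clips, 200):
# multi-hot ground-truth labels and sigmoid outputs of the linear probe.
y_true = np.load("fsd50k_eval_labels.npy")
y_score = np.load("fsd50k_eval_scores.npy")

# Macro-averaged AP over the 200 classes, which is what I understand
# "mAP" in Table 4 to mean.
mAP = average_precision_score(y_true, y_score, average="macro")
print(f"mAP: {mAP:.3f}")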

Here is my setup, as per the finetune-fsd50k.sh script:

python -m evaluate.eval_linear_probe \
    --save-frequency 50 \
    --save-top-performance 3 \
    --save-most-recent \
    --dataset-type="webdataset" \
    --precision="fp32" \
    --warmup 0 \
    --batch-size=40 \
    --lr=1e-4 \
    --wd=0.1 \
    --epochs=100 \
    --workers=8 \
    --use-bn-sync \
    --freeze-text \
    --amodel HTSAT-base \
    --tmodel roberta \
    --report-to wandb \
    --wandb-notes "10.14-finetune-fsd50k" \
    --datasetnames "FSD50K_webdataset" \
    --datasetinfos train \
    --seed 3407 \
    --datasetpath /home/ubuntu/datasets/processed \
    --logs /home/ubuntu/CLAP/clap_logs \
    --gather-with-grad \
    --lp-loss="bce" \
    --lp-metrics="map" \
    --lp-lr=1e-4 \
    --lp-mlp \
    --class-label-path="/home/ubuntu/CLAP/class_labels/FSD50k_class_labels_indices.json" \
    --openai-model-cache-dir /home/ubuntu/CLAP/.cache \
    --pretrained="/home/ubuntu/CLAP/pretrained" \
    --data-filling "repeatpad" \
    --data-truncating "rand_trunc" \
    --optimizer "adam"
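In case it's relevant, this is the rough check I run to confirm that a checkpoint actually matches the --amodel flag before passing it to --pretrained (just a sketch; inspect_checkpoint is my own helper, and the "state_dict" / audio_branch key layout is an assumption based on the checkpoints I've loaded, not something documented in the repo):

import torch

def inspect_checkpoint(path: str) -> None:
    # Peek at a CLAP checkpoint to see which audio encoder it was
    # trained with (e.g. HTSAT-tiny vs HTSAT-base shapes differ).
    ckpt = torch.load(path, map_location="cpu")
    # The checkpoints I've looked at keep weights under "state_dict";
    # fall back to the raw dict otherwise (assumption).
    state_dict = ckpt.get("state_dict", ckpt)
    audio_keys = [k for k in state_dict if "audio_branch" in k]
    print(f"{path}: {len(audio_keys)} audio-branch tensors")
    for k in audio_keys[:5]:
        print(" ", k, tuple(state_dict[k].shape))

inspect_checkpoint("/home/ubuntu/CLAP/pretrained/music_audioset_epoch_15_esc_90.14.pt")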