kayleeliyx opened 1 month ago
I am able to train the model with parameter settings like `--amodel=HTSAT-tiny`, but I don't know how it differs from HTSAT-base, or why the model here https://huggingface.co/lukewys/laion_clap/tree/main doesn't work. Also, how do I change the input dimension of the `audio_projection.0.weight` layer in the audio encoder's projection head, where the shape is expected to be [512, 2048]?
That shape is only expected if you use the PANN-14 architecture (i.e. `--amodel PANN-14`), whose audio encoder has a 2048 output dimension.
However, as far as I know, the authors haven't released any pretrained weights based on the PANN-14 architecture.
Both HTSAT-tiny and HTSAT-base will use a 768 output dimension. The former is just a smaller version of the latter. HTSAT-tiny requires significantly less GPU memory and might be able to support larger batch sizes, but personally I found its performance to be worse than HTSAT-base.
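To make the mismatch concrete, here is a small sketch of the relationship between the `--amodel` choice and the projection weight shape. The dimensions come from the discussion above (768 for the HTSAT variants, 2048 for PANN-14, projecting into a 512-d joint space); the helper functions are illustrative only, not part of the CLAP codebase:

```python
# Illustrative sketch (not part of laion_clap): why audio_projection.0.weight
# shapes differ across --amodel choices.

# Audio-encoder output dimension that feeds the projection layer
ENCODER_OUT_DIM = {
    "HTSAT-tiny": 768,
    "HTSAT-base": 768,
    "PANN-14": 2048,
}

PROJ_DIM = 512  # joint audio-text embedding size

def expected_projection_shape(amodel: str) -> tuple:
    """Expected shape of audio_projection.0.weight for a given --amodel."""
    return (PROJ_DIM, ENCODER_OUT_DIM[amodel])

def checkpoint_matches(amodel: str, weight_shape: tuple) -> bool:
    """True if a checkpoint's projection weight fits the chosen architecture."""
    return expected_projection_shape(amodel) == weight_shape

# A [512, 2048] projection only fits PANN-14; HTSAT checkpoints use [512, 768].
print(expected_projection_shape("PANN-14"))
print(checkpoint_matches("HTSAT-tiny", (512, 2048)))
```

So a checkpoint whose `audio_projection.0.weight` is [512, 2048] simply cannot be loaded into an HTSAT model; you pick the matching `--amodel` rather than changing the layer.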
These weights use the HTSAT-tiny architecture:

- 630k-audioset-best.pt
- 630k-audioset-fusion-best.pt
- 630k-best.pt
- 630k-fusion-best.pt
And these use HTSAT-base:

- music_audioset_epoch_15_esc_90.14.pt
- music_speech_audioset_epoch_15_esc_89.98.pt
- music_speech_epoch_15_esc_89.25.pt
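In other words, the `--amodel` setting has to match the architecture the checkpoint was trained with. A sketch of that mapping (file list taken from above; the lookup helper is hypothetical, not part of the laion_clap package):

```python
# Which --amodel flag each released checkpoint expects, per the lists above.
# Illustrative helper only, not part of laion_clap.
CKPT_AMODEL = {
    "630k-audioset-best.pt": "HTSAT-tiny",
    "630k-audioset-fusion-best.pt": "HTSAT-tiny",
    "630k-best.pt": "HTSAT-tiny",
    "630k-fusion-best.pt": "HTSAT-tiny",
    "music_audioset_epoch_15_esc_90.14.pt": "HTSAT-base",
    "music_speech_audioset_epoch_15_esc_89.98.pt": "HTSAT-base",
    "music_speech_epoch_15_esc_89.25.pt": "HTSAT-base",
}

def amodel_for(ckpt: str) -> str:
    """Return the --amodel value to pass when loading this checkpoint."""
    return CKPT_AMODEL[ckpt]

print(amodel_for("630k-audioset-best.pt"))
```

If I remember the repo's README correctly, the `*-fusion-*` checkpoints additionally need feature fusion enabled (`--enable-fusion` / `enable_fusion=True`).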
Hope this helps
I am having exactly the same issue as this one: https://github.com/LAION-AI/CLAP/issues/162. The loaded model should be `630k-audioset-best.pt`. I am trying to use ESC50 to fine-tune the CLAP model, and later I want to fine-tune it on my own dataset. Thanks a lot for your help!