RetroCirce / HTS-Audio-Transformer

The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"
https://arxiv.org/abs/2202.00874
MIT License

Key to checkpoints in drive #44

Open Sreyan88 opened 1 year ago

Sreyan88 commented 1 year ago

Hello! Thank You for the awesome repo and work! I want to use the fine-tuned audioset encoder for the large variant of the model. However, I am confused from the checkpoints provided on which one to choose.

Would it be possible to provide a key to the checkpoints stored on drive?

Thank You!

RetroCirce commented 1 year ago

Hi,

The default checkpoints are in the AudioSet folder. There are six available checkpoints, and any of them achieves an mAP similar to that reported in the paper (around 0.465-0.473).

The checkpoints in the ESC and SCV2 folders are the checkpoints fine-tuned on the ESC-50 and SCV2 datasets. Their performance is reported in the paper.

The other-setting folder contains some checkpoints we added later or earlier, such as models trained on AudioSet with or without the ImageNet-pretrained checkpoint, or models trained at a different sampling rate (such as 48000 Hz). You can try them if you find them helpful; they also achieve very good performance.
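If you are unsure which variant a given checkpoint file is, one practical approach is to load it and inspect the shapes in its state dict. The sketch below is a minimal, hypothetical example: the key name `patch_embed.proj.weight` and the embedding-dim-to-variant mapping (96/128/192 for tiny/base/large, following the Swin-style convention) are assumptions for illustration, not taken from this repo, so adjust them to the actual keys you see in the checkpoint.

```python
# Hypothetical sketch: guess an HTS-AT variant from the tensor shapes in a
# checkpoint's state_dict. Key names and the dim->variant mapping are
# assumptions (Swin-style tiny/base/large), not confirmed by the repo.

def infer_variant(state_dict_shapes):
    """state_dict_shapes: dict mapping parameter name -> shape tuple."""
    # The first patch-embedding conv's output channels equal the base
    # embedding dimension, which differs across model sizes.
    dim = state_dict_shapes["patch_embed.proj.weight"][0]
    return {96: "tiny", 128: "base", 192: "large"}.get(dim, f"unknown (dim={dim})")


# With a real checkpoint you would build the shape dict via PyTorch, e.g.:
#   ckpt = torch.load("path/to/ckpt.ckpt", map_location="cpu")
#   shapes = {k: tuple(v.shape) for k, v in ckpt["state_dict"].items()}
#   print(infer_variant(shapes))
```

Note that Lightning checkpoints may prefix parameter names (e.g. with the module attribute name), so printing a few keys first is the safest way to find the right one.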

Sreyan88 commented 1 year ago

Hi @RetroCirce , Thank You for the reply. Are the models in the AudioSet folder all tiny, or are there base/large variants, and what sampling rate did you use for them?

Additionally, can you provide a bit more info (size and sampling rate) about the ckpts here: link. Thank You!

RetroCirce commented 1 year ago

Hi, I finally understand what you need (by looking at this issue and CLAP's issue lol). Here is what you want: https://drive.google.com/drive/folders/1SMQyzJvc6DwJNuhQ_WI8tlCFL5HG2vk6 I renamed one checkpoint to "tiny" to make it clearer. Sorry for the confusion; at that time we had so many checkpoints that only Yusong and I really knew what all the names meant.

In the HTS-AT checkpoint link (for this repo), the 48000 Hz tiny model is not provided, because it was trained later, when we proposed CLAP.