RetroCirce / HTS-Audio-Transformer

The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"
https://arxiv.org/abs/2202.00874
MIT License

get bad result for esc50 #5

Closed wac81 closed 2 years ago

wac81 commented 2 years ago

I have a single GPU, so I changed some code in model.py, sed_model.py, and main.py,

and set config.py like this:

dataset_type = "esc-50"
loss_type = "clip_ce"
sample_rate = 32000
classes_num = 50

But then I only get ACC: 0.55.

To work around an init_process_group error, I changed:

deterministic = False
dist.init_process_group(
    backend="nccl",
    init_method="tcp://localhost:23456",
    rank=0,
    world_size=1,
)

With these changes the code runs, but I don't get the same results as in your paper.

RetroCirce commented 2 years ago

Hi,

Make sure you change these in config.py:

  1. When training on the ESC-50 dataset, we use the AudioSet-pretrained checkpoint, so you need to save a checkpoint trained on AudioSet (or use our released checkpoint), set its path in "resume_checkpoint", and then run the ESC-50 training.
  2. The learning rate can be 1e-4 for ESC-50 training since it is fine-tuning, but 1e-3 also works.
  3. deterministic = False/True might not affect the result much, but if you use our AudioSet checkpoint, it is better to set it to True.
  4. The ESC-50 dataset is originally sampled at 44100 Hz, not 32000 Hz. As mentioned in our readme, we provide code to resample it.
  5. dist.init_process_group(backend="nccl", init_method="tcp://localhost:23456", rank=0, world_size=1) -> we do not have this code; distributed setup is handled automatically by PyTorch Lightning. Where does it come from?
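For reference, the settings above can be collected into a minimal config.py sketch. This is an illustration, not the repo's actual file: the variable names follow this thread, and the checkpoint path is a placeholder you must replace with your own.

```python
# Minimal sketch of ESC-50 fine-tuning settings for config.py.
# Variable names follow the discussion above; the checkpoint path
# is a hypothetical placeholder, not a real file in the repo.
dataset_type = "esc-50"
loss_type = "clip_ce"
sample_rate = 32000        # resample ESC-50 from its original 44100 Hz
classes_num = 50
learning_rate = 1e-4       # fine-tuning from the AudioSet checkpoint
deterministic = True       # recommended when loading the AudioSet checkpoint
resume_checkpoint = "path/to/htsat_audioset_checkpoint.ckpt"  # placeholder
```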
wac81 commented 2 years ago
  1. I only have a single GPU, so I needed to modify some code in sed_model.py and main.py. Setting deterministic = False was necessary, and so was adding dist.init_process_group(backend="nccl", init_method="tcp://localhost:23456", rank=0, world_size=1). Without these modifications the code cannot run on a single GPU. Or could you give me the correct modifications to support a single GPU?

  2. The one thing you mentioned that I failed to do is fine-tuning from the AudioSet checkpoint; this is not mentioned in the readme.

  3. Does DESED also need to be fine-tuned from the AudioSet checkpoint?

RetroCirce commented 2 years ago

Hi.

  1. You can follow these steps to run the model on a single GPU:

(1) The model can already be trained on a single GPU; the problem is that it cannot be validated or tested on a single GPU, because we did not make that available in sed_model.py.

(2) To support this, you need to understand three methods of SEDWrapper in sed_model.py: evaluate_metric, validation_epoch_end, and test_epoch_end. evaluate_metric calculates the mAP (for AudioSet) or accuracy (for ESC-50 and SCV2). validation_epoch_end and test_epoch_end are almost the same: you obtain the model outputs from step_outputs. The logic is: in multi-GPU mode you obtain model outputs from different GPUs, so you need to gather them together and evaluate them jointly, which is what "dist.gather" does; methods such as "dist.get_world_size" and "dist.get_rank" make sure the code is running in multi-GPU mode.

As you probably know, if you run in single-GPU mode you don't need to "gather" outputs from different GPUs; you can send the outputs directly into evaluate_metric to get the result. Therefore, you just need to: (1) add a condition to check whether the model is running on a single GPU (using torch.cuda.device_count() == 1), and (2) send the outputs directly into evaluate_metric (you will need to understand the output shape by tracing through test_epoch_end).

  2. We mention this in section 3.2.2 of the paper, and many previous papers generally use AudioSet-pretrained models to get their ESC-50 results.
  3. Yes, and DESED does not even need fine-tuning: we did not train the model on the DESED training set at all, because AudioSet contains all the classes needed for DESED. We just did a 527->10 class mapping, as fl_audioset_mapping in config.py.
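The 527->10 class mapping can be sketched like this. The index lists below are illustrative placeholders, not the actual fl_audioset_mapping values from config.py.

```python
# Sketch of mapping 527 AudioSet class scores onto a smaller label set,
# in the spirit of fl_audioset_mapping. The index lists are illustrative
# placeholders, not the real AudioSet class indices.
fl_audioset_mapping = [
    [0, 72, 73],   # target class 0 <- several AudioSet classes
    [137],         # target class 1 <- one AudioSet class
    [300, 301],    # target class 2 <- two AudioSet classes
]

def map_scores(audioset_scores, mapping):
    # For each target class, take the max score over its AudioSet indices.
    return [max(audioset_scores[i] for i in idxs) for idxs in mapping]

scores = [0.0] * 527
scores[72] = 0.9
scores[137] = 0.4
scores[301] = 0.7
print(map_scores(scores, fl_audioset_mapping))  # [0.9, 0.4, 0.7]
```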
wac81 commented 2 years ago

Thank you for your reply.

I have a question about my own dataset. I want to train on my own data with three emotion classes (good, bad, other). Do I have to fine-tune from the AudioSet pretrained model? I can't map 527->3.

RetroCirce commented 2 years ago

Hi

No, you don't need to map 527 to 3. Similar to ESC-50, you just replace the last fully connected mapping layer's 527 outputs with 3 outputs; the previous layers load the AudioSet-pretrained weights, and then you fine-tune.
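The head-replacement idea can be sketched on a state_dict-like mapping of parameter names to weights. The keys "head.weight"/"head.bias" are hypothetical names for the final 527-way layer, not the repo's actual checkpoint keys.

```python
import random

# Sketch of replacing a 527-way classification head with a 3-way one.
# "head.weight"/"head.bias" are hypothetical key names; in the real model
# you would match the actual final-layer keys in the checkpoint.
pretrained = {
    "backbone.weight": [[0.1] * 8 for _ in range(8)],  # kept as-is
    "head.weight": [[0.0] * 8 for _ in range(527)],    # 527-way head
    "head.bias": [0.0] * 527,
}

def adapt_head(state_dict, num_classes, feat_dim=8):
    # Keep every pretrained weight except the old head, then attach a
    # freshly initialized head with `num_classes` outputs.
    new_state = {k: v for k, v in state_dict.items()
                 if not k.startswith("head.")}
    new_state["head.weight"] = [
        [random.uniform(-0.01, 0.01) for _ in range(feat_dim)]
        for _ in range(num_classes)
    ]
    new_state["head.bias"] = [0.0] * num_classes
    return new_state

finetune_state = adapt_head(pretrained, num_classes=3)
print(len(finetune_state["head.weight"]))  # 3 output rows instead of 527
```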

Or you can train from scratch; if you have a lot of audio-mood data, I believe you could also achieve good results.

wac81 commented 2 years ago

Thanks a lot, I will try.