RetroCirce / HTS-Audio-Transformer

The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"
https://arxiv.org/abs/2202.00874
MIT License

How to perform localization and generate heatmap with AudioSet #25

Closed samuelladyanov closed 1 year ago

samuelladyanov commented 1 year ago

Hello RetroCirce, I've been getting familiar with your codebase, and I am having trouble performing localization and generating heatmaps with the AudioSet dataset. I understand that to perform localization with DESED you use the following commands:

CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py test
// make sure that fl_local=True in config.py
python fl_evaluate.py
// organize and gather the localization results
fl_evaluate_f1.ipynb
// Follow the notebook to produce the results

Is there an AudioSet equivalent to this workflow? I also noticed in fl_evaluate.py an option to process AudioSet with the following code:

# audioset_process(config.heatmap_dir, config.test_file)

but this function doesn't exist anywhere in the codebase.

Thank you very much for your time.

RetroCirce commented 1 year ago

Yeah, this was a temporary function. As you might know, AudioSet released a small subset with strong localization labels last year, so I processed that data on a company server for later use, but I can no longer access it.

I think doing localization on AudioSet is different from DESED. There are two differences, and I would suggest writing your own code to handle them:

  1. If you want to train a new HTS-AT model on localization data (HTS-AT can support it, but I did not write that code), you need to extract a different output of HTS-AT (I believe it is the feature map from the second-to-last layer) and add a loss function to make it converge. This might actually become a new piece of work. One thing to keep in mind is that the interpolation and time resolution of that output may differ from the time resolution of the localization labels, so you need to find a way to align them (see the interpolation sketch after this list).
  2. If you want to evaluate the model on a localization dataset, fl_evaluate.py can serve as a starting point, but you need to revise a few things: (1) AudioSet's classes are different from DESED's; you can see that I map the 527 AudioSet classes to the 10 DESED classes. For AudioSet itself this is actually easier, since you don't need that mapping at all. (2) Somewhere in fl_evaluate.py there are fixed threshold values for deciding whether each class is active. If you read some localization papers, you will see that different classes may need different thresholds (not 0.5 for every class). These thresholds are usually obtained by inferring on the training set and quantizing the results (I used a 0.1 quantization step), and you then apply them when inferring on the evaluation data. So you might need to compute the thresholds for the AudioSet classes yourself (a threshold-sweep sketch follows this list).
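For point 1, here is a minimal sketch of aligning a frame-level output with the strong-label time resolution; the function name, tensor shapes, and frame counts are illustrative assumptions, not the actual HTS-AT API:

import torch
import torch.nn.functional as F

def align_framewise_output(framewise_logits, clip_seconds, label_resolution):
    # framewise_logits: (batch, frames, classes) -- assumed shape for illustration
    # target number of frames so that one frame corresponds to one strong-label step
    target_frames = int(round(clip_seconds / label_resolution))
    x = framewise_logits.transpose(1, 2)  # (B, C, T) for 1-D interpolation
    x = F.interpolate(x, size=target_frames, mode="linear", align_corners=False)
    return x.transpose(1, 2)              # back to (B, T', C)

# e.g. a 10 s clip with 1024 model frames aligned to 10 ms strong labels
logits = torch.randn(4, 1024, 527)
aligned = align_framewise_output(logits, clip_seconds=10.0, label_resolution=0.01)
print(aligned.shape)  # torch.Size([4, 1000, 527])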
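For point 2, a minimal sketch of estimating per-class thresholds with a 0.1-step sweep on training predictions; train_probs and train_labels are assumed arrays, not variables from the repo:

import numpy as np
from sklearn.metrics import f1_score

def estimate_class_thresholds(train_probs, train_labels):
    # train_probs, train_labels: (num_clips, num_classes) arrays (assumed inputs)
    num_classes = train_probs.shape[1]
    candidates = np.arange(0.1, 1.0, 0.1)
    thresholds = np.full(num_classes, 0.5)
    for c in range(num_classes):
        # pick the 0.1-quantized threshold that maximizes F1 for this class
        scores = [f1_score(train_labels[:, c], train_probs[:, c] >= t, zero_division=0)
                  for t in candidates]
        thresholds[c] = candidates[int(np.argmax(scores))]
    return thresholds

# apply the per-class thresholds when inferring on the evaluation data:
# eval_pred = eval_probs >= thresholds[None, :]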

Please let me know if you get further results on localization performance with HTS-AT; it is an unfinished but valuable direction for future work on HTS-AT.