RetroCirce / HTS-Audio-Transformer

The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"
https://arxiv.org/abs/2202.00874
MIT License

How can I use my own dataset with this model? #29

Closed: yyssxxx closed this issue 1 year ago

yyssxxx commented 1 year ago

Hello, both ESC-50 and AudioSet consist of 5-second and 10-second clips, so no audio is longer than 10 seconds. However, each audio in my own dataset is 60 seconds long. In this case, can your model still be used on my dataset? How should I modify the code to use it on my data?

If you can answer, I would be grateful!

RetroCirce commented 1 year ago

Hi,

If your data is about 60 seconds long, HTS-AT might not be able to process it directly, because the transformer cannot take inputs longer than its maximum length. I can offer two options to handle this problem, both of which we have used before:

(1) If you want to train the model on 60-sec data, randomly slice one 10-sec clip from each 60-sec file at every training step, on the assumption that the 10-sec crop is still representative of the target label. When testing or validating, slide a 10-sec window over the 60-sec file from beginning to end (e.g., 0-10, 5-15, 10-20, 15-25, ...), feed each slice into the model, and average the per-slice outputs (slice -> vote) to get the final prediction.
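A minimal sketch of option (1) is below: a random 10-sec crop for training and a sliding-window "slice -> vote" average at inference time. The sample rate, window and hop sizes, and the `model` interface here are assumptions for illustration, not the exact values used in this repo.

```python
import torch

SAMPLE_RATE = 32000          # assumed; HTS-AT's AudioSet config uses 32 kHz
CLIP_LEN = 10 * SAMPLE_RATE  # 10-second model input
HOP_LEN = 5 * SAMPLE_RATE    # 5-second hop between inference slices

def random_crop(waveform: torch.Tensor) -> torch.Tensor:
    """Training: randomly slice one 10-sec clip from a longer waveform."""
    start = torch.randint(0, waveform.shape[-1] - CLIP_LEN + 1, (1,)).item()
    return waveform[..., start:start + CLIP_LEN]

@torch.no_grad()
def sliding_window_predict(model, waveform: torch.Tensor) -> torch.Tensor:
    """Inference: slide over the long input (0-10, 5-15, 10-20, ...) and
    average the per-slice predictions to get the final decision."""
    slices = [
        waveform[s:s + CLIP_LEN]
        for s in range(0, waveform.shape[-1] - CLIP_LEN + 1, HOP_LEN)
    ]
    # model is assumed to map a (1, num_samples) waveform to (1, num_classes)
    logits = torch.stack([model(x.unsqueeze(0)) for x in slices])
    return logits.mean(dim=0)  # slice -> vote by averaging
```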

(2) The above method is very time-consuming at test time, because you need to run inference N times per audio sample to get the answer. In our newly proposed paper (https://arxiv.org/abs/2211.06687), which is a representation-learning work rather than an audio-classification work, we use HTS-AT with a feature-fusion mechanism to handle longer inputs and trade off between performance and cost. You can probably take a look.
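To make the idea concrete, here is a rough sketch of the intuition behind option (2): fuse a globally downsampled view of the long audio with a few local 10-sec slices, so the model sees the whole 60-sec file in a single forward pass. The fusion below (a learnable weighted sum) is a deliberate simplification for illustration, not the actual mechanism from the paper; the shapes and the `NaiveFeatureFusion` module are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveFeatureFusion(nn.Module):
    """Hypothetical simplification of global/local feature fusion."""

    def __init__(self, n_local: int = 3):
        super().__init__()
        self.n_local = n_local
        self.alpha = nn.Parameter(torch.tensor(0.5))  # global/local mix weight

    def forward(self, mel_long: torch.Tensor) -> torch.Tensor:
        # mel_long: (batch, n_mels, t_long), e.g. a 60-sec mel spectrogram
        b, n_mels, t_long = mel_long.shape
        t_short = t_long // 6  # target length of one 10-sec view

        # Global view: squeeze the whole file onto the 10-sec time axis.
        global_view = F.interpolate(mel_long, size=t_short, mode="linear")

        # Local views: n_local random 10-sec slices, averaged here for brevity.
        starts = torch.randint(0, t_long - t_short + 1, (self.n_local,))
        local_view = torch.stack(
            [mel_long[..., s:s + t_short] for s in starts.tolist()]
        ).mean(dim=0)

        # Fuse into a single 10-sec-shaped input for the transformer.
        a = torch.sigmoid(self.alpha)
        return a * global_view + (1 - a) * local_view
```

This keeps inference to one forward pass per file instead of N, at the cost of compressing the global view; see the paper for the fusion mechanism actually used.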