RetroCirce / HTS-Audio-Transformer

The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"
https://arxiv.org/abs/2202.00874
MIT License

How can I use my own dataset with this model? #29

Closed: yyssxxx closed this issue 1 year ago

yyssxxx commented 1 year ago

Hello, both ESC-50 and AudioSet consist of 5-second and 10-second clips, so no audio is longer than 10 seconds. However, each audio in my own dataset is 60 seconds long. In this case, can your model still be used on my dataset? How should I modify the code to use it on my data?

If you can answer, I would be grateful!

RetroCirce commented 1 year ago

Hi,

If your data is about 60 seconds long, HTS-AT might not be able to process it directly, because the transformer cannot take inputs longer than its maximum length. I can offer two options to handle this problem, both of which we have used before:

(1) If you want to train the model on 60-sec data, randomly slice one 10-sec clip from each 60-sec file at every training step, on the assumption that the 10-sec crop is still representative of the target label. When testing or validating, slide a 10-sec window over the 60-sec file from beginning to end (e.g., 0-10, 5-15, 10-20, 15-25, ...), feed each slice into the model, and average the per-slice outputs (slice -> vote) to get the final prediction.
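A minimal sketch of option (1) is below: a random 10-sec crop for training and a sliding-window "slice -> vote" average at inference time. The sample rate, window and hop sizes, and the `model` interface here are assumptions for illustration, not the exact values used in this repo.

```python
import torch

SAMPLE_RATE = 32000          # assumed; HTS-AT's AudioSet config uses 32 kHz
CLIP_LEN = 10 * SAMPLE_RATE  # 10-second model input
HOP_LEN = 5 * SAMPLE_RATE    # 5-second hop between inference slices

def random_crop(waveform: torch.Tensor) -> torch.Tensor:
    """Training: randomly slice one 10-sec clip from a longer waveform."""
    start = torch.randint(0, waveform.shape[-1] - CLIP_LEN + 1, (1,)).item()
    return waveform[..., start:start + CLIP_LEN]

@torch.no_grad()
def sliding_window_predict(model, waveform: torch.Tensor) -> torch.Tensor:
    """Inference: slide over the long input (0-10, 5-15, 10-20, ...) and
    average the per-slice predictions to get the final decision."""
    slices = [
        waveform[s:s + CLIP_LEN]
        for s in range(0, waveform.shape[-1] - CLIP_LEN + 1, HOP_LEN)
    ]
    # model is assumed to map a (1, num_samples) waveform to (1, num_classes)
    logits = torch.stack([model(x.unsqueeze(0)) for x in slices])
    return logits.mean(dim=0)  # slice -> vote by averaging
```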

(2) The above method is very time-consuming at test time, because you need to run inference N times per audio sample to get the answer. In our newly proposed paper (https://arxiv.org/abs/2211.06687), which is a representation-learning work rather than an audio-classification work, we use HTS-AT with a feature-fusion mechanism to handle longer inputs and trade off between performance and cost. You can probably take a look.
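To make the idea concrete, here is a rough sketch of the intuition behind option (2): fuse a globally downsampled view of the long audio with a few local 10-sec slices, so the model sees the whole 60-sec file in a single forward pass. The fusion below (a learnable weighted sum) is a deliberate simplification for illustration, not the actual mechanism from the paper; the shapes and the `NaiveFeatureFusion` module are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveFeatureFusion(nn.Module):
    """Hypothetical simplification of global/local feature fusion."""

    def __init__(self, n_local: int = 3):
        super().__init__()
        self.n_local = n_local
        self.alpha = nn.Parameter(torch.tensor(0.5))  # global/local mix weight

    def forward(self, mel_long: torch.Tensor) -> torch.Tensor:
        # mel_long: (batch, n_mels, t_long), e.g. a 60-sec mel spectrogram
        b, n_mels, t_long = mel_long.shape
        t_short = t_long // 6  # target length of one 10-sec view

        # Global view: squeeze the whole file onto the 10-sec time axis.
        global_view = F.interpolate(mel_long, size=t_short, mode="linear")

        # Local views: n_local random 10-sec slices, averaged here for brevity.
        starts = torch.randint(0, t_long - t_short + 1, (self.n_local,))
        local_view = torch.stack(
            [mel_long[..., s:s + t_short] for s in starts.tolist()]
        ).mean(dim=0)

        # Fuse into a single 10-sec-shaped input for the transformer.
        a = torch.sigmoid(self.alpha)
        return a * global_view + (1 - a) * local_view
```

This keeps inference to one forward pass per file instead of N, at the cost of compressing the global view; see the paper for the fusion mechanism actually used.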