YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

How to use the model pre-trained on AudioSet to extract audio features and save them as .npy? #98

ayameyao opened this issue 1 year ago

YuanGongND commented 1 year ago

Hi there,

I think the best way is to return x at this point.

https://github.com/YuanGongND/ast/blob/9e3bd9942210680b833b08c39d09f2284ddc4d1d/src/models/ast_models.py#L184

Note: x is a sequence of patch tokens in flattened time-frequency order (i.e., neither purely time order nor purely frequency order), starting with two [cls] tokens. If you don't need the full sequence, just return after x = (x[:, 0] + x[:, 1]) / 2; otherwise, return x before x = (x[:, 0] + x[:, 1]) / 2 and apply whatever operation you want.
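
Roughly, the end of forward() with the two options looks like the sketch below (attribute names mirror ast_models.py, but treat this as an illustrative sketch rather than a drop-in patch):

    # end of ASTModel.forward() in src/models/ast_models.py (illustrative sketch)
    x = self.v.norm(x)                 # x: (batch, 2 + num_patches, embed_dim)

    # option 1: keep the full token sequence (two [cls] tokens followed by patch tokens)
    # return x

    # option 2: a single pooled embedding, the average of the two [cls] tokens
    x = (x[:, 0] + x[:, 1]) / 2        # x: (batch, embed_dim)
    return x                           # return features here instead of self.mlp_head(x)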

-Yuan

YuanGongND commented 1 year ago

To save, you can just convert x to a numpy array with x = x.detach().cpu().numpy(), and then call np.save(filename, x).
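
For example, a minimal end-to-end sketch (the constructor arguments follow the README example; the import path and the output filename are placeholders, and it assumes forward() was modified as above to return features instead of logits):

    import numpy as np
    import torch
    from src.models.ast_models import ASTModel  # adjust the import path to your setup

    input_tdim = 1024  # number of time frames in the fbank input
    # audioset_pretrain=True loads the AudioSet-pretrained checkpoint
    ast_mdl = ASTModel(label_dim=527, input_tdim=input_tdim,
                       imagenet_pretrain=True, audioset_pretrain=True)
    ast_mdl.eval()

    # dummy (batch, time, frequency) input; replace with your own fbank features
    test_input = torch.rand([1, input_tdim, 128])
    with torch.no_grad():
        x = ast_mdl(test_input)  # returns features after the modification above

    x = x.detach().cpu().numpy()
    np.save('ast_features.npy', x)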