YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

data preparation #15

Closed zhaoyanpeng closed 2 years ago

zhaoyanpeng commented 2 years ago

Hi Yuan, would it be worth elaborating on how to ensure that the FLAC audio files are single-channel?

YuanGongND commented 2 years ago

Hi there,

FLAC audio can be multi-channel, but we only use single-channel information (our AudioSet data is single-channel). If you have multi-channel audio, you can just use the first channel, which can be done simply with:

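import torchaudio
waveform, sr = torchaudio.load(filename)  # waveform has shape (channels, samples)
waveform = waveform[0, :]  # keep only the first channel; result is 1-D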
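
As a minimal sketch of the same first-channel selection, the helper below works on a synthetic NumPy array rather than a real file, so it runs without any audio data. The names `first_channel` and `keep_dim` are hypothetical and not part of the AST code base; the same indexing applies to the tensor returned by `torchaudio.load`.

```python
import numpy as np


def first_channel(waveform: np.ndarray, keep_dim: bool = False) -> np.ndarray:
    """Select the first channel of a (channels, samples) array.

    `first_channel` and `keep_dim` are hypothetical names used for illustration.
    """
    if waveform.ndim == 1:
        # Already a bare sample vector: treat it as single-channel.
        mono = waveform
    elif waveform.ndim == 2:
        # Naive selection: take channel 0 and ignore the others.
        mono = waveform[0, :]
    else:
        raise ValueError(f"expected 1-D or 2-D input, got {waveform.ndim}-D")
    # Optionally restore the channel axis, giving shape (1, samples).
    return mono[np.newaxis, :] if keep_dim else mono


# Synthetic stereo clip standing in for the output of an audio loader.
stereo = np.stack([np.arange(4, dtype=np.float32), -np.arange(4, dtype=np.float32)])
mono = first_channel(stereo)
print(stereo.shape, "->", mono.shape)  # (2, 4) -> (4,)
```

Passing `keep_dim=True` returns shape `(1, samples)` instead, which may be convenient if downstream code expects a channel axis.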

-Yuan

zhaoyanpeng commented 2 years ago

Thanks for the reply. That is what I am doing. I am wondering what you did to get single-channel FLAC audio.

zhaoyanpeng commented 2 years ago

... from the code, it looks like your FLAC audio is always single-channel. I am just wondering how that came about. Thanks.

YuanGongND commented 2 years ago

Yes, all the data I have is single-channel. I did not download the data myself, so I am not sure exactly how it was done. But I am quite sure that the single-channel audio was obtained with a naive method like the sample code above (i.e., no beamforming was used), so I don't think it matters much.

To use our pretrained model, I think 16 kHz single-channel audio in either .wav or .flac format should work. Normalization needs to be taken care of if the scale of your data is different.
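
The requirements above (16 kHz, single channel, comparable scale) could be checked before feeding audio to the model. The sketch below is only an illustration: `input_problems` and `EXPECTED_SR` are hypothetical names, and the scale check assumes that "scale" refers to float samples in [-1, 1], which is what `torchaudio.load` returns by default. The thread does not spell out which normalization is meant, so treat that part as an assumption.

```python
from typing import List

import numpy as np

EXPECTED_SR = 16000  # 16 kHz, as stated in the reply above


def input_problems(waveform: np.ndarray, sr: int) -> List[str]:
    """List the ways a clip deviates from 16 kHz single-channel audio.

    Hypothetical helper for illustration; not part of the AST code base.
    """
    problems = []
    if sr != EXPECTED_SR:
        problems.append(f"sample rate is {sr} Hz, expected {EXPECTED_SR} Hz")
    if waveform.ndim == 2 and waveform.shape[0] > 1:
        problems.append(f"{waveform.shape[0]} channels, expected 1")
    # Assumption: float audio is expected in [-1, 1]; a larger peak suggests
    # a different scale, for example raw int16 sample values.
    # Cast to float first so that abs(-32768) does not overflow in int16.
    peak = float(np.max(np.abs(waveform.astype(np.float64)))) if waveform.size else 0.0
    if peak > 1.0:
        problems.append(f"peak amplitude {peak:g} exceeds 1.0")
    return problems


# Raw int16 samples at 44.1 kHz: wrong sample rate and wrong scale.
raw = np.array([[0, 16384, -32768]], dtype=np.int16)
print(input_problems(raw, 44100))

# The same samples rescaled to [-1, 1] and labeled as 16 kHz pass the check.
scaled = raw.astype(np.float32) / 32768.0
print(input_problems(scaled, 16000))  # []
```

The check only flags the sample rate. It does not resample, so a flagged clip still needs to be resampled to 16 kHz separately.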