I have a 5-minute audio file. The wav2vec features I get by running inference on the whole file do not match the features I get when I first crop the audio into 10 s segments and run inference on each segment. Could it be that inference on long audio gives less accurate results? If so, what segment length should I crop to for the best results?
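To make the comparison concrete, here is a minimal sketch of how many feature frames each approach would produce. It assumes the model is wav2vec 2.0 with its standard convolutional front end (seven layers with the kernel sizes and strides listed below) and 16 kHz input; the helper names `num_frames` and `segment_bounds` are hypothetical, and nothing here loads the actual model.

```python
# Assumption: wav2vec 2.0 feature encoder, (kernel, stride) per conv layer.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
SAMPLE_RATE = 16_000  # assumed input sample rate in Hz


def num_frames(num_samples: int) -> int:
    """Number of feature frames the conv stack yields for a waveform."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        if length < kernel:
            return 0  # input too short to pass through this layer
        length = (length - kernel) // stride + 1
    return length


def segment_bounds(num_samples: int, segment_seconds: float = 10.0):
    """(start, end) sample indices of consecutive non-overlapping crops."""
    seg = int(segment_seconds * SAMPLE_RATE)
    return [(s, min(s + seg, num_samples)) for s in range(0, num_samples, seg)]


full_samples = 5 * 60 * SAMPLE_RATE          # the 5-minute file
bounds = segment_bounds(full_samples)        # 10 s crops
frames_full = num_frames(full_samples)
frames_cropped = sum(num_frames(end - start) for start, end in bounds)

print(len(bounds), frames_full, frames_cropped)
```

Under these assumptions the whole file yields 14999 frames, while the thirty 10 s crops yield 499 frames each (14970 in total), so the two feature sequences cannot be compared index by index without first aligning frames at the crop boundaries.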