OFA-Sys / ONE-PEACE

A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Apache License 2.0
935 stars, 57 forks

About preparing an audio file for analysis #42

Closed. paapu88 closed this issue 9 months ago

paapu88 commented 9 months ago

Dear Developers, I prepared a sound file in .flac format and asked ONE-PEACE to classify it. To my surprise, the predicted label was not 'male-speech' as expected, but:

audio: assets/talk1.flac, predict label: mouse clicking (I used assets/talk1.flac instead of assets/cow.flac in the example from your README.md)

That file has the following parameters:

samplerate: 16000 Hz
channels: 1
duration: 4.992 s
format: FLAC (Free Lossless Audio Codec) [FLAC]
subtype: Signed 16 bit PCM [PCM_16]

These seem to be the same as for the file assets/cow.flac.

What could be the reason for the misclassification? Should the sound file be prepared in some special way? The talk1.flac file is attached (zipped because of GitHub restrictions): talk.zip

Regards, Markus

logicwong commented 9 months ago

It might be that the current classification model, which is fine-tuned on VGGSound, is not robust enough. When I increased the duration of talk1.flac to 15 seconds, the model classified it correctly. You can try making this modification and see if it improves the accuracy on your dataset.

https://github.com/OFA-Sys/ONE-PEACE/commit/4c97744c66cafe0c1e10907e82dff334b8dbb40b
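The thread does not say how the clip was lengthened to 15 seconds. One possible approach, shown here only as a minimal sketch, is to loop the original samples until the target length is reached. The helper `loop_to_duration` is hypothetical and is not part of ONE-PEACE; it works on a plain list of samples and leaves reading and writing the FLAC file to whatever audio library you use.

```python
from itertools import cycle, islice


def loop_to_duration(samples, sample_rate, target_seconds):
    """Repeat a mono sample sequence until it lasts target_seconds.

    Hypothetical helper for illustration; not part of ONE-PEACE.
    """
    if not samples:
        raise ValueError("expected a non-empty sequence of samples")
    target_len = round(target_seconds * sample_rate)
    # cycle() restarts the clip whenever it runs out; islice() trims the excess.
    return list(islice(cycle(samples), target_len))


# Stand-in for the 4.992 s, 16 kHz clip from the thread (silence, illustration only).
sample_rate = 16000
clip = [0.0] * round(4.992 * sample_rate)
extended = loop_to_duration(clip, sample_rate, 15.0)
print(len(clip), len(extended))
```

Looping keeps the signal statistics of the original recording, whereas zero-padding would add silence. Which of the two (if either) matches what the model saw during fine-tuning is not stated in the thread.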

paapu88 commented 9 months ago

Well, I recorded my voice for 15 s and the classification was still wrong. This time it was:

audio: assets/talk1.flac, predict label: sharpen knife (the file is attached) talk2.zip

Are there any steps in preparing the audio that I'm missing?

logicwong commented 9 months ago

It appears that the model's top-1 accuracy is insufficient. If you don't have a strict requirement for top-1 accuracy, consider outputting the top-5 predicted labels instead.
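As a rough illustration of the top-5 idea, the sketch below ranks class scores and returns the k most likely labels. It assumes you already have one logit per class from the classifier. The function names, the label list, and the logit values are all made up for the example and do not come from the ONE-PEACE API or from real model output.

```python
import heapq
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, labels, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    probs = softmax(logits)
    best = heapq.nlargest(k, zip(probs, labels), key=lambda pair: pair[0])
    return [(label, prob) for prob, label in best]


# Hypothetical labels and logits, for illustration only.
labels = ["mouse clicking", "sharpen knife", "male speech",
          "cow lowing", "typing", "wind noise"]
logits = [2.1, 1.9, 1.8, -0.5, 0.3, -1.0]

top3 = top_k_labels(logits, labels, k=3)
for label, prob in top3:
    print(f"{label}: {prob:.3f}")
```

With scores this close together, the expected label can sit just below the top-1 prediction, which is why inspecting several candidates can be more informative than a single label.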

paapu88 commented 9 months ago

Ok, thanks.