JarodMica / audiosplitter_whisper

MIT License

Inaccurate Splitting with whisper #14

Open Dannypeja opened 11 months ago

Dannypeja commented 11 months ago

Hey, I wanted to ask if anybody else is facing this issue. I am using parts of this repo to split a long recording into per-speaker utterances with diarization.

However, the splits are not word-accurate. Most of the split segments are cut off at the end, which is quite bad for TTS datasets. Increasing the padding didn't solve the cut-word issue; it just moved the cut to a later position in the sentence.
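To illustrate why more padding only delays the cut: if Whisper's end timestamp lands mid-word, padding shifts the boundary outward but the next segment's start still caps it, so a straddling word can remain clipped. A minimal sketch (not code from this repo; `pad_segments` and the timestamps are made up):

```python
def pad_segments(segments, padding, total_duration):
    """Pad (start, end) second tuples outward, clamped to the clip length
    and to the start of the next segment so segments don't overlap."""
    padded = []
    for i, (start, end) in enumerate(segments):
        new_start = max(0.0, start - padding)
        new_end = min(total_duration, end + padding)
        if i + 1 < len(segments):
            # padding cannot push past the next segment's start
            new_end = min(new_end, segments[i + 1][0])
        padded.append((new_start, new_end))
    return padded

# Whisper-style timestamps; the first end (2.1 s) is assumed to be too early
segments = [(0.0, 2.1), (2.4, 5.0)]
print([(round(s, 2), round(e, 2)) for s, e in pad_segments(segments, 0.2, 6.0)])
```

If the model's end timestamp is simply wrong, no amount of symmetric padding fixes it; the boundary just moves.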

Any ideas? Thanks a lot!

JarodMica commented 11 months ago

I believe there are some accuracy limitations when there are multiple speakers in the audio, as this can throw off the timestamps. You can try to clean out as much background noise as possible with UVR to help Whisper, but I've noticed its accuracy varies from dataset to dataset. You might also try my other repo, which uses a silence threshold to split a dataset, and see if that works better for your needs: https://github.com/JarodMica/audiosplitter

Note, this does not have speaker diarization though.
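For reference, silence-threshold splitting (the approach the linked repo and Audacity take, as opposed to Whisper's model-predicted timestamps) boils down to something like this toy sketch over raw amplitude values. The function name and thresholds are illustrative, not from either repo:

```python
def split_on_silence(samples, threshold, min_silence_len):
    """Return (start, end) sample-index pairs of regions louder than
    `threshold`, separated by at least `min_silence_len` quiet samples."""
    regions = []
    start = None   # start index of the current loud region, if any
    quiet = 0      # consecutive quiet samples seen since the last loud one
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence_len:
                # close the region where the silence run began
                regions.append((start, i - quiet + 1))
                start = None
                quiet = 0
    if start is not None:
        regions.append((start, len(samples)))
    return regions

print(split_on_silence([0, 0, 5, 6, 0, 0, 0, 4, 0, 0],
                       threshold=1, min_silence_len=2))  # -> [(2, 4), (7, 8)]
```

Cutting at silence can't clip a word mid-utterance the way a wrong timestamp can, which is why it tends to be safer for TTS datasets, but as noted it carries no speaker information.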

Dannypeja commented 11 months ago

Yeah, the diarization is key for my purpose, unfortunately. Thanks for the quick response! Using a silence threshold, I found Audacity to be the gold standard.

I propose the following workaround for my issue:

1. Use Whisper to detect whenever speaker X starts talking.
2. Export the marks to an editing software.
3. Manually cut out longer talk sections.
4. Now use Audacity or your audiosplitter to get good datasets.
5. Then run transcription again.