Adding fix for multi-byte segments in whisper.cpp

raivisdejus commented 5 months ago

Transcription in Latvian sometimes failed with the error `Failed utf-8 codec can't decode byte 0xc4 in position 0: unexpected end of data`. This seems to be the problem described in https://github.com/ggerganov/whisper.cpp/issues/1798, where the bytes of a multi-byte UTF-8 character are returned in separate segments and the UTF-8 decoder fails to process them. This PR fixes that issue.
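To illustrate the failure mode, here is a minimal sketch, not the code in this PR. It assumes the raw segment texts arrive as `bytes` and uses Python's standard incremental UTF-8 decoder, which holds back an incomplete trailing byte until the rest of the character arrives. The function name `decode_segments` and the sample data are hypothetical.

```python
import codecs


def decode_segments(raw_segments):
    """Decode byte segments whose boundaries may split a UTF-8 character."""
    # The incremental decoder buffers an incomplete multi-byte sequence
    # instead of raising "unexpected end of data".
    decoder = codecs.getincrementaldecoder("utf-8")()
    texts = []
    for raw in raw_segments:
        text = decoder.decode(raw)
        if text:
            texts.append(text)
    # Flush the decoder; this raises UnicodeDecodeError if the stream
    # really does end in the middle of a character.
    tail = decoder.decode(b"", final=True)
    if tail:
        texts.append(tail)
    return texts


# "ā" is encoded as the two bytes 0xc4 0x81; here they land in different segments.
segments = [b"m\xc4", b"\x81ja"]
print(decode_segments(segments))
```

Decoding each segment on its own with `bytes.decode("utf-8")` would fail on the first segment, because it ends in the lone lead byte `0xc4`.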

This PR also fixes an issue where, with the "Word-level timings" setting enabled, words get split into separate segments, which makes the feature less usable in real-world situations. The changes in this PR combine whisper.cpp segments at word boundaries, using a space as the boundary.
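The merging idea can be sketched roughly as follows. This is an illustration under assumptions rather than the PR's implementation: the `Segment` dataclass is hypothetical, and the sketch assumes that a segment starting a new word carries a leading space in its text, while continuation pieces do not.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: int  # start time, in whatever unit the transcriber reports
    end: int    # end time, same unit
    text: str


def merge_on_spaces(segments):
    """Join word fragments so that each merged segment starts at a space."""
    merged = []
    for seg in segments:
        if merged and not seg.text.startswith(" "):
            # Continuation of the previous word: extend its text and end time.
            last = merged[-1]
            merged[-1] = Segment(last.start, seg.end, last.text + seg.text)
        else:
            # A leading space (or the very first segment) starts a new word.
            merged.append(Segment(seg.start, seg.end, seg.text))
    return merged


parts = [
    Segment(0, 10, " Hel"),
    Segment(10, 20, "lo"),
    Segment(20, 30, " wor"),
    Segment(30, 40, "ld"),
]
for word in merge_on_spaces(parts):
    print(word)
```

Each merged segment keeps the start time of its first fragment and the end time of its last one, so word-level timings stay aligned with whole words.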

The unclear part concerns languages where a space may not be a proper word boundary. If someone has relevant comments on word boundaries in languages like Chinese, I am happy to adjust the solution.

codecov[bot] commented 5 months ago

Codecov Report

Attention: Patch coverage is 78.94737%, with 8 lines in your changes missing coverage. Please review.

Project coverage is 81.30%. Comparing base (d483864) to head (5b85a81). Report is 3 commits behind head on main.

:exclamation: Current head 5b85a81 differs from pull request most recent head 3513158. Consider uploading reports for the commit 3513158 to get more accurate results

| Files | Patch % | Lines |
|---|---|---|
| buzz/transcriber/whisper_cpp.py | 78.94% | 8 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #734      +/-   ##
==========================================
- Coverage   81.97%   81.30%   -0.68%
==========================================
  Files          83       81       -2
  Lines        3840     3610     -230
==========================================
- Hits         3148     2935     -213
+ Misses        692      675      -17
```

| [Flag](https://app.codecov.io/gh/chidiwilliams/buzz/pull/734/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Chidi+Williams) | Coverage Δ | |
|---|---|---|
| [Linux](https://app.codecov.io/gh/chidiwilliams/buzz/pull/734/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Chidi+Williams) | `?` | |
| [Windows](https://app.codecov.io/gh/chidiwilliams/buzz/pull/734/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Chidi+Williams) | `81.30% <78.94%> (-0.07%)` | :arrow_down: |
| [macOS](https://app.codecov.io/gh/chidiwilliams/buzz/pull/734/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Chidi+Williams) | `?` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Chidi+Williams#carryforward-flags-in-the-pull-request-comment) to find out more.


chidiwilliams commented 5 months ago

Awesome, thank you. I've given you commit access to the repo, if you're interested in joining as well. Cheers.
