Closed raivisdejus closed 5 months ago
Attention: Patch coverage is 78.94737% with 8 lines in your changes missing coverage. Please review.
Project coverage is 81.30%. Comparing base (d483864) to head (5b85a81). Report is 3 commits behind head on main.
:exclamation: Current head 5b85a81 differs from pull request most recent head 3513158. Consider uploading reports for the commit 3513158 to get more accurate results.
Files | Patch % | Lines |
---|---|---|
buzz/transcriber/whisper_cpp.py | 78.94% | 8 Missing :warning: |
Awesome, thank you. I've given you commit access to the repo, if you're interested in joining as well. Cheers.
Sometimes transcription in Latvian failed with the error `Failed utf-8 codec can't decode byte 0xc4 in position 0: unexpected end of data`. This seems to be the problem referenced in https://github.com/ggerganov/whisper.cpp/issues/1798, where the bytes of a multi-byte UTF-8 character get returned in separate segments and the UTF-8 decoder fails to process them. This PR fixes this issue.

This PR also fixes an issue where, with the "Word-level timings" setting enabled, words get split into separate segments, making the feature less usable in real-world situations. The changes in this PR combine whisper.cpp segments at word boundaries, using a space as the boundary.
The unclear part is languages where a space may not be a proper word boundary. If someone has relevant comments on word boundaries in languages like Chinese, I am happy to adjust the solution.
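For reviewers who want to see the idea in isolation, here is a minimal sketch of the merging approach described above. It is not the code in `buzz/transcriber/whisper_cpp.py`: the `Segment` class, the `merge_segments` function, and the `(start, end, raw_bytes)` tuple shape are hypothetical names chosen for illustration. It assumes each whisper.cpp token arrives as raw bytes and that a token beginning with a space starts a new word.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Segment:
    # Hypothetical container for illustration, not Buzz's own class.
    start: int
    end: int
    text: str


def merge_segments(raw: Iterable[Tuple[int, int, bytes]]) -> List[Segment]:
    """Merge (start, end, raw_bytes) tokens into word-level segments.

    Bytes are buffered and decoded only when a word is complete, so a
    multi-byte UTF-8 character split across tokens is decoded as a whole.
    """
    merged: List[Segment] = []
    buf, buf_start, buf_end = b"", 0, 0

    def flush() -> None:
        # errors="replace" keeps a truly malformed sequence from raising.
        merged.append(Segment(buf_start, buf_end, buf.decode("utf-8", errors="replace")))

    for start, end, data in raw:
        # Assumption: a leading space marks the start of a new word.
        if data.startswith(b" ") and buf:
            flush()
            buf = b""
        if not buf:
            buf_start = start
        buf += data
        buf_end = end

    if buf:
        flush()
    return merged


# "ā" is encoded as the two bytes 0xC4 0x81; here they arrive in separate tokens.
tokens = [(0, 10, b" m"), (10, 20, b"\xc4"), (20, 30, b"\x81ja"), (30, 40, b" ir")]
words = merge_segments(tokens)
```

Decoding `b"\xc4"` on its own raises `UnicodeDecodeError`, which is the failure quoted in the description. With buffering, the two halves of "ā" are joined before decoding. Because the sketch splits only on spaces, text in a language written without spaces would end up in a single segment, which is the open question raised above.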