Purfview / whisper-standalone-win

Whisper & Faster-Whisper standalone executables for those who don't want to bother with Python.

Three transcription problems #295

Open oep42 opened 1 month ago

oep42 commented 1 month ago

Using Faster-Whisper-XXL_r192.3.4_windows.

Initial

Problem with an initial in a transcription with --max_line_width 37 --max_line_count 2 --sentence --max_comma_cent 50: Purf problem 2-B

Problem with initials in a transcription with --max_line_width 37 --max_line_count 2 --sentence --max_comma_cent 50: Purf Problem 2-A

Hyphen

Problem with hyphen in a transcription with --max_line_width 37 --max_line_count 2 --sentence --max_comma_cent 50: Purf Problem 3-C

Problem with hyphen in "Glass-Steagall" in a transcription with --max_line_width 37 --max_line_count 2 --sentence --max_comma_cent 50: Purf problem 3-B

Problem with hyphen in a transcription with --max_line_width 37 --max_line_count 2 --sentence --max_comma_cent 50: Purf problem 3 (When Subtitle Edit automatically removes or repositions a line break that is followed by a hyphen, it replaces the line break with a space. If that happens to this sentence, it turns into this: "In 1936, Roosevelt stood for re -election.")

Instead of like this:

There was something about this middle
-class Florentine housewife that

It should look like this:

There was something about this middle-
class Florentine housewife that

This would be even better (no risk of problems with subsequent processing):

There was something about this
middle-class Florentine housewife that

This would also be better:

There was something about this middle-class
Florentine housewife that

This hyphen problem may be related to what is discussed here.
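To illustrate why the position of the break matters, here is a minimal Python sketch. It assumes that later processing simply swaps each line break for a space, which is how the behaviour in Subtitle Edit looks to me; `join_lines` is a made-up helper for this example, not part of Subtitle Edit or Faster-Whisper-XXL.

```python
def join_lines(subtitle: str) -> str:
    """Merge a multi-line subtitle into one line.

    Assumption: each line break is replaced by a single space.
    """
    return " ".join(subtitle.split("\n"))


# Break placed directly before the hyphen (current output).
bad = "There was something about this middle\n-class Florentine housewife that"
# Break placed at a position where the text already has a space.
good = "There was something about this\nmiddle-class Florentine housewife that"

print(join_lines(bad))   # the compound is split: "middle -class"
print(join_lines(good))  # the compound survives: "middle-class"
```

Only the second variant survives the round trip unchanged.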

Comma

Problem with comma in a transcription with --standard: Purf Problem 1-A

Problem with comma in a transcription with --max_line_width 37 --max_line_count 2 --sentence --max_comma_cent 50: Purf Problem 1-B (When Subtitle Edit automatically removes a line break that is followed by a comma, it replaces the line break with a space. If that happens to this subtitle, it turns into this: "under the Linden, how there were 100 ,000 of people when they passed".)

It would be best if line breaks were only created at positions where the line break can be replaced by a space character.
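As a rough sketch of that rule (not how Faster-Whisper-XXL actually splits lines, which I don't know), Python's standard `textwrap` module can be told to break only at whitespace. `wrap_at_spaces` is a hypothetical helper name, and the sketch ignores --max_line_count and timing entirely.

```python
import textwrap


def wrap_at_spaces(text: str, width: int = 37) -> list[str]:
    """Wrap text so that every line break falls on existing whitespace.

    break_on_hyphens=False keeps compounds like "middle-class" together;
    break_long_words=False never cuts inside a word or a number.
    """
    return textwrap.wrap(
        text,
        width=width,
        break_on_hyphens=False,
        break_long_words=False,
    )


hyphen_text = "There was something about this middle-class Florentine housewife that"
comma_text = "under the Linden, how there were 100,000 of people when they passed"

for sample in (hyphen_text, comma_text):
    lines = wrap_at_spaces(sample)
    print("\n".join(lines))
    # Replacing each break with a space gives back the original text.
    print(" ".join(lines) == sample)
    print()
```

Because "100,000" and "middle-class" contain no whitespace, they are never split, so no line can start with a comma or a hyphen.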

Added later: It may not be possible to find a satisfactory solution for the problem with initials. See here.

Added later: Not sure how many people pronounce the English abbreviations "i.e." and "e.g." as letters, but if they do, something like this can happen: abbreviation ie

kadrimarzouki commented 1 month ago

Any updates on this request whatsoever?