Closed GOvEy1nw closed 6 months ago
The standalone version doesn’t appear to have any source code so I can’t decipher what’s happening. We use stable-ts, but there are different ways to split the dialogue. See https://github.com/jianfch/stable-ts?tab=readme-ov-file#regrouping-words. Open to any suggestions.
I made a separate branch if you want to toy with the idea: https://github.com/McCloudS/subgen/blob/Custom-Params/subgen.py
It reads the regroup string from an environment variable: custom_regroup = os.getenv('CUSTOM_REGROUP', '')
The value is the regroup string described at the link above. The default that is run on the model is cm_sp=,* /，_sg=.5_mg=.3+3_sp=.* /。/?/？
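A minimal sketch of how such an environment variable could be resolved, assuming an empty or unset value should fall back to the library default. `resolve_regroup` is a hypothetical helper, not code from the linked branch; the fallback to `True` relies on the pasted docstring, which says `True` selects the default algorithm 'da'.

```python
import os


def resolve_regroup(env=os.environ):
    """Return the value to use as the regroup setting (hypothetical helper).

    A non-empty CUSTOM_REGROUP is returned as-is; otherwise True is returned,
    which the stable-ts docstring describes as "use the default algorithm 'da'".
    """
    custom = env.get("CUSTOM_REGROUP", "").strip()
    return custom if custom else True


# Passing a plain dict makes the behaviour easy to check without touching
# the real environment.
print(resolve_regroup({}))                                # True
print(resolve_regroup({"CUSTOM_REGROUP": "ms_sg=.5"}))    # ms_sg=.5
```

The returned value would then be handed to whatever regroup parameter the transcription call exposes; how the branch actually wires it in is not shown here.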
Instructions pasted below:` Regroup (in-place) words into segments.
Parameters
----------
regroup_algo: str or bool, default 'da'
String representation of a custom regrouping algorithm, or ``True`` to use the default algorithm 'da'.
verbose : bool, default False
Whether to show all the methods and arguments parsed from ``regroup_algo``.
only_show : bool, default False
Whether to show all the methods and arguments parsed from ``regroup_algo`` without running the methods.
Returns
-------
stable_whisper.result.WhisperResult
The current instance after the changes.
Notes
-----
Syntax for string representation of custom regrouping algorithm.
Method keys:
sg: split_by_gap
sp: split_by_punctuation
sl: split_by_length
sd: split_by_duration
mg: merge_by_gap
mp: merge_by_punctuation
ms: merge_all_segments
cm: clamp_max
l: lock
us: unlock_all_segments
da: default algorithm (cm_sp=,* /，_sg=.5_mg=.3+3_sp=.* /。/?/？)
rw: remove_word
rs: remove_segment
rp: remove_repetition
rws: remove_words_by_str
fg: fill_in_gaps
Metacharacters:
= separates a method key and its arguments (not used if no argument)
_ separates method keys (after arguments if there are any)
+ separates arguments for a method key
/ separates an argument into list of strings
* separates an item in list of strings into a nested list of strings
Notes:
-arguments are parsed positionally
-if no argument is provided, the default ones will be used
-use 1 or 0 to represent True or False
Example 1:
merge_by_gap(.2, 10, lock=True)
mg=.2+10+++1
Note: [lock] is the 5th argument, hence the 2 missing arguments in between the three + before 1
Example 2:
split_by_punctuation([('.', ' '), '。', '?', '？'], True)
sp=.* /。/?/？+1
Example 3:
merge_all_segments().split_by_gap(.5).merge_by_gap(.15, 3)
ms_sg=.5_mg=.15+3`
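To make the metacharacter rules above concrete, here is a toy parser that splits a regroup string into method keys and positional arguments. It is a sketch of the syntax as described in the docstring, not stable-ts's own parser: `parse_regroup` is a hypothetical name, values are left as strings rather than converted to numbers or booleans, and arguments containing a literal `_` or `+` are not handled.

```python
def parse_regroup(algo):
    """Split a regroup string into (method_key, args) pairs (illustrative only)."""
    steps = []
    for chunk in algo.split("_"):              # "_" separates method keys
        key, has_args, raw = chunk.partition("=")  # "=" separates key and arguments
        args = []
        if has_args:
            for arg in raw.split("+"):         # "+" separates positional arguments
                if arg == "":
                    args.append(None)          # skipped argument: keep the default
                elif "/" in arg or "*" in arg:
                    items = []
                    for item in arg.split("/"):   # "/" makes a list of strings
                        # "*" turns one list item into a nested group
                        items.append(tuple(item.split("*")) if "*" in item else item)
                    args.append(items)
                else:
                    args.append(arg)
        steps.append((key, args))
    return steps


print(parse_regroup("mg=.2+10+++1"))
# [('mg', ['.2', '10', None, None, '1'])]
print(parse_regroup("ms_sg=.5_mg=.15+3"))
# [('ms', []), ('sg', ['.5']), ('mg', ['.15', '3'])]
```

In the first output, the two `None` entries are the skipped third and fourth arguments, which is why `1` lands in the fifth position (`lock`), as Example 1 notes.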
I'm still experimenting, but cm_sl=84_sl=42++++++1
produces two-line subtitles when the dialogue exceeds a certain length. Otherwise, it still tries to find natural breaks.
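Counting the `+` signs by hand is error-prone, so a small helper can build such strings from positional arguments. This is a sketch under assumptions: `encode_step` and `encode_algo` are hypothetical names, and reading the seventh positional argument of `split_by_length` as the newline option is my understanding of the stable-ts signature, which should be checked against the installed version.

```python
def encode_step(key, *args):
    """Encode one method call as a regroup step; None marks a skipped argument."""
    def enc(value):
        if value is None:
            return ""                       # empty slot keeps the default
        if isinstance(value, bool):
            return "1" if value else "0"    # booleans are written as 1 or 0
        return str(value)

    if not args:
        return key                          # no "=" when there are no arguments
    return key + "=" + "+".join(enc(a) for a in args)


def encode_algo(*steps):
    """Join encoded steps with "_", the method-key separator."""
    return "_".join(steps)


algo = encode_algo(
    "cm",
    encode_step("sl", 84),
    # Seven positional slots; only the first and the seventh are set.
    encode_step("sl", 42, None, None, None, None, None, True),
)
print(algo)  # cm_sl=84_sl=42++++++1
```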
Hey, I'm a Windows user, and I'm really grateful for Subgen, as it's the simplest way to get Whisper running with Bazarr on Windows without having to use Docker etc.
However, one thing I've noticed is that the subtitles aren't formatted very well, due to how Faster-Whisper operates. I've found that the standalone Faster Whisper (https://github.com/Purfview/whisper-standalone-win) has a great optional argument called --standard, which does the following:
--standard: Quick hardcoded preset to split lines in standard way. 42 chars per 2 lines with max_comma_cent=70 and --sentence are activated automatically.
--sentence: Enables splitting lines to sentences for srt and vtt subs. Every sentence starts in the new segment. By default meant to output whole sentence per line for better translations, but not limited to, read about '--max_...' parameters.
This gives the subtitles a much more standardized look that is common across streaming services such as Netflix, BBC, etc.
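As a rough illustration of what "42 chars per 2 lines" means for a subtitle cue, the layout can be approximated with Python's standard `textwrap` module. This is only a sketch of the target shape, not how the standalone Faster Whisper implements --standard; `wrap_cue` is a hypothetical helper and it ignores sentence boundaries and timing entirely.

```python
import textwrap


def wrap_cue(text, width=42, max_lines=2):
    """Wrap text to `width` characters and group lines into cues of `max_lines`."""
    lines = textwrap.wrap(text, width=width)
    return [
        "\n".join(lines[i:i + max_lines])
        for i in range(0, len(lines), max_lines)
    ]


sample = (
    "However, one thing I've noticed is that the subtitles "
    "aren't formatted very well, so long sentences run on."
)
for cue in wrap_cue(sample):
    print(cue)
    print("--")
```

Each printed cue holds at most two lines of at most 42 characters; a real implementation would also need to split the timestamps to match.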
Is it possible to implement these into SubGen, please?