Closed DrKain closed 1 year ago
Hey, would be interested in taking a stab at this if you wouldn't mind? Will have some time this weekend / next week to properly investigate.
Sure thing
Have made some initial progress on this:
For example, if the subtitle was Movie Title (1990).de.srt the extracted language code would be de (German).
When cleaning a file without a language code, default to the main filters. If the language code is found, load and apply the language based filters.
Have managed to extract the language code and run the associated filters. I have tested this initially with a de-main.json file for German subtitles (my German is very basic, so this may need amending if you / anyone else is aware of specific strings that should be filtered, etc.).
Have also included log messages to alert the user to whether a language code has been detected and associated filter used, or if the regular, main filters have been used instead.
Have yet to think about this:
Support for a --lang parameter should also be added, allowing users to load and apply multiple language filters and overriding the language code found in the file name. Example parameter: --lang=de,en,he loads English, Hebrew and German filters.
...if you / anyone else is aware of specific strings that should be filtered etc.
The existing English filters would have a lot of crossover, but any others would need to be added as they're discovered in existing subtitles.
Part of the reason I've delayed this issue for so long, and instead relied on the custom.json file, is that downloading and looking through hundreds of foreign subtitles and creating rules is more tedious than it sounds.
Once support for other languages is added I would be relying almost entirely on user contributions to fill most of them.
For sure - the functionality for parsing them should be good to go though, but you're right, probably relies heavily on native speakers for specific language filters etc!
Hi @winterborn, any update on this? There are a few other issues I'd like to tackle at some point that would benefit from this issue. No pressure though.
Hey @DrKain, sorry, work has got in the way recently so I haven't had time to come back to this and work on passing in the --lang parameter etc.
I've completed the extraction of the language code and its associated filters, but like I said, I've tested this with a very rudimentary subtitle.de.srt file... so I guess the language-specific JSON files could be moved to another issue for native speakers to pick up as and when?
I can push the above though if it helps your issues, and could create a separate issue for passing the --lang parameter, to pick up when I / someone else has free time :)
Let me know what you think!
Sure. If you wouldn't mind creating a PR into another branch I can continue working on it.
Just published the new version, you've been credited in the update description. Thanks again for the help
https://github.com/DrKain/subclean/releases/tag/1.5.1
That's awesome, thanks for the credit! - Hope it helped!
Right now the filters are almost exclusively English. I'd like this tool to support other languages, applied when the language code in the file name matches or when a parameter is passed, e.g. --lang=de / --lang=en,de
The filters directory should contain language-based filters named using the two-letter ISO 639-1 code. The filter file would be named accordingly, probably something like de-main.json or de.main.json. If you think of a better name/format, please feel free to comment your thoughts here.
For example, if the subtitle was Movie Title (1990).de.srt the extracted language code would be de (German).
When cleaning a file without a language code, default to the main filters. If the language code is found, load and apply the language-based filters.
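A minimal sketch of the extraction step described above, assuming the code is always the dot-separated segment immediately before the .srt extension. The function name extractLanguageCode is hypothetical and is not taken from the subclean source; it only checks the shape of the segment (two letters), not whether it is a real ISO 639-1 code.

```typescript
// Hypothetical helper, not part of subclean: pulls a two-letter code out of
// names such as "Movie Title (1990).de.srt".
function extractLanguageCode(fileName: string): string | null {
    // Look for ".xx.srt" at the very end of the name, where xx is two letters.
    const match = /\.([a-z]{2})\.srt$/i.exec(fileName);
    // No match means no language segment, so the caller should fall back
    // to the main filters.
    return match ? match[1].toLowerCase() : null;
}

console.log(extractLanguageCode("Movie Title (1990).de.srt")); // "de"
console.log(extractLanguageCode("Movie Title (1990).srt"));    // null
```

A null result maps to the "default to the main filters" case; any other value would be used to pick the matching language filter file.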
Support for a --lang parameter should also be added, allowing users to load and apply multiple language filters and overriding the language code found in the file name. Example parameter: --lang=de,en,he loads German, English and Hebrew filters.
If a language that's not supported is requested, a message should be displayed to the user and the clean should continue as usual.
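One way the --lang behaviour could be sketched, under a few assumptions: the argument arrives as a single string such as --lang=de,en,he, the list of supported codes is known up front, and unsupported codes are reported and skipped rather than treated as errors. The names parseLangParam and resolveLanguages and the supported list are hypothetical, not subclean's actual API.

```typescript
// Hypothetical helper: turns "--lang=de,en,he" (or "de,en,he") into
// a de-duplicated, lower-cased list of codes.
function parseLangParam(arg: string | undefined): string[] {
    if (!arg) return [];
    return arg
        .replace(/^--lang=/, "")
        .split(",")
        .map((code) => code.trim().toLowerCase())
        .filter((code, index, all) => code.length > 0 && all.indexOf(code) === index);
}

// Hypothetical helper: decides which language filters to load.
// A non-empty --lang value overrides the code found in the file name.
function resolveLanguages(
    langArg: string | undefined,
    fileCode: string | null,
    supported: string[],
    log: (message: string) => void = console.log
): string[] {
    const requested = parseLangParam(langArg);
    const candidates = requested.length > 0 ? requested : fileCode ? [fileCode] : [];
    const usable: string[] = [];
    for (const code of candidates) {
        if (supported.indexOf(code) !== -1) {
            usable.push(code);
        } else {
            // Tell the user, then carry on with whatever is left.
            log(`No filters available for language "${code}", skipping it`);
        }
    }
    return usable;
}

// Assumed supported list, for illustration only.
console.log(resolveLanguages("--lang=de,en,he", "fr", ["en", "de"]));
```

An empty result would mean that no language filters apply, so the clean continues with the main filters only.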
Right now I don't have the time to add this, so the issue is open if another user would like to take a shot.
As an extra note, you can apply your own non-English filters using the custom filters until this issue is implemented.