DrKain / subclean

A cross-platform CLI tool and node module to remove advertising from subtitles. Supports Bazarr and bulk cleaning!
MIT License

[Feature Request] Support for other languages #23

Closed DrKain closed 1 year ago

DrKain commented 1 year ago

Right now the filters are almost exclusively English. I'd like this tool to support other languages, with the extra filters applied when the language code matches or when a parameter is passed, e.g. --lang=de / --lang=en,de

The filters directory should contain language-based filters named using the two-letter ISO 639-1 code. The filter file would be named accordingly, probably something like de-main.json or de.main.json.
If you think of a better name/format, please feel free to comment your thoughts here.

For example, if the subtitle was Movie Title (1990).de.srt the extracted language code would be de (German).
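A minimal sketch of that extraction, assuming the code always sits between two dots directly before the file extension. The function name `extractLanguageCode` is hypothetical and not taken from the subclean source, and the sketch does not check that the two letters are a real ISO 639-1 code.

```typescript
// Hypothetical helper for illustration only; not part of subclean's actual API.
// Returns the two-letter code found right before the extension, or null if none.
function extractLanguageCode(fileName: string): string | null {
    // Looks for ".xx.ext" at the very end of the name,
    // e.g. "Movie Title (1990).de.srt" yields "de".
    const match = /\.([a-z]{2})\.[a-z0-9]+$/i.exec(fileName);
    return match ? match[1].toLowerCase() : null;
}
```

With this pattern, a name such as Movie Title (1990).srt has no code segment, so the helper returns null and the caller can fall back to the main filters.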

When cleaning a file without a language code, default to the main filters.
If the language code is found, load and apply the language based filters.

Support for a --lang parameter should also be added, allowing users to load and apply multiple language filters and overriding the language code found in the file name. Example parameter: --lang=de,en,he loads English, Hebrew and German filters.
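One possible way to handle the parameter, sketched under the assumption that the value arrives as a raw comma-separated string. Both `parseLangArgument` and `resolveLanguages` are hypothetical names used only for illustration.

```typescript
// Hypothetical helper: turn a value like "de,en,he" into a clean list of codes.
function parseLangArgument(value: string): string[] {
    const codes = value
        .split(",")
        .map((code) => code.trim().toLowerCase())
        .filter((code) => code.length > 0);
    // Drop duplicates while keeping the order the user typed.
    return Array.from(new Set(codes));
}

// Hypothetical helper: an explicit --lang value overrides the code
// detected from the file name; otherwise the detected code is used.
function resolveLanguages(langArgument: string | undefined, detected: string | null): string[] {
    if (langArgument !== undefined) {
        const requested = parseLangArgument(langArgument);
        if (requested.length > 0) return requested;
    }
    return detected ? [detected] : [];
}
```

An empty result here would mean no language was requested or detected, which is the case where the main filters apply.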

If an unsupported language is requested, a message should be displayed to the user and the clean should continue as usual.
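A rough sketch of that behaviour, assuming the de-main.json naming suggested above and a default filter file called main.json (that name is an assumption, as is the `selectFilterFiles` helper itself). The set of available files is passed in so the sketch stays independent of the file system.

```typescript
interface FilterSelection {
    files: string[];    // filter files to load, in order
    warnings: string[]; // messages to show the user
}

// Hypothetical helper: pick filter files for the requested languages and
// collect a warning for each language that has no filter file.
function selectFilterFiles(requested: string[], available: ReadonlySet<string>): FilterSelection {
    const files: string[] = [];
    const warnings: string[] = [];
    for (const code of requested) {
        const fileName = `${code}-main.json`; // naming scheme proposed in this issue
        if (available.has(fileName)) {
            files.push(fileName);
        } else {
            warnings.push(`No filters found for language "${code}", skipping it`);
        }
    }
    // Assumption: fall back to the default filters when no
    // language-specific file could be loaded, so the clean still runs.
    if (files.length === 0) files.push("main.json");
    return { files, warnings };
}
```

Returning the warnings instead of logging them directly lets the CLI and the node module decide separately how to report them.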


Right now I don't have the time to add this, so the issue is open if another user would like to take a shot.
As an extra note, you can apply your own non-English filters using the custom filters until this issue is implemented.

winterborn commented 1 year ago

Hey, would be interested in taking a stab at this if you wouldn't mind? Will have some time this weekend / next week to properly investigate.

DrKain commented 1 year ago

Sure thing

winterborn commented 1 year ago

Have made some initial progress on this:

For example, if the subtitle was Movie Title (1990).de.srt the extracted language code would be de (German).

When cleaning a file without a language code, default to the main filters. If the language code is found, load and apply the language based filters.

Have managed to extract the language code and run the associated filters. I've tested this initially with a de-main.json file for German subtitles (my German is very basic, so this may need amending if you / anyone else is aware of specific strings that should be filtered etc.).

Have also included log messages to alert the user to whether a language code has been detected and associated filter used, or if the regular, main filters have been used instead.

Have yet to think about this:

Support for a --lang parameter should also be added, allowing users to load and apply multiple language filters and overriding the language code found in the file name. Example parameter: --lang=de,en,he loads English, Hebrew and German filters.

DrKain commented 1 year ago

...if you / anyone else is aware of specific strings that should be filtered etc.

The existing English filters would have a lot of crossover, but any others would need to be added as they're discovered in existing subtitles. Part of the reason I've delayed this issue for so long, and instead relied on the custom.json file, is that downloading and looking through hundreds of foreign subtitles and creating rules is more tedious than it sounds.
Once support for other languages is added I would be relying almost entirely on user contributions to fill most of them.

winterborn commented 1 year ago

Part of the reason I've delayed this issue for so long, and instead relied on the custom.json file, is that downloading and looking through hundreds of foreign subtitles and creating rules is more tedious than it sounds. Once support for other languages is added I would be relying almost entirely on user contributions to fill most of them.

For sure - the functionality for parsing them should be good to go though, but you're right, probably relies heavily on native speakers for specific language filters etc!

DrKain commented 1 year ago

Hi @winterborn, any update on this? There are a few other issues I'd like to tackle at some point that would benefit from this issue. No pressure though.

winterborn commented 1 year ago

Hey @DrKain, sorry, work has got in the way recently so I haven't had time to come back to this and work on passing in the --lang parameter etc.

I've completed the extraction of the language code and its associated filters, but like I said, I've tested this with a very rudimentary subtitle.de.srt file... so I guess the language specific json files could be moved to another issue for native speakers to pick up as and when?

I can push the above though if it helps your issues and could create a separate issue for passing the --lang parameter to pick up when I / someone else has free time :)

Let me know what you think!

DrKain commented 1 year ago

Sure. If you wouldn't mind creating a PR into another branch I can continue working on it.

DrKain commented 1 year ago

Just published the new version, and you've been credited in the update description. Thanks again for the help.
https://github.com/DrKain/subclean/releases/tag/1.5.1

winterborn commented 1 year ago

That's awesome, thanks for the credit! - Hope it helped!