DrKain / subclean

A cross-platform CLI tool and node module to remove advertising from subtitles. Supports Bazarr and bulk cleaning!
MIT License

[Feature Request] More regex support in filters #18

Closed DrKain closed 2 years ago

DrKain commented 2 years ago

Right now there's minimal regex support in the filters, but I'd like to treat any filter starting with a forward slash as a regular expression.
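To illustrate the idea, here is a minimal sketch of how a filter could be compiled, assuming filters are plain strings and that a regex filter is wrapped in forward slashes. The `Matcher` type and `compileFilter` function are hypothetical names for this example and are not taken from the subclean source. The case-insensitive flag and the fallback to plain-text matching are also assumptions.

```typescript
// Hypothetical helper for illustration only; not subclean's actual code.
type Matcher = (text: string) => boolean;

function compileFilter(filter: string): Matcher {
    const first = filter.charAt(0);
    const last = filter.charAt(filter.length - 1);

    // Assumption: "/pattern/" marks a regular expression filter.
    if (filter.length > 2 && first === '/' && last === '/') {
        try {
            // Assumption: filters are matched case-insensitively.
            const regex = new RegExp(filter.slice(1, -1), 'i');
            return (text) => regex.test(text);
        } catch {
            // Invalid expression: fall through and treat it as plain text.
        }
    }

    // Plain filters match as a case-insensitive substring.
    const needle = filter.toLowerCase();
    return (text) => text.toLowerCase().indexOf(needle) !== -1;
}

// Usage: one regex filter covers both "subtitles:" and "subtitles :".
const subtitleCredit = compileFilter('/^subtitles\\s*(by|:)/');
console.log(subtitleCredit('subtitles: someone'));
console.log(subtitleCredit('subtitles : someone'));
```

Falling back to a substring match when the expression fails to compile keeps one malformed filter from breaking the whole list, though reporting the error to the user might be preferable in practice.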

Right now the filters can run ~50k checks in about a second (150 filters on 344 text nodes), so having a large list of filters isn't going to hurt performance that much, but I would like to avoid having to write two filters for something like "subtitles:" and "subtitles :".

Readability is the priority, so while I could turn 20 filters into a single regex, I'd rather avoid giant expressions.
For example, the following regex (demo):

/^(encoded|timing|subtitle(s|)|transcript|resync|ripped (and|\&) corrected)\s*(created by|by|:)/

... would replace as many as 14 rules at once.
Instead, I would rather have smaller, more readable regexes that are easier to edit and understand but still cut back on near-duplicate filters.

/^timing\s*:/
/^transcript\s*(by|:)/
/^subtitles\s*(by|:)/
/^ripped (and|\&) corrected by/
/^.\:\:\s*(sync|timings|transcript)/
/^sync(ed|) (and|\&) correct(ed|ion) by/
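As a quick sanity check, the six expressions above can be run against a few sample lines. This is only a sketch: the `i` flag, the `trim()` call, the `isAdvert` helper, and the sample lines are assumptions made for the example, not part of subclean.

```typescript
// The proposed expressions, copied from the list above.
// Assumption: the "i" flag is added here so capitalised lines also match.
const proposedFilters: RegExp[] = [
    /^timing\s*:/i,
    /^transcript\s*(by|:)/i,
    /^subtitles\s*(by|:)/i,
    /^ripped (and|\&) corrected by/i,
    /^.\:\:\s*(sync|timings|transcript)/i,
    /^sync(ed|) (and|\&) correct(ed|ion) by/i,
];

// Hypothetical helper: true if any proposed filter matches the line.
function isAdvert(line: string): boolean {
    const text = line.trim();
    return proposedFilters.some((filter) => filter.test(text));
}

// Made-up sample lines for illustration.
const samples: string[] = [
    'Subtitles: someone',
    'subtitles : someone',
    'Synced & corrected by someone',
    'Sync and correction by someone',
    '-:: Sync by someone',
    'The timing was perfect.',
];

for (const sample of samples) {
    console.log(isAdvert(sample) ? 'match   ' : 'no match', '|', sample);
}
```

Because every expression is anchored with `^`, ordinary dialogue that merely contains a word like "timing" or "transcript" mid-sentence is left alone.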

This would shorten the main filter list by at least 30 filters (~10k fewer checks, or about 20% of the work).
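The estimate can be reproduced with some quick arithmetic, assuming every filter is checked against every text node with no early exit. The `estimateChecks` function is a hypothetical name used only for this sketch; the figures are the ones quoted in this issue.

```typescript
// Assumption: total work is simply filters x text nodes.
function estimateChecks(filterCount: number, nodeCount: number): number {
    return filterCount * nodeCount;
}

const textNodes = 344;
const currentFilters = 150;
const removedFilters = 30;

const currentChecks = estimateChecks(currentFilters, textNodes); // 51600
const savedChecks = estimateChecks(removedFilters, textNodes);   // 10320
const savedFraction = savedChecks / currentChecks;               // 0.2

console.log(currentChecks, savedChecks, savedFraction);
```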

DrKain commented 2 years ago

I'll probably handle this in a few days. There's already regex support so it will be a simple update.
I'll take a look at #12 at the same time and see if I can't close that one too.

https://github.com/DrKain/subclean/blob/2425a399f815465486e36beffa54e99c889402a3/src/index.ts#L276-L278