common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences

Add a new rule - stem_separator_regex #187

Closed HarikalarKutusu closed 1 year ago

HarikalarKutusu commented 1 year ago

This will mainly be useful for blacklisted proper names that appear with suffixes. With this rule, blacklisting only the stem word (e.g. a person's name) should be enough.

If specified, the code splits words at the given characters to get the stem words and checks those against the blacklist again, e.g. it prevents "Rust's" from passing if "Rust" is in the blacklist.

It is a simple regex of separators. For example, for apostrophes, you specify stem_separator_regex = "[']" in the rule file.

If you do not specify it, or set it to "" or "[]", it will not be triggered.

It runs after the initial blacklist check is done and only checks the stem words extracted with stem_separator_regex against the blacklist.
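
The behaviour described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from this PR: the function names `stems` and `passes_blacklist`, the whitespace tokenization, and the exact-match, case-sensitive comparison are all assumptions made for the example.

```python
import re


def stems(word, stem_separator_regex):
    """Split a word at the separators and return the non-empty pieces."""
    # Assumption: an unset, empty, or "[]" pattern disables the rule.
    if stem_separator_regex in (None, "", "[]"):
        return []
    return [part for part in re.split(stem_separator_regex, word) if part]


def passes_blacklist(sentence, blacklist, stem_separator_regex=None):
    """Return True if the sentence contains no blacklisted word or stem."""
    # Assumption: words are separated by whitespace.
    words = sentence.split()

    # Initial check: whole words against the blacklist.
    if any(word in blacklist for word in words):
        return False

    # Stem check: runs only after the initial check has passed.
    for word in words:
        if any(stem in blacklist for stem in stems(word, stem_separator_regex)):
            return False
    return True


blacklist = {"Rust"}
print(passes_blacklist("I like Rust's compiler", blacklist))         # rule not set
print(passes_blacklist("I like Rust's compiler", blacklist, "[']"))  # rule set
```

Without the rule, "Rust's" does not match the blacklist entry "Rust" and the sentence passes; with `"[']"` the word is split at the apostrophe, the piece "Rust" is found in the blacklist, and the sentence is rejected.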

MichaelKohler commented 1 year ago

Thanks for the PR, I will have a look at it in the next few days.

HarikalarKutusu commented 1 year ago

Actually, thank YOU! It took too much of your time but resolved a (for me) major issue.