jaumeortola opened 5 years ago
The word tokenization issue is a bit more complicated than I said, especially if you want it to be useful for all languages.
The most important thing is that the word tokenization used here must be exactly the same as the one used when generating the disallowed-words list. (In addition, if different kinds of apostrophes are normalized during blacklist generation, the same conversion must be done here when extracting the sentences.)
I see these possibilities:
I would choose option 2. The list of separators could be a configuration option for each language.
Hyphens and apostrophes should be considered separators only when they are next to other separators. To implement this, separators should be multicharacter strings, not just single characters.
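A minimal sketch of this idea (all names hypothetical, not the scraper's actual API): replace every configured multi-character separator with a space, then split on white space. Because each separator includes its neighbouring punctuation or space, a hyphen or apostrophe inside a word is left alone.

```rust
// Sketch of option 2 with multi-character separators (hypothetical helper,
// not the project's real code). Padding the sentence with spaces lets
// separators like ". " also match at the very start and end of the string.
fn tokenize(sentence: &str, separators: &[&str]) -> Vec<String> {
    let mut s = format!(" {} ", sentence);
    for sep in separators {
        // Each separator is replaced by a plain space; a hyphen or an
        // apostrophe on its own is never in the list, so a word like
        // "oster-monath" stays intact.
        s = s.replace(sep, " ");
    }
    s.split_whitespace().map(|w| w.to_string()).collect()
}

fn main() {
    // Comma, full stop, and quotation marks count as separators only
    // together with an adjacent space.
    let seps = [", ", ". ", "\" ", " \""];
    let words = tokenize("He said \"oster-monath\", then left.", &seps);
    println!("{:?}", words); // the hyphenated word survives as one token
}
```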
Tell me what you decide, and I will redo the blacklist accordingly.
We need to clarify the issue of word tokenization. I will make a concrete proposal.
Add to the language configuration file a setting with replacements to be made (in order) before splitting sentences at white spaces. To be general enough, the replacements must be regular expressions. Apostrophe replacements can be included here. For example:
replacements = [
    ["’", "'"],
    ["[,;?!.]", " "],
    ["'[,;?!.]", " "],
    ["^['-]", ""],
]
These replacements should be made exactly the same way when extracting sentences and when creating the blacklist.
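To illustrate the ordering requirement, here is a toy sketch (hypothetical names; the proposal above uses regular expressions, while this dependency-free version supports only literal patterns). The point is that the same ordered pass must run both in the blacklist builder and in the sentence extractor.

```rust
// Toy version of the proposed replacement pass (literal patterns only;
// a real implementation would compile each pattern with a regex engine).
fn apply_replacements(text: &str, replacements: &[(&str, &str)]) -> String {
    // Order matters: e.g. the typographic apostrophe must be normalized
    // before any later rule that looks for the plain ASCII apostrophe.
    replacements
        .iter()
        .fold(text.to_string(), |acc, (pat, rep)| acc.replace(pat, rep))
}

fn main() {
    let rules = [("’", "'"), (",", " "), (".", " ")];
    let normalized = apply_replacements("l’amic, va dir.", &rules);
    let words: Vec<&str> = normalized.split_whitespace().collect();
    println!("{:?}", words);
}
```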
The words in the blacklist should be matched case-sensitively. Or at least make it optional:
blacklist_caseinsensitive = false
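A sketch of how such a flag could be honoured (hypothetical function and names, not the scraper's actual code):

```rust
use std::collections::HashSet;

// Hypothetical check honouring a per-language case-sensitivity flag.
// When the flag is on, the blacklist itself must have been lower-cased
// at load time, so both sides of the comparison are folded the same way.
fn is_blacklisted(word: &str, blacklist: &HashSet<String>, case_insensitive: bool) -> bool {
    if case_insensitive {
        blacklist.contains(&word.to_lowercase())
    } else {
        blacklist.contains(word)
    }
}

fn main() {
    let blacklist: HashSet<String> = ["oster-monath".to_string()].into_iter().collect();
    // Case-sensitive (the proposed default): "Oster-Monath" is allowed through.
    println!("{}", is_blacklisted("Oster-Monath", &blacklist, false)); // false
    println!("{}", is_blacklisted("Oster-Monath", &blacklist, true));  // true
}
```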
Looking again into the word tokenization issue, I find that the problem was not `"oster-monath"` or `oster-monath,` but `l'"oster-monath"`, which is not detected if the word in the blacklist is just `oster-monath`.
Anyway, this kind of problem will be solved by using, as I said, exactly the same tokenization and the same replacements when creating the blacklist and when selecting sentences.
https://github.com/Common-Voice/common-voice-wiki-scraper/blob/a23abced7713c2260f78fc77252727fe719d6eca/src/checker.rs#L37
Here you split words only around white space. You should use word boundaries instead (in a regexp: `\b`, or something equivalent such as a list of common separators). Otherwise the word is not detected in many contexts. For example, I have the word `oster-monath` in the disallowed-words file, but in a sentence it appears between quotation marks (`"oster-monath"`) or next to a comma (`oster-monath,`) and it is not detected.

https://github.com/Common-Voice/common-voice-wiki-scraper/blob/a23abced7713c2260f78fc77252727fe719d6eca/src/checker.rs#L42
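A regex-free way to get roughly the effect of `\b` (a hypothetical sketch, not the scraper's actual code): search for the blacklisted word as a substring and accept a hit only when the characters around it are not letters or digits, so quotes and commas no longer hide the word.

```rust
// Boundary-aware matching without a regex engine: a hit counts only if
// the characters immediately before and after the match are not
// alphanumeric (approximately what `\b` checks in a regex).
fn contains_word(sentence: &str, word: &str) -> bool {
    let mut from = 0;
    while let Some(pos) = sentence[from..].find(word) {
        let begin = from + pos;
        let end = begin + word.len();
        let ok_before = sentence[..begin]
            .chars()
            .next_back()
            .map_or(true, |c| !c.is_alphanumeric());
        let ok_after = sentence[end..]
            .chars()
            .next()
            .map_or(true, |c| !c.is_alphanumeric());
        if ok_before && ok_after {
            return true;
        }
        // Not a whole-word hit; keep searching after this occurrence.
        from = end;
    }
    false
}

fn main() {
    let w = "oster-monath";
    println!("{}", contains_word("l'\"oster-monath\"", w));   // true
    println!("{}", contains_word("the oster-monaths here", w)); // false
}
```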
Case is very important for spelling in most languages. I think the disallowed words should be matched case-sensitively. Case-insensitive matching is sometimes used in NLP, but not in spell-checking! It could be made optional for each language.