common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences

Disallowed words, boundaries, case #41

Open jaumeortola opened 5 years ago

jaumeortola commented 5 years ago

https://github.com/Common-Voice/common-voice-wiki-scraper/blob/a23abced7713c2260f78fc77252727fe719d6eca/src/checker.rs#L37

Here you split words only on white space. You should use word boundaries instead (in regexp: \b, or something equivalent such as a list of common separators). Otherwise, a disallowed word goes undetected in many contexts. For example, I have the word oster-monath in the disallowed words file, but in a sentence it may appear between quotation marks ("oster-monath") or next to a comma (oster-monath,), and then it is not detected.
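
A minimal Rust sketch of the failure mode (standalone and illustrative; the names and the boundary-splitting approach are my assumptions, not the project's actual code):

    use std::collections::HashSet;

    fn main() {
        let disallowed: HashSet<&str> = ["oster-monath"].iter().copied().collect();
        let sentence = "They called it \"oster-monath\", the Easter month.";

        // Splitting only on white space keeps the quotes and the comma
        // attached to the token, so the blacklisted word slips through:
        let found = sentence.split_whitespace().any(|w| disallowed.contains(w));
        assert!(!found); // "\"oster-monath\"," != "oster-monath"

        // Splitting on anything that is not a word character (keeping
        // inner hyphens) behaves like \b and does detect it:
        let found = sentence
            .split(|c: char| !(c.is_alphanumeric() || c == '-'))
            .any(|w| disallowed.contains(w));
        assert!(found);
    }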

https://github.com/Common-Voice/common-voice-wiki-scraper/blob/a23abced7713c2260f78fc77252727fe719d6eca/src/checker.rs#L42

Case is very important for spelling in most languages. I think the disallowed words should be case-sensitive. Case-insensitive matching is sometimes used in NLP, but not in spell-checking! It could be made optional for each language.

jaumeortola commented 5 years ago

The word tokenization issue is a bit more complicated than I said, especially if you want it to be useful for all languages.

The most important thing is that the word tokenization used here must be exactly the same as the one used when generating the disallowed words list. (In addition, if different kinds of apostrophes are normalized during blacklist generation, the same must be done here when extracting the sentences.)

I see these possibilities:

  1. Use only white space as a separator. Blacklist generation will be more complex and will require more language expertise, but it is still feasible.
  2. Use white space plus a few other common separators (comma, semicolon, quotation marks, question marks...), but not hyphens and apostrophes.
  3. Use word boundaries (\b in regular expressions). This is probably too greedy and not good for some languages.

I would choose option 2. The list of separators could be a configuration option for each language.

Hyphens and apostrophes should be considered separators only when they are next to another separator. To implement this, separators would have to be multicharacter strings, not just single characters (see the sketch below).
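
A sketch of option 2 in Rust, assuming a per-language separator list; here the "only next to other separators" rule for hyphens and apostrophes is approximated by trimming them at token edges rather than by literal multicharacter separator strings:

    /// Tokenize on white space plus configurable separators (option 2).
    /// Apostrophes and hyphens are stripped only at the token edges,
    /// i.e. where they touch another separator, never inside a word.
    fn tokenize<'a>(sentence: &'a str, separators: &[char]) -> Vec<&'a str> {
        sentence
            .split(|c: char| c.is_whitespace() || separators.contains(&c))
            .map(|t| t.trim_matches(|c: char| c == '\'' || c == '-'))
            .filter(|t| !t.is_empty())
            .collect()
    }

    fn main() {
        // Hypothetical per-language configuration value.
        let separators = [',', ';', '"', '?', '!', '.'];
        let tokens = tokenize("l'\"oster-monath\", they said", &separators);
        assert!(tokens.contains(&"oster-monath"));
    }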

Tell me what you decide, and I will redo the blacklist accordingly.

jaumeortola commented 5 years ago

We need to clarify the issue of word tokenization. I will make a concrete proposal.

Add a setting to the language configuration file with replacements to be made (in order) before splitting sentences on white space. To be general enough, the replacements should be regular expressions. Apostrophe replacements can be included here. For example:

replacements = [
    [ "’", "'" ],
    [ "[,;?!.]", " " ],
    [ "'[,;?!.]", " " ],
    [ "^['-]", "" ],
]

These replacements should be made exactly the same way when extracting sentences and when creating the blacklist.
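
A sketch in Rust of applying such a list in order with the regex crate; the function name and config shape are illustrative, and the quote pattern is added here only for the demo:

    use regex::Regex;

    /// Apply the configured (pattern, replacement) pairs in order, then
    /// split on white space. The very same function has to run on both
    /// the sentence-extraction side and the blacklist-generation side.
    fn normalize_and_split(sentence: &str, replacements: &[(&str, &str)]) -> Vec<String> {
        let mut text = sentence.to_string();
        for (pattern, replacement) in replacements {
            // In real code, compile each regex once, not per sentence.
            let re = Regex::new(pattern).expect("invalid replacement regex");
            text = re.replace_all(&text, *replacement).into_owned();
        }
        text.split_whitespace().map(str::to_string).collect()
    }

    fn main() {
        let replacements = [("’", "'"), ("[\"“”]", " "), ("[,;?!.]", " ")];
        let words = normalize_and_split("l'“oster-monath”, they said", &replacements);
        assert!(words.contains(&"oster-monath".to_string()));
    }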

The words in the blacklist should be case-sensitive. Or at least make it optional:

blacklist_caseinsensitive = false
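
A minimal sketch of honouring such a flag (the field name follows the proposal above; the surrounding function is illustrative and assumes the blacklist was lowercased at load time when the flag is set):

    use std::collections::HashSet;

    fn is_disallowed(word: &str, blacklist: &HashSet<String>, case_insensitive: bool) -> bool {
        if case_insensitive {
            // Assumes the blacklist itself was lowercased when loaded.
            blacklist.contains(&word.to_lowercase())
        } else {
            blacklist.contains(word)
        }
    }
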
jaumeortola commented 5 years ago

Looking again into the word tokenization issue, I find that the problem case was not "oster-monath" or oster-monath, but l'"oster-monath", which is not detected if the word in the blacklist is just oster-monath. Anyway, this kind of problem will be solved by using, as I said, exactly the same tokenization and the same replacements when creating the blacklist and when selecting sentences.