common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences
52 stars 51 forks source link

Use best practice in English rules file #156

Closed MichaelKohler closed 3 years ago

MichaelKohler commented 3 years ago

Over the past years there were quite a few changes to the possible rule configurations. The English rules file still uses some older rules that should now be replaced by better approaches, such as replacing disallowed_symbols with allowed_symbols.

As most new rule files get copied from EN, this would improve overall quality quite substantially.

stefangrotz commented 3 years ago

I agree, especially the documentation in the comments that exists in other files, like the German one, would be very useful.

As you said, I think as a start we should switch from disallowed symbols to allowed_symbols_regex, but we should also collect common English abbreviations, that we want to filter. (like "e.g.", "Mr." "Mrs." "Prof.",...). I found a few useful lists about this and I hope that I can write a proposal in the coming days.

Plus it might be good to take a native English speaker into the loop ;)

MichaelKohler commented 3 years ago

Initial PR: https://github.com/common-voice/cv-sentence-extractor/pull/159/files

@stefangrotz apart from the abbreviations, do you have any other ideas?

stefangrotz commented 3 years ago

I would definitely add

matching_symbols = [
  ["„", "“"],
  ["(", ")"]
]

and maybe also even_symbols

MichaelKohler commented 3 years ago

Nice, that's already done :)

stefangrotz commented 3 years ago

oh, my bad...

I found this explanation in the German file very useful and adapted it a little bit:

# Other patterns
#   - Century and others at the beginning of the sentence (circumvents wrongly splitted sentences from WikiExtractor such as "During the 3. Century ..." leading to two incomplete sentences)
#   - Sentence delimiter can only be at the end of a sentence
#   - No words with only one letter (" a.", " a", " a ", "a ")
#   - Mixed upper/lowercase in words (LaSi - mostly chemical elements?)
other_patterns = [
  "^(Century|Other Example)",
  "[\\.|\\?|!].+$",
  "(\\s[A-Za-z]{1}[\\.|\\?|!]*$)|(^[A-Za-z]{1}\\s)|\\s[A-Za-z]{1}\\s",
  "[a-z][A-Z][a-z]",
]

One letter words should be allowed in English, though. So we should delete this line or comment it out.

MichaelKohler commented 3 years ago

Thanks, I've incorporated half of it.

"^(Century|Other Example)"

.. is AFAICS not used in favor of 3rd etc. If you have others I'm happy to add those though.

(^[A-Za-z]{1}\s)

.. would discard all sentences starting with the article "a", such as "A brown fox jumps over the lazy dog", or any sentence starting with the personal pronon "I", such as "I went to school". Excluding A and I explicitely could make sense, but I'm really not sure if that's actually an issue in general.

(\s[A-Za-z]{1}[\.|\?|!]*$)

This would exclude sentences that end in "I", but that IMHO isn't a problem.

MichaelKohler commented 3 years ago

Generally, do you think it would make sense to think about possible abbreviation replacements through replacements?

stefangrotz commented 3 years ago

I believe that this could be a good thing, since sometimes a word only exists in the form of abbreviations. People rarely write things like "Mister", so replacing "Mr." could improve the dataset. I also believe that this could be extremely beneficial to small languages because replacing abbreviations could also increase the number of extracted sentences.

MichaelKohler commented 3 years ago

I've added the replacements. I've also not seen any instances of lowercase sentence starts, so I disallowed that as well.

Any other ideas? If not, I'd say we can do an official review and then create a new blocklist (I would do that outside of this PR though).

MichaelKohler commented 3 years ago

Let's do the blocklist in this PR as well, then we only need to verify it once. I'll play around with a few numbers.

/action blocklist en 80

github-actions[bot] commented 3 years ago

Job started: https://github.com/common-voice/cv-sentence-extractor/actions/runs/1375360327

stefangrotz commented 3 years ago

The job failed during the step "generate blocklist", but I don't see why.

MichaelKohler commented 3 years ago

Took too long and GitHub killed it (there is a limit). I'm running it locally now.

MichaelKohler commented 3 years ago

@stefangrotz I did a review and the error rate is looking good. Can you have a look at https://github.com/common-voice/cv-sentence-extractor/pull/162 and tell me if you have any other suggestions?