common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences

Added allowed_symbols_regex, removed disallowed_symbols EO rules #158

Closed stefangrotz closed 3 years ago

stefangrotz commented 3 years ago

Update for the Esperanto rule file to prepare a rerun of the wiki extraction. It now uses a list of allowed symbols instead of a long list of disallowed symbols. The goal is to avoid having any symbols in the dataset that are not part of the Esperanto alphabet.

The regex for the allowed symbols is now:

allowed_symbols_regex="[AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpRrSsTtUuVvZzĉĈĝĜĥĤĵĴŝŜŭŬ;‚'–\. \"\?!«»]"
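To illustrate the idea behind an allowed-symbols rule: a sentence passes only if every single character matches the allowed character class. This is a minimal Python sketch of that check (the actual extractor is written in Rust and its rule engine works differently in detail; the function name here is hypothetical):

```python
import re

# The allowed character class from the Esperanto rule file above.
# Note: only the Esperanto alphabet plus a small set of punctuation
# marks is permitted; e.g. a plain ASCII comma "," is NOT in the set.
ALLOWED = re.compile(
    r"[AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpRrSsTtUuVvZz"
    r"ĉĈĝĜĥĤĵĴŝŜŭŬ;‚'–\. \"\?!«»]"
)

def sentence_allowed(sentence: str) -> bool:
    """Illustrative check: every character must match the allowed class."""
    return all(ALLOWED.fullmatch(ch) for ch in sentence)

print(sentence_allowed("Ĉu vi parolas Esperanton?"))  # True
print(sentence_allowed("Saluton, mondo!"))            # False: "," not allowed
```

Compared to a disallowed-symbols list, this whitelist approach rejects any unexpected character by default, which is why non-alphabet letters can no longer slip through.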

I will ask around so that people review this PR. @KuboF, would you like to help here?

stefangrotz commented 3 years ago

Here is the sheet with a few hundred random sentences from the extract: https://docs.google.com/spreadsheets/d/1RfGlXDw2_HwbhzZXJnjqoLAs1kXdq1MHKgrIwvK8_Dw/edit?usp=sharing

The complete extract contains around 13,000 sentences. A first look is promising: most errors involve foreign words, but nothing close to the error threshold so far.

stefangrotz commented 3 years ago

I did a quick review and got around a 5% error rate. Most errors are foreign words with irregular spellings, but people will still be able to pronounce them. The quality looks a little better than the last extraction, mainly because the non-alphabet letters are gone.

MichaelKohler commented 3 years ago

Thanks for your work on this. I think a certain number of foreign words is expected, as it's Wikipedia and some words simply appear more often than the initial blocklist threshold. So I would say this is all good.

I'm merging this and will start a re-run. I will CC you on the resulting PR in the CV repo.

stefangrotz commented 3 years ago

Your manually triggered extraction has significantly fewer sentences than this example run: only 7,600 sentences. Any idea why this is happening?

MichaelKohler commented 3 years ago

The sample extraction on a PR exists to provide sample output as fast as possible for validation. It doesn't check whether previous extractions have been done, and it also doesn't apply the 3-sentences-per-article rule. It's basically "apply the rules to as many sentences as possible within 30 seconds". That's also why the sample can't be used as an actual extract.

The re-run, however, applies the 3-sentences-per-article rule and only runs on articles created since the last extract (end of 2019).
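The per-article cap described above can be sketched roughly like this (an illustrative Python fragment with hypothetical names; the real extractor implements this in Rust):

```python
from collections import defaultdict

def cap_sentences_per_article(candidates, max_per_article=3):
    """candidates: iterable of (article_id, sentence) pairs.

    Keeps at most `max_per_article` sentences from any single article,
    preserving the order in which candidates are seen. The PR sample
    run skips this cap, which is one reason it yields more sentences.
    """
    counts = defaultdict(int)
    kept = []
    for article_id, sentence in candidates:
        if counts[article_id] < max_per_article:
            counts[article_id] += 1
            kept.append((article_id, sentence))
    return kept

sample = [("a", f"s{i}") for i in range(5)] + [("b", "t0")]
print(len(cap_sentences_per_article(sample)))  # 4: three from "a", one from "b"
```

Combined with restricting the run to articles created since the last extract, this cap explains why the full re-run produces fewer sentences than the uncapped 30-second sample.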

Given how many sentences we got for English, 7,600 sounds reasonable to me. I hope that explains it well enough?

stefangrotz commented 3 years ago

Thanks, I wasn't aware of that. It's an interesting effect that the quick run produces more sentences than the slow one ;)