stefangrotz closed this issue 3 years ago
Here is the sheet with a few hundred random sentences from the extract: https://docs.google.com/spreadsheets/d/1RfGlXDw2_HwbhzZXJnjqoLAs1kXdq1MHKgrIwvK8_Dw/edit?usp=sharing
The complete extract contains around 13 000 sentences. A quick first look is promising: the errors are mainly about foreign words, but nothing close to the error threshold so far.
I did a quick review and found an error rate of around 5%. Most errors are foreign words with irregular spelling, but people will still be able to pronounce them. The quality looks slightly better than the last extraction, mainly because the non-alphabet letters are now gone.
Thanks for your work on this. Some foreign words are expected since this is Wikipedia, and certain words simply appear more often than the initial blocklist threshold. So I would say this is all good.
I'm merging this and will start a re-run. I will CC you on the resulting PR in the CV repo.
Your manually triggered extraction produced significantly fewer sentences than this sample run: just 7600 sentences. Any idea why this is happening?
The sample extraction on a PR exists to provide a sample output as fast as possible for validation. It doesn't check at all whether previous extractions have been done. For example, it also doesn't apply the 3-sentences-per-article rule. Basically, it's "apply the rules to as many sentences as possible within 30 seconds". That's also why the sample can't be used as an actual extract.
The re-run, however, applies the 3-sentences-per-article rule and only runs on articles created since the last extract (end of 2019).
Given how many sentences we got for English, 7600 sounds reasonable to me. Hope that explains it well enough?
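The per-article cap described above can be sketched roughly as follows. This is a minimal illustration, not the actual extractor code; the function name, input shape, and random sampling strategy are all assumptions:

```python
import random

def cap_sentences_per_article(articles, max_per_article=3, seed=0):
    """For each article, keep at most `max_per_article` sentences.

    `articles` is assumed to be a list of lists of candidate sentences,
    one inner list per article (a hypothetical input shape).
    """
    rng = random.Random(seed)
    kept = []
    for sentences in articles:
        if len(sentences) <= max_per_article:
            # Fewer candidates than the cap: keep them all.
            kept.extend(sentences)
        else:
            # Otherwise pick at most 3 at random.
            kept.extend(rng.sample(sentences, max_per_article))
    return kept

articles = [
    ["s1", "s2", "s3", "s4", "s5"],  # capped to 3
    ["a1", "a2"],                    # under the cap, all kept
]
print(len(cap_sentences_per_article(articles)))  # 5
```

This also shows why the capped re-run can yield fewer sentences than the uncapped 30-second sample: articles with many valid sentences contribute at most 3 each.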
Thanks, I wasn't aware of that. It's an interesting effect that the quick run produces more sentences than the slow one ;)
Update of the Esperanto rule file to prepare the rerun of the wiki extraction. It now uses a list of allowed symbols instead of a long list of disallowed symbols. The goal is to avoid having any symbols in the dataset that are not part of the alphabet.
The regex for the allowed symbols is now:
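The actual regex isn't reproduced here, but an allowed-symbols filter along these lines could look as follows. This is only a sketch: it uses the 28-letter Esperanto alphabet (a–z without q, w, x, y, plus ĉ ĝ ĥ ĵ ŝ ŭ), and the digits and punctuation included are assumptions for illustration, not the rule file's exact set:

```python
import re

# Esperanto letters (both cases), digits, space, and some basic punctuation.
# The punctuation choices here are assumed, not taken from the rule file.
ALLOWED = re.compile(r"^[a-pr-vzĉĝĥĵŝŭA-PR-VZĈĜĤĴŜŬ0-9 .,;:!?'\"()-]+$")

def is_allowed(sentence):
    """Return True if the sentence contains only allowed symbols."""
    return bool(ALLOWED.fullmatch(sentence))

print(is_allowed("Ĉu vi parolas Esperanton?"))       # True
print(is_allowed("Wikipedia estas enciklopedio."))   # False: 'W' is not in the alphabet
```

An allow-list like this automatically rejects any symbol nobody thought to blocklist, which is the point of the change.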
I will ask around so that people review this PR. @KuboF, would you like to help here?