Closed fabricebg closed 3 years ago
This issue should probably be copied to cv-sentence-extractor. But yes, that definitely looks like a problem.
I have transferred it to the extractor repository. Fabrice, thanks for filing this issue. There was a previous issue filed for this, so I'm tempted to mark this as a duplicate of #72. We are using the WikiExtractor under the hood, and there is an issue filed on that repository which goes into more detail: https://github.com/attardi/wikiextractor/issues/189#issuecomment-578710824.
So this is not anything we can fix anytime soon. For bulk submissions, such as the Wikipedia extract, we require an error rate below 5% (see https://common-voice.github.io/community-playbook/sub_pages/text.html). This was applied to the Wikipedia exports as well. While I would love to have perfect extracts, this is simply never gonna happen :)
> and/or may perturb training on numbers and dates removed from the dataset.
If the template expansion had worked, we would have removed the sentence completely, as we do not allow numbers anyway.
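To make that point concrete, here is a minimal sketch of a "no numbers" filter. It is not the extractor's actual rule set; the function name `allowed` and the two sample sentences are made up for illustration. It shows why a sentence whose number was silently dropped slips through, while the intact sentence would have been rejected.

```python
import re

# Hypothetical rule: reject any sentence that contains a digit.
_DIGIT = re.compile(r"\d")


def allowed(sentence: str) -> bool:
    """Return True if the sentence contains no digits."""
    return _DIGIT.search(sentence) is None


# Illustrative sentences only, not taken from the dataset.
sentences = [
    "The bridge opened in 1932.",  # template expanded: contains a number
    "The bridge opened in .",      # template dropped: number is missing
]

# Only the damaged sentence passes the filter.
kept = [s for s in sentences if allowed(s)]
print(kept)
```

In other words, a digit check alone cannot catch these cases, because the thing it looks for is exactly what went missing.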
> But yes, that definitely looks like a problem.
Mind elaborating?
In French (at least), there are numerous errors where numerical entities are dropped from sentences. For instance, I noticed the following after listening to about 200 sentences:
or in English:
where the * marks missing numerical values or dates. Most of the time, this can be traced back to wikicode that is ignored in the Wikipedia text, e.g. for the 1st example above
Dates, distances, heights, etc. are thus omitted from the sentences. This creates text that is hard to read and improperly read aloud. It may ultimately corrupt a number of examples, and/or may perturb training on numbers and dates removed from the dataset.
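As a rough illustration of the mechanism, the sketch below strips `{{...}}` templates from wikitext instead of expanding them. This is a simplified stand-in, not WikiExtractor's actual code; `strip_templates` and the sample sentence are hypothetical, and `{{convert}}` is used only as a typical template that carries a measurement.

```python
import re

# Matches an innermost template: "{{" ... "}}" with no nested braces inside.
_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")


def strip_templates(wikitext: str) -> str:
    """Remove templates without expanding them (simplified assumption)."""
    previous = None
    # Repeat until nothing changes, so nested templates are removed too.
    while previous != wikitext:
        previous = wikitext
        wikitext = _TEMPLATE.sub("", wikitext)
    return wikitext


# Hypothetical wikitext sentence with a measurement inside a template.
source = "The tower is {{convert|324|m|ft}} tall."
print(strip_templates(source))
```

The height disappears along with the template, leaving a sentence that still looks grammatical enough to pass simple checks but no longer makes sense when read aloud.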