common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences

Some wikicode ignored in sentences derived from Wikipedia #154

Closed fabricebg closed 3 years ago

fabricebg commented 3 years ago

In French (at least), there are numerous errors where numerical entities are dropped from sentences. For instance, I noticed the following after listening to about 200 sentences:

Elle se rencontre vers * d'altitude dans la Sierra Nevada de Santa Marta.
Megrahi est hospitalisé le * et meurt le * à la suite de sa maladie.
L'album s'est vendu à environ * exemplaires et est certifié disque de platine.
Le nombre de gardes rouges capturés s'élève à * environ dont * femmes et enfants.
Ustersbach est située dans le Parc naturel d'Augsbourg-Westliche Wälder, à * à l'ouest d'Augsbourg.
Il est né le * à Hlohovec en Slovaquie..
La tournée commença à Glasgow en Écosse le * et se termina le * en Irlande.

or in English:

This story was first reported on * in Go Tenpin magazine.

where each * marks a missing numerical value or date.

Most of the time, this can be traced back to wikicode that is ignored in the Wikipedia source text, e.g. for the first example above:

Elle se rencontre vers {{unité|2000|m}} d'altitude dans la Sierra Nevada de Santa Marta.

Dates, distances, heights, etc. are thus omitted from the sentences. This produces text that is hard to read and is read aloud incorrectly. It may ultimately corrupt a number of examples, and/or may hurt training on numbers and dates that are missing from the dataset.
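
For illustration, here is a minimal sketch (not part of WikiExtractor or the cv-sentence-extractor; the template name and regex only cover the simple `{{unité|value|unit}}` case as an assumption) of how such a template could be expanded before extraction so the value is kept:

```python
import re

# Hypothetical pre-processing step: expand the French {{unité|value|unit}}
# template into plain text before sentence extraction, so the number is
# kept instead of being dropped. Only covers the simple two-argument form.
UNIT_TEMPLATE = re.compile(r"\{\{unité\|([^|{}]+)\|([^|{}]+)\}\}")

def expand_unit_templates(wikicode: str) -> str:
    # "{{unité|2000|m}}" -> "2000 m"
    return UNIT_TEMPLATE.sub(r"\1 \2", wikicode)

print(expand_unit_templates(
    "Elle se rencontre vers {{unité|2000|m}} d'altitude dans la Sierra Nevada."
))
# Elle se rencontre vers 2000 m d'altitude dans la Sierra Nevada.
```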

ftyers commented 3 years ago

This issue should probably be copied to cv-sentence-extractor. But yes, that definitely looks like a problem.

MichaelKohler commented 3 years ago

I have transferred it to the extractor repository. Fabrice, thanks for filing this issue. There was a previous issue filed for this, so I'm tempted to mark this as a duplicate of #72. We are using the WikiExtractor under the hood, and there is an issue filed on that repository which goes into more detail: https://github.com/attardi/wikiextractor/issues/189#issuecomment-578710824.

So this is not something we can fix anytime soon. For bulk submissions, such as the Wikipedia extract, we require an error rate below 5% (see https://common-voice.github.io/community-playbook/sub_pages/text.html). This was applied to the Wikipedia exports as well. While I would love to have perfect extracts, this is simply never gonna happen :)

> and/or may hurt training on numbers and dates that are missing from the dataset.

If the template expansion had worked, we would have removed the sentence completely, as we do not allow numbers anyway.
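
As a minimal sketch of that rule (hypothetical helper, not the extractor's actual implementation), any sentence still containing a digit would be rejected:

```python
import re

# Hypothetical check mirroring the "no numbers" rule: sentences that still
# contain a digit are rejected, so even a correctly expanded
# "vers 2000 m d'altitude" would not have ended up in the export.
def contains_number(sentence: str) -> bool:
    return bool(re.search(r"\d", sentence))

assert contains_number("Elle se rencontre vers 2000 m d'altitude.")
assert not contains_number("Cette espèce est endémique de Colombie.")
```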

> But yes, that definitely looks like a problem.

Mind elaborating?