common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences

Question: How is the result quality? #178

Closed HarikalarKutusu closed 2 years ago

HarikalarKutusu commented 2 years ago

Until now I have kept away from Wiki* and thus from this code, but I'm running out of resources (sigh)...

I scanned some random samples from the Turkish "Vikipedi" and found that many of them are off-topic for Common Voice: they contain many foreign names and chemical substance names, or they are short entries that just give a list (e.g. a football player's games), etc.

Here, I see many tools such as blacklists and/or vocabulary lists, but as far as I can see they would need a considerable investment of time and trial and error to produce good results.

We have around 500k entries on Vikipedi, which could result in about 1.5 M sentences, but scanning them is nearly impossible with our current manpower... And if quality sentences come out, that would solve half of our problems for years to come.

I want to hear from those who used this process:

MichaelKohler commented 2 years ago

Thanks for bringing this up.

Which parts of the rules are most important?

I would suggest having a look at the DE rules file. We spent quite some time improving it recently and I think it's in quite a good state these days. It also uses the best practices.

Can you change a "bad sentence" with a better one and/or exclude that sentence?

No, because we need to guarantee the legal constraints around this. The official script will be run once the rules file is merged, so no changes are possible.

Overall we want an error rate below 5%, so the quality should not be too bad. Some complicated sentences will still slip through, but in most cases this is outweighed by the benefit of having quite a lot of new sentences. This can also be tweaked with the blocklist, depending on how many occurrences you set as the threshold. It needs quite some time to get right, but it is worth it.
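To illustrate the idea of an occurrence threshold for the blocklist, here is a minimal Python sketch. It is not the tooling used by this repository: the function names `build_blocklist` and `filter_sentences`, the `min_occurrences` parameter, and the simple `\w+` tokenization are all assumptions made for illustration. The sketch treats every word seen fewer than `min_occurrences` times as rare and drops any sentence that contains such a word.

```python
import re
from collections import Counter


def tokenize(sentence):
    # Hypothetical tokenizer: lowercase, then take runs of word characters.
    return re.findall(r"\w+", sentence.lower())


def build_blocklist(sentences, min_occurrences):
    # Count every word across the corpus and block the rare ones.
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return {word for word, n in counts.items() if n < min_occurrences}


def filter_sentences(sentences, blocklist):
    # Keep only sentences that contain no blocked word.
    return [s for s in sentences if set(tokenize(s)).isdisjoint(blocklist)]


sample = [
    "The cat sat on the mat.",
    "The dog sat on the mat.",
    "The cat and the dog sat.",
    "Dimethylformamide sat on the mat.",
]
blocklist = build_blocklist(sample, min_occurrences=2)
kept = filter_sentences(sample, blocklist)
print(sorted(blocklist))
print(kept)
```

On a corpus this small even a common word like "and" ends up blocked, which is why the threshold probably needs a few rounds of tuning against a realistic sample before the result is good enough.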

For the rest, I would suggest posting this on Discourse, as you might have more people looking at it there: https://discourse.mozilla.org/c/voice/239 (note that Discourse currently has issues, so it might not load, but they are working on a fix). Additionally, I'd like to keep the issues here specifically about improvements or bugs. I will also post the answer above in your Discourse topic once it is created.

Thanks again!