common-voice / sentence-collector

Tool to collect and review sentences for Common Voice
https://commonvoice.mozilla.org/sentence-collector/
Mozilla Public License 2.0

[WIP] additional lib/cleanup for French language to improve quality of inputs #635

Closed: CapitainFlam closed this 1 year ago

CapitainFlam commented 1 year ago

Better French input for better French output?

Hi! ⚠️ First commits on GitHub, first pull request ever in a repo, so please neither shoot nor shout at me 😅 ⚠️

This commit grew out of several discussions; I suggest having a look at them to understand the context.

It is also somewhat connected to:

TL;DR: I'm trying to improve input quality for the French Sentence Collector, to A/ avoid garbage in, garbage out, and B/ avoid having the CorporaCreator remove, at the very end, garbage that already went through collecting, recording and review, only to be dropped and left out of the final release used for training batches. Better to remove it as early as the Sentence Collector.

What have you done? 😱

To do so, I duplicated the EN validation file in the Sentence Collector as an FR file, and modified it based on earlier work done by Nicolas Panel for CorporaCreator (and sadly never committed).
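To give reviewers a rough idea of what a per-language validation file of this kind can look like, here is a minimal sketch in TypeScript. Everything in it is an assumption for illustration: the `Rule` shape, the `FRENCH_RULES` list, the `validateSentence` helper and the three regexes are hypothetical and are not the rules from this PR, nor the actual Sentence Collector or CorporaCreator code.

```typescript
// Hypothetical rule shape: a pattern that marks a sentence as invalid,
// plus the message shown to the contributor. Not the project's real API.
interface Rule {
  regex: RegExp;
  error: string;
}

// Illustrative French rules only (assumptions, not the rules from this PR).
const FRENCH_RULES: Rule[] = [
  // Digits are read aloud inconsistently ("3" vs "trois"), so reject them.
  { regex: /[0-9]/, error: 'Sentence should not contain digits' },
  // Symbols that have no obvious spoken form.
  { regex: /[<>+*#@%^\[\]()\/]/, error: 'Sentence should not contain symbols' },
  // Two or more consecutive capitals usually mean an acronym (e.g. "SNCF").
  { regex: /[A-ZÀ-ÖØ-Þ]{2,}/, error: 'Sentence should not contain acronyms' },
];

// Returns the error messages of every rule the sentence violates.
// An empty array means the sentence passed all rules.
function validateSentence(sentence: string, rules: Rule[] = FRENCH_RULES): string[] {
  return rules
    .filter((rule) => rule.regex.test(sentence))
    .map((rule) => rule.error);
}

console.log(validateSentence("Il fait beau aujourd'hui.")); // []
console.log(validateSentence('Il a 3 chats.')); // [ 'Sentence should not contain digits' ]
```

Rejecting a sentence at upload time with a list like this is what keeps it from being recorded and reviewed before it would be dropped anyway at export.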

As recommended in the discussion (links here... Did you follow them from the list above?!), I'm creating a WIP pull request, so that everyone can comment, throw tomatoes and/or add commits to it.

Footnote: it can be hard (...it IS hard!!!) to understand regex (REGular EXpressions). Do not hesitate to use https://regex101.com/ to understand and test them.

MichaelKohler commented 1 year ago

Also note https://discourse.mozilla.org/t/sentence-collector-cleanup-before-export-vs-cleanup-on-upload/105411, though that probably doesn't matter too much for this here :)

drzraf commented 1 year ago

Any hope to get this in, in one shape or another?

MichaelKohler commented 1 year ago

The Sentence Collector has now been integrated into the Common Voice platform, so I'm archiving this project. The validation files now live in the Common Voice repository, and I'm sure it would still benefit from these validation rules being added there: https://github.com/common-voice/common-voice/tree/main/server/src/core/sentences/validation. Unfortunately, moving this PR over there is much harder than manually recreating it. Would you mind creating a new PR for this? Sorry for the trouble.