remove empty Gutenberg books (or failing to be parsed)

common-voice / commonvoice-fr

Tooling for producing French dataset for Common Voice

remove empty Gutenberg books (or failing to be parsed) #148

Closed drzraf closed 3 years ago

drzraf commented 3 years ago

I guess project-gutenberg.py could be run again to collect more of the available books, but I don't know about the side effects of (re)importing them into the sentence collector.

I remember an old discussion (was it in this repo?) about whether fixing typos in sentences of the dataset would unlink/orphan existing clips already recorded and associated with the former sentence. What's the state of this?

lissyx commented 3 years ago

> I guess project-gutenberg.py could be run again to collect more of the available books, but I don't know about the side effects of (re)importing them into the sentence collector.

We might create duplicated strings. I guess we would need to coordinate with the (overloaded) Common Voice team to handle that somehow.
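To illustrate the duplicate concern, here is a minimal sketch of filtering freshly extracted sentences against ones that were already imported. The function names `normalize` and `new_sentences` are hypothetical; they are not part of project-gutenberg.py or the sentence collector, and the normalization rule (Unicode NFC plus whitespace collapsing) is only an assumption about what "the same string" could mean.

```python
import unicodedata


def normalize(sentence: str) -> str:
    """Apply Unicode NFC and collapse runs of whitespace (assumed equality rule)."""
    return " ".join(unicodedata.normalize("NFC", sentence).split())


def new_sentences(candidates, already_imported):
    """Return normalized candidates that are neither empty nor already known.

    Hypothetical helper: keeps first-seen order and drops repeats
    within the candidates themselves.
    """
    seen = {normalize(s) for s in already_imported}
    fresh = []
    for sentence in candidates:
        key = normalize(sentence)
        if key and key not in seen:
            seen.add(key)
            fresh.append(key)
    return fresh


if __name__ == "__main__":
    existing = ["Bonjour le monde.", "Il fait beau."]
    extracted = ["Bonjour  le monde.", "Il pleut.", "Il pleut.", "", "Il fait beau."]
    print(new_sentences(extracted, existing))
```

A filter like this only catches exact (normalized) matches. A sentence whose typo was corrected would still pass as new, so it would not avoid the coordination mentioned above.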

> I remember an old discussion (was it in this repo?) about whether fixing typos in sentences of the dataset would unlink/orphan existing clips already recorded and associated with the former sentence. What's the state of this?

We can't fix typos (without creating a lot of mess) on existing recordings, but it should be fine and new recordings should pick it up, although we don't really look for multiple recordings of the same sentence :).

drzraf commented 3 years ago

> We can't fix typos (without creating a lot of mess) on existing recordings, but it should be fine and new recordings should pick it up,

OK. Could you explain what happens when someone reports a sentence as wrong/buggy? How is the voice collector expected to react if it sees that sentence imported again? Is there a kind of blacklist?

> although we don't really look for multiple recordings of the same sentence :)

Are you referring to https://discourse.mozilla.org/t/single-sentence-record-limit-feature-release/62198 and https://discourse.mozilla.org/t/echantillons-identiques-valides-plusieurs-fois-par-une-meme-personne/58509/7?

lissyx commented 3 years ago

> OK. Could you explain what happens when someone reports a sentence as wrong/buggy?

It gets into the hands of some moderators, namely @hellosct1 and @Mozinet-fr. I'm not on the list.

> How is the voice collector expected to react if it sees that sentence imported again? Is there a kind of blacklist?

This question is better suited to the voice collector codebase, but as far as I recall from discussions with the team, it's just going to be seen as a new sentence (typo fixed), while the old sentence remains.