common-voice / common-voice

Common Voice is part of Mozilla's initiative to help teach machines how real people speak.
https://commonvoice.mozilla.org/
Mozilla Public License 2.0

[BUG] User providing the sentence, is requested to review their own entry #4076

Open skinkie opened 1 year ago

skinkie commented 1 year ago

Describe the bug

The original provider of a new sentence is asked to review their own entries. This seems counterintuitive.

To Reproduce

Steps to reproduce the behavior:

  1. Provide a new sentence for a language that has nothing to review
  2. Review new entries and notice that your own sentence is among them.

Expected behavior

The user who provided a sentence should not be asked to review it, similar to the audio reviews.
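The expected behavior can be sketched as a filter on the review queue. This is only an illustration, not Common Voice's actual code: the `Sentence` type, the `submitted_by` field, and the `reviewable_sentences` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    text: str
    submitted_by: str  # hypothetical contributor id, not a real Common Voice field


def reviewable_sentences(pending, reviewer_id):
    """Return the pending sentences that the reviewer did not submit."""
    return [s for s in pending if s.submitted_by != reviewer_id]


pending = [
    Sentence("De trein naar Zwolle vertrekt later.", submitted_by="alice"),
    Sentence("Wij fietsen morgen naar het strand.", submitted_by="bob"),
]

# alice is only offered bob's sentence, never her own.
for sentence in reviewable_sentences(pending, "alice"):
    print(sentence.text)
```

A filter like this only removes a contributor's own entries from their queue; it does not address votes cast from a second account.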

HarikalarKutusu commented 1 year ago

It has been the same in the original Sentence Collector. In my opinion, it is not a bug but a "feature" - but I also find it counter-intuitive... It really depends on the workflow the community uses:

The one hole in the process is a spammer who creates a second account, or enlists a friend, to give the two votes immediately. But that would remain a problem even if contributors could not vote on their own sentences: only one more account would be needed for the sentences to pass...

My 5c...

skinkie commented 1 year ago

@HarikalarKutusu when you mention "feed it to the sentence collector" is this still manual labor, or is there an API?

HarikalarKutusu commented 1 year ago

I think you are asking in relation to this post: https://discourse.mozilla.org/t/contributing-many-sentences-that-contain-not-yet-spoken-words/120185

No, there is no API and it is a manual job. In the old Sentence Collector, we could insert 5-10k sentences at once. In the new one (the write tab), you can currently add only one sentence at a time, but a multi-sentence feature is planned. The lack of that feature has disrupted our workflow for now.

But there is the bulk sentence posting feature, where you send the sentences and their sources as a table in a PR. Multiple people should check them, at least on a sample, and the error rate should be low...
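The sample-based pre-check mentioned above could look roughly like the sketch below. It assumes the table is tab-separated with `sentence` and `source` columns; that layout, the function names, and the example rows are assumptions made for illustration, not the format the bulk submission process prescribes.

```python
import csv
import io
import random


def draw_review_sample(tsv_text, sample_size, seed=0):
    """Pick a reproducible random sample of rows from a tab-separated table."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    rng = random.Random(seed)
    return rng.sample(rows, min(sample_size, len(rows)))


def error_rate(verdicts):
    """Fraction of sampled sentences that reviewers rejected (False = rejected)."""
    if not verdicts:
        raise ValueError("at least one verdict is needed")
    return sum(1 for ok in verdicts if not ok) / len(verdicts)


# Hypothetical table: the column names are assumed, not prescribed.
table = (
    "sentence\tsource\n"
    "De trein naar Zwolle vertrekt later.\tself-written\n"
    "Wij fietsen morgen naar het strand.\tself-written\n"
    "In Groningen regent het vandaag.\tself-written\n"
)

sample = draw_review_sample(table, sample_size=2)
print(len(sample))
print(error_rate([True, True, False, True]))
```

Fixing the seed keeps the sample reproducible, so several reviewers can check the same rows independently.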

For the infrequent vocabulary items you mention, I think you/your community should write meaningful sentences and use the process I mentioned (offline pre-checks). You cannot dump words of course, and for SoTA models it is advised to have recordings longer than 5 seconds, so longer sentences are better.

Other points to take care of (IMHO):

skinkie commented 1 year ago

What I have currently 'implemented' is a ChatGPT prompt using the Dutch open lexicon minus the words already used in the current Mozilla download. I am able to create a dictation assignment of 10-14 words that minimizes the number of sentences for that word list, has a mandatory toponym, and 'makes sense'. As a variant I have themed the dictation, for example "holiday". In my view this reduces one part of the effort and increases the diversity significantly, especially since the current set is heavily loaded with political statements, likely from parliament transcriptions.
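The "lexicon minus already used words" step is a plain set difference and needs no language model. A minimal sketch, assuming the lexicon is a list of words and the existing corpus is a list of sentences; the function names, the tokenization rule, and the sample data are all illustrative assumptions:

```python
import re

# Runs of letters, optionally joined by an apostrophe or hyphen (e.g. "zo'n").
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def words_in(sentences):
    """Collect the lowercased words that occur in the given sentences."""
    found = set()
    for sentence in sentences:
        found.update(word.lower() for word in WORD_RE.findall(sentence))
    return found


def uncovered_words(lexicon, sentences):
    """Return the lexicon words that no existing sentence uses, sorted."""
    return sorted({word.lower() for word in lexicon} - words_in(sentences))


# Illustrative data only.
lexicon = ["fiets", "Amsterdam", "vakantie"]
existing = ["Dit is Amsterdam."]
print(uncovered_words(lexicon, existing))
```

The resulting word list can guide people who write new sentences by hand, which stays within the human-written requirement discussed in this thread.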

Is there any way to download all the current sentences from the project?

HarikalarKutusu commented 1 year ago

If by assignment you mean an assignment to ChatGPT, please read this first: https://discourse.mozilla.org/t/i-think-its-time-to-talk-about-ai-generated-sentences-again/112685

> heavily loaded with political statements

Many European corpora used the EuroParl corpus at the start of the project.

> Is there any way to download all the current sentences from the project?

Yes, here they are, in your locale directory: https://github.com/common-voice/common-voice/tree/main/server/data

skinkie commented 1 year ago

> If by assignment you mean an assignment to ChatGPT, please read this first: https://discourse.mozilla.org/t/i-think-its-time-to-talk-about-ai-generated-sentences-again/112685

In Dutch schools, students are given a weekly assignment to practice writing and spelling. I have written a prompt that produces such sentences consistently.

> > heavily loaded with political statements
>
> Many European corpora used the EuroParl corpus at the start of the project.

It does not feel very diverse.

> > Is there any way to download all the current sentences from the project?
>
> Yes, here they are, in your locale directory: https://github.com/common-voice/common-voice/tree/main/server/data

Thanks, but this seems as dated as the corpus data: no new sentences for months. I don't buy it ;-) https://github.com/common-voice/common-voice/tree/main/server/data/nl

HarikalarKutusu commented 1 year ago

Please continue this discussion in your Discourse topic, as these points are not related to the issue in your first post. https://discourse.mozilla.org/t/contributing-many-sentences-that-contain-not-yet-spoken-words/120185

jessicarose commented 1 year ago

> What I have currently 'implemented' is a ChatGPT prompt using the Dutch open lexicon minus the words already used in the current Mozilla download. I am able to create a dictation assignment of 10-14 words that minimizes the number of sentences for that word list, has a mandatory toponym, and 'makes sense'. As a variant I have themed the dictation, for example "holiday". In my view this reduces one part of the effort and increases the diversity significantly, especially since the current set is heavily loaded with political statements, likely from parliament transcriptions.
>
> Is there any way to download all the current sentences from the project?

Hello! The Common Voice project is not currently able to accept AI-generated sentences due to a lack of clarity in their licensing, ethical concerns about origin datasets, and quality control issues.

While we continue to monitor and discuss policy around AI-generated text resources, we can currently only accept original sentences written by humans and contributed directly under CC0 licensing, or appropriately licensed existing works.

skinkie commented 1 year ago

@jessicarose if one were to use a large language model on CC0 source data only, would that be OK?

HarikalarKutusu commented 1 year ago

FYI, there is also this - freshly baked: https://arxiv.org/abs/2305.17493 (I know, not the same loop)

skinkie commented 1 year ago

> FYI, there is also this - freshly baked: https://arxiv.org/abs/2305.17493 (I know, not the same loop)

I think our dataset is too sparse to worry about this.