Open skinkie opened 1 year ago
It has been the same in the original Sentence Collector. In my opinion, it is not a bug but a "feature" - but I also find it counter-intuitive... It really depends on the workflow the community uses:
One hole in the process is a spammer who creates another account, or has a friend, to give the two votes immediately. But that would remain a problem even if he/she could not vote on their own sentences: only one more account would be needed for these to pass...
My 5c...
@HarikalarKutusu when you mention "feed it to the sentence collector" is this still manual labor, or is there an API?
I think you are asking in relation to this post: https://discourse.mozilla.org/t/contributing-many-sentences-that-contain-not-yet-spoken-words/120185
No, there is no API and it is a manual job. In the old Sentence Collector, we could insert 5-10k sentences at once. In the new one (the write tab), currently you can add a single one, but a multi-sentence feature is planned. The lack of that feature disrupted our workflow for now.
But there is the bulk sentence posting feature, where you send the sentences and their sources as a table in a PR. Multiple people should check them, at least on a sample, and the error rate should be low...
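The sample-based check mentioned above could be sketched roughly as follows. This is only an illustration, not part of any Common Voice tooling: the function names, the sample size of 100, and the reviewer verdicts are all assumptions made for the example.

```python
import random

def draw_review_sample(sentences, sample_size, seed=0):
    """Return a reproducible random sample of sentences for manual review."""
    rng = random.Random(seed)  # fixed seed so every reviewer sees the same sample
    k = min(sample_size, len(sentences))
    return rng.sample(sentences, k)

def error_rate(verdicts):
    """Share of reviewed sentences marked as faulty (True = has an error)."""
    if not verdicts:
        raise ValueError("no verdicts given")
    return sum(verdicts) / len(verdicts)

# Hypothetical batch of sentences waiting for a bulk submission.
batch = [f"Sentence number {i}." for i in range(1000)]
sample = draw_review_sample(batch, sample_size=100)

# Hypothetical reviewer verdicts: 3 of the 100 sampled sentences had an error.
verdicts = [True] * 3 + [False] * 97
rate = error_rate(verdicts)
```

Each reviewer would go through the same `sample` and record a verdict per sentence; what counts as an acceptably low `rate` is up to the project's own rules.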
For the infrequent vocabulary items you mention, I think you/your community should write meaningful sentences and use the process I mentioned (offline pre-checks). You cannot just dump words, of course, and for SoTA models recordings longer than 5 seconds are advised, so longer sentences are better.
Other points to take care of (IMHO):
What I have currently 'implemented' is a ChatGPT prompt using the Dutch open lexicon minus the words already used in the current Mozilla download. I am able to create a dictation assignment of 10-14 words that minimizes the number of sentences for that word list, has a mandatory toponym and 'makes sense'. As a variant I have themed the dictation, for example "holiday". From my perspective this reduces one part of the effort and increases the pluriformity significantly, especially since the current set is heavily loaded with political statements, likely from parliament transcriptions.
Is there any way to download all the current sentences from the project?
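The "lexicon minus already used words" step described above can be sketched as a plain set difference. This is a minimal illustration under assumptions: the tokenizer is a simple regular expression, everything is lowercased (so capitalised toponyms lose their case), and the tiny lexicon and corpus are made-up placeholders rather than the real Dutch lexicon or the Common Voice sentence files.

```python
import re

# Runs of letters, optionally joined by an apostrophe or hyphen (e.g. "zo'n").
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")

def words_in(sentences):
    """Lowercased set of all words that occur in the given sentences."""
    found = set()
    for sentence in sentences:
        found.update(WORD_RE.findall(sentence.lower()))
    return found

def unused_words(lexicon, sentences):
    """Lexicon entries that do not occur in any sentence, sorted alphabetically."""
    return sorted({word.lower() for word in lexicon} - words_in(sentences))

# Placeholder data, for illustration only.
lexicon = ["fiets", "molen", "Zee", "kaas"]
corpus = ["De fiets staat bij de zee.", "Wij eten brood."]
missing = unused_words(lexicon, corpus)  # ["kaas", "molen"]
```

The resulting `missing` list is the set of words for which new sentences would still have to be written.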
If by assignment you mean an assignment to ChatGPT, please read this first: https://discourse.mozilla.org/t/i-think-its-time-to-talk-about-ai-generated-sentences-again/112685
> heavily loaded with political statements

Many European corpora used the EuroParl corpus at the start of the project.
> Is there any way to download all the current sentences from the project?
Yes, here they are, in your locale directory: https://github.com/common-voice/common-voice/tree/main/server/data
> If by assignment you mean an assignment to ChatGPT, please read this first: https://discourse.mozilla.org/t/i-think-its-time-to-talk-about-ai-generated-sentences-again/112685
In Dutch schools a weekly assignment is given to students to train writing and spelling. I have written a prompt to make such consistent sentences.
> > heavily loaded with political statements
>
> Many European corpora used the EuroParl corpus at the start of the project.
It does not feel very diverse.
> > Is there any way to download all the current sentences from the project?
>
> Yes, here they are, in your locale directory: https://github.com/common-voice/common-voice/tree/main/server/data
Thanks, but this seems as dated as the corpus data, no new sentences for months. I don't buy it ;-) https://github.com/common-voice/common-voice/tree/main/server/data/nl
Please continue this discussion on your Discourse topic, as these are not related to your issue in the first post. https://discourse.mozilla.org/t/contributing-many-sentences-that-contain-not-yet-spoken-words/120185
> What I have currently 'implemented' is a ChatGPT prompt using the Dutch open lexicon minus the words already used in the current Mozilla download. I am able to create a dictation assignment of 10-14 words that minimizes the number of sentences for that word list, has a mandatory toponym and 'makes sense'. As a variant I have themed the dictation, for example "holiday". From my perspective this reduces one part of the effort and increases the pluriformity significantly, especially since the current set is heavily loaded with political statements, likely from parliament transcriptions.
>
> Is there any way to download all the current sentences from the project?
Hello! The Common Voice project is not currently able to accept AI-generated sentences due to a lack of clarity in their licensing, ethical concerns about origin datasets, and quality control issues.
While we continue to monitor and discuss policy around AI-generated text resources, we can currently only accept original sentences written by humans and contributed directly under CC0 licensing, or appropriately licensed existing works.
@jessicarose if one were to use a large language model trained on CC0 source data only, would that be OK?
FYI, there is also this - freshly baked: https://arxiv.org/abs/2305.17493 (I know, not the same loop)
> FYI, there is also this - freshly baked: https://arxiv.org/abs/2305.17493 (I know, not the same loop)
I think our dataset is too sparse to worry about this.
Describe the bug: The original provider of a new sentence is asked to review their own entries. This seems counterintuitive.
To Reproduce: Steps to reproduce the behavior:
Expected behavior: A user who provided a sentence should not review it, similar to the audio reviews.
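The expected behavior could be expressed as a filter on the review queue. The sketch below is not the Common Voice implementation: `PendingSentence`, `author_id` and `review_queue` are hypothetical names chosen only to illustrate the rule that a contributor is never shown their own sentences for review.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PendingSentence:
    """Hypothetical record of a submitted sentence awaiting review."""
    text: str
    author_id: str

def review_queue(pending, reviewer_id):
    """Sentences the reviewer may vote on: everything they did not submit."""
    return [s for s in pending if s.author_id != reviewer_id]

# Illustrative data only.
pending = [
    PendingSentence("Een zin.", "alice"),
    PendingSentence("Nog een zin.", "bob"),
]
queue = review_queue(pending, "alice")  # alice only sees bob's sentence
```

As noted earlier in the thread, such a filter does not stop someone from using a second account; it only removes the self-review case described in this issue.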