common-voice / cv-dataset

Metadata and versioning details for the Common Voice dataset
https://commonvoice.mozilla.org/datasets
Mozilla Public License 2.0

How many people are in the whole dataset? #33

Closed: wntg closed this issue 4 months ago

HarikalarKutusu commented 5 months ago

Hey @wntg, I'm not sure exactly what you are asking, but the client_id field in the metadata can be used to get an "approximate" value. I say approximate because it is not an exact count of people. That ID is generated from the browser session (crypto-hash) and is kept in the profile only if someone creates a profile. Otherwise, the same person can have different client_id values generated from different sessions, devices, browsers, or even multiple accounts...

The number of distinct values for each dataset is given under the "users" field in the .json files in this repo.
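To make the "approximate" count concrete, here is a minimal sketch that counts distinct client_id values in one metadata file. It assumes a tab-separated file with a client_id column (as in a release's validated.tsv); the function name and the sample rows are made up for illustration. The result counts IDs, not people, for the reasons given above.

```python
import csv
import io


def count_distinct_client_ids(tsv_file):
    """Count distinct client_id values in an open tab-separated metadata file.

    Assumes the header row contains a "client_id" column.
    """
    # QUOTE_NONE keeps quote characters inside sentences from confusing the parser.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return len({row["client_id"] for row in reader})


# Made-up sample rows: two clips share one ID, so three rows give two IDs.
sample = io.StringIO(
    "client_id\tpath\tsentence\n"
    "aaa\tclip1.mp3\tHello\n"
    "aaa\tclip2.mp3\tWorld\n"
    "bbb\tclip3.mp3\tHi\n"
)
print(count_distinct_client_ids(sample))
```

For a real file, open it with `open(path, encoding="utf-8", newline="")` and pass the file object to the function.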

If you want an overview for a single language in time or an overall value, you can use the visualization tool here: https://metadata.cv-toolbox.web.tr/

If you need more detailed information per language dataset release, you can use this: https://analyzer.cv-toolbox.web.tr/

So according to the first one, there are 330,323 voices in v17.0, but again, this is not exact. It is the sum over the separate datasets, and a single person could volunteer in multiple datasets.

In short: You cannot get an exact figure due to the privacy rules in place, just approximations.
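A rough sketch of how such a total could be computed from one of the release .json files. The layout used here (a top-level "locales" object whose entries each carry a "users" number) is an assumption about the file structure, and the function name and toy numbers are invented for illustration. The sum is at best an upper bound, since someone who contributes to several languages is counted once per language.

```python
def total_users(release_stats):
    """Sum the per-locale "users" values of one release.

    Assumed layout: {"locales": {"<code>": {"users": <int>, ...}, ...}}.
    Locales without a "users" entry count as zero.
    """
    return sum(
        locale.get("users", 0)
        for locale in release_stats["locales"].values()
    )


# Toy data, not real Common Voice numbers.
toy_release = {"locales": {"xx": {"users": 3}, "yy": {"users": 5}}}
print(total_users(toy_release))
```

A dictionary of this shape can be obtained by passing an opened release file to `json.load`.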

wntg commented 4 months ago

Thanks for your reply. This is what I need! I am trying to use client_id to train a large speaker verification model. I may also clean this dataset again and release new labels. What do you think of this idea? Any suggestions?

HarikalarKutusu commented 4 months ago

I don't know how these models work, but AFAIK you are not allowed to attempt to identify individual voices in Common Voice.

KathyReid commented 4 months ago

^^ this