asvspoof-challenge / 2021

ASVspoof 2021 Baseline Systems
Missing data in ASVspoof 2019 LA track? #18

Closed by nguyenvulong 1 year ago

nguyenvulong commented 1 year ago

It has been 4 years, and I hope someone else has noticed this too: the line counts of the files in cm_protocols do not match the number of .flac files in the dev and eval subsets. Please see the screenshot below 👇

(I'm fine with some data being missing. My bigger concern is whether this inconsistency caused any labeling issue, e.g., audio x being labeled spoofed instead of bona fide. I hope not.)

image

TonyWangX commented 1 year ago

Hi @nguyenvulong, thanks for the message.

Recall that ASVspoof covers both ASV (automatic speaker verification) and spoofing.

Have you checked the file lists for ASV? The "missing" files are the ones listed in LA/ASVspoof2019_LA_asv_protocols/*.eval.*.trn.txt. They are used for ASV enrollment, not for the spoofing CM.

$: cd LA/ASVspoof2019_LA_asv_protocols
$: cat *.dev.*.trn.txt | awk '{print $2}' | tr ',' '\n' | wc -l
142

$: cat *.eval.*.trn.txt | awk '{print $2}' | tr ',' '\n' | wc -l
696

dev: 24986 .flac files - 24844 CM protocol lines = 142

eval: 71933 .flac files - 71237 CM protocol lines = 696

The training set does not have ASV enrollment files.
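
The counts above only show that the sizes match. A stricter check is to compare the file IDs themselves. The following is a minimal Python sketch of that idea, not part of the baseline code: the function names are hypothetical, and it assumes that the CM protocol files are whitespace-separated with the file ID in the second column, and that the enrollment files follow the same layout the awk command above relies on (second column, comma-separated IDs).

```python
from pathlib import Path


def flac_ids(flac_dir):
    """Return the file IDs (file names without .flac) found in flac_dir."""
    return {p.stem for p in Path(flac_dir).glob("*.flac")}


def cm_ids(cm_protocol_file, id_column=1):
    """Return the file IDs listed in a CM protocol file.

    Assumption: whitespace-separated columns with the file ID in the
    second column (index 1). Adjust id_column if your copy differs.
    """
    ids = set()
    for line in Path(cm_protocol_file).read_text().splitlines():
        fields = line.split()
        if len(fields) > id_column:
            ids.add(fields[id_column])
    return ids


def enrollment_ids(asv_protocol_dir, subset):
    """Return the ASV enrollment file IDs for subset ("dev" or "eval").

    Reads the same files as the shell glob *.<subset>.*.trn.txt, takes
    the second column, and splits it on commas.
    """
    ids = set()
    for path in Path(asv_protocol_dir).glob(f"*.{subset}.*.trn.txt"):
        for line in path.read_text().splitlines():
            fields = line.split()
            if len(fields) >= 2:
                ids.update(x for x in fields[1].split(",") if x)
    return ids


def unexplained_flac_files(flac_dir, cm_protocol_file, asv_protocol_dir, subset):
    """Return .flac IDs that are in neither the CM protocol nor the enrollment lists."""
    extra = flac_ids(flac_dir) - cm_ids(cm_protocol_file)
    return extra - enrollment_ids(asv_protocol_dir, subset)
```

Calling `unexplained_flac_files(...)` with the flac directory, the CM protocol file, and the ASV protocol directory of your own copy of the dataset should return an empty set if every extra .flac file is an enrollment file. Note that the sketch uses sets, so a file ID repeated in the enrollment lists is counted once, whereas `wc -l` counts every occurrence.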

nguyenvulong commented 1 year ago

Very nice. Thanks for your hard work. 👏