Duplicate samples #26 (asvspoof-challenge / 2021)

zahrakhanjani128 commented 1 year ago

Hello, we noticed that some samples in the dataset look duplicated. Are they exact copies of each other? For example, PA_E_1035160.flac and PA_E_1018196.flac, among others. Did you use some form of oversampling? Is there a specific reason for these duplicates? Thank you!

TonyWangX commented 1 year ago

Hi, files like PA_E_1035160.flac and PA_E_1018196.flac are replayed and recorded with different devices and in different rooms. Hence, the speech content of the two files is the same, but the waveforms are not identical. You can check the waveform values.
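One way to run the waveform check suggested above is sketched here. This is a minimal illustration, not part of the ASVspoof tooling: `waveform_report` and `load_flac` are hypothetical helper names, and `load_flac` assumes the third-party `soundfile` package is installed and that the files are mono.

```python
def waveform_report(a, b):
    """Compare two sequences of audio samples value by value."""
    same_length = len(a) == len(b)
    # Largest sample-wise difference over the overlapping part.
    max_abs_diff = max((abs(x - y) for x, y in zip(a, b)), default=0.0)
    return {
        "same_length": same_length,
        "identical": same_length and max_abs_diff == 0.0,
        "max_abs_diff": max_abs_diff,
    }


def load_flac(path):
    """Hypothetical loader; assumes the third-party soundfile package."""
    import soundfile as sf  # imported lazily so the comparison works without it

    samples, sample_rate = sf.read(path)
    return samples.tolist(), sample_rate


# Synthetic stand-ins for two recordings of the same utterance.
first = [0.0, 0.5, -0.5]
second = [0.0, 0.4, -0.5]
report = waveform_report(first, second)
print(report["identical"], report["same_length"])
```

With the real data, one would pass the sample lists returned by `load_flac("PA_E_1035160.flac")` and `load_flac("PA_E_1018196.flac")`. A report with `identical` set to `False` would indicate the files are not bit-for-bit copies.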

Metadata can be found at https://github.com/asvspoof-challenge/2021#evaluation-tools-using-the-full-set-of-keys-and-meta-labels

For the two files above:

...
PA_0022 PA_E_1035160 R6 M1 d4 r4 m1 s4 c4 spoof notrim hidden
...
PA_0022 PA_E_1018196 R2 M2 d4 r2 m2 s3 c3 spoof notrim hidden
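To see at a glance which metadata fields separate the two trials, the rows can be compared column by column. The sketch below is only an illustration: `diff_fields` is a hypothetical helper, and the columns are referred to by position because their meanings are not spelled out in this thread (the linked README documents them).

```python
def diff_fields(row_a, row_b):
    """Return (column index, value in row_a, value in row_b) where the rows differ."""
    fields_a, fields_b = row_a.split(), row_b.split()
    return [
        (index, a, b)
        for index, (a, b) in enumerate(zip(fields_a, fields_b))
        if a != b
    ]


# The two metadata rows quoted above.
row_a = "PA_0022 PA_E_1035160 R6 M1 d4 r4 m1 s4 c4 spoof notrim hidden"
row_b = "PA_0022 PA_E_1018196 R2 M2 d4 r2 m2 s3 c3 spoof notrim hidden"

for index, a, b in diff_fields(row_a, row_b):
    print(f"column {index}: {a} vs {b}")
```

Apart from the trial ID itself, six of the remaining columns differ between the two rows, which fits the explanation that the same utterance was replayed and recorded under different conditions.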

zahrakhanjani128 commented 6 months ago

Many thanks for the clarification
