audeering / datasets

Data cards for public audb datasets
https://audeering.github.io/datasets/

Source of cough-speech-sneeze dataset #13

Open Yuening-Ma opened 1 week ago

Yuening-Ma commented 1 week ago

Hello, I'm training a model for cough detection and would like to use the cough-speech-sneeze dataset from the audeering datasets.

I found the following dataset description on this page: Dataset based on the publication of Shahin Amiriparian: “Amiriparian, S., Pugachevskiy, S., Cummins, N., Hantke, S., Pohjalainen, J., Keren, G., Schuller, B., 2017. CAST a database: Rapid targeted large-scale big data acquisition via small-world modelling of social media platforms, in: 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, pp. 340–345. https://doi.org/10.1109/ACII.2017.8273622”

I have downloaded the dataset using the audb code (many thanks for the data!). However, I would like to know: is this the original dataset created by the authors of the paper above (Amiriparian et al.), or has audeering reorganized and modified the data? I need this information so that I can describe the data source correctly in my paper. Thanks again!

PS: How should I cite your work if I download the data with audb?

hagenw commented 1 week ago

Good point. For version 1.0.0 of the dataset (which we did not publish with audb), we used the original data and labels. In version 2.0.0, we added re-annotations for cough and sneeze and removed files that were marked as bad audio or as containing other sound classes; compare https://github.com/audeering/cough-speech-sneeze/blob/main/2.0.0/publish.py.

You don't have to add a citation for audb, but if you want to, you could cite https://arxiv.org/abs/2303.00645

hagenw commented 1 week ago

BTW, the raw labels of our re-annotation are available at https://github.com/audeering/cough-speech-sneeze/blob/main/2.0.0/annotations/20210412-102437-cough-sneeze/20210412-102437_cough-and-sneeze_annotations-cough_sneeze.csv.

hagenw commented 1 week ago

You can also load the dataset with the original labels:

>>> import audb
>>> audb.versions("cough-speech-sneeze")
['1.0.0', '2.0.0', '2.0.1']

So, if you do:

>>> db = audb.load("cough-speech-sneeze", version="1.0.0")

You should get the original data and labels.
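
To see which files differ between the original and the re-annotated release, the two versions can be compared by file list. The following is only a minimal sketch, not something taken from the audeering repository: the helper names `diff_file_lists` and `list_files` are made up for illustration, and it assumes that relative file paths stay the same across versions, which has not been checked here. It uses `audb.load` with `only_metadata=True` so that no audio is downloaded, and `full_path=False` so that paths do not include the version-specific cache folder.

```python
def diff_file_lists(old_files, new_files):
    """Return (removed, added) as sorted lists of file paths."""
    old, new = set(old_files), set(new_files)
    return sorted(old - new), sorted(new - old)


def list_files(version, name="cough-speech-sneeze"):
    """Load only the tables of one dataset version and return its file paths.

    Hypothetical helper; requires audb and network access.
    """
    import audb  # imported lazily so diff_file_lists works without audb

    db = audb.load(
        name,
        version=version,
        only_metadata=True,  # load tables only, skip the audio files
        full_path=False,  # relative paths, comparable across versions
        verbose=False,
    )
    return list(db.files)


# Example call (not executed here; assumes paths are stable across versions):
# removed, added = diff_file_lists(list_files("1.0.0"), list_files("2.0.0"))
# print(len(removed), "files removed;", len(added), "files added")
```

If the paths do turn out to be stable, `removed` would list the files that were dropped in 2.0.0, which could then be checked against the raw re-annotation CSV linked above.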

Yuening-Ma commented 1 week ago

Many thanks for your very timely reply! You really did a solid job; the 2.0 version of the dataset I downloaded is quite clean! I will read the code and the annotation file for more details.