Rainforest Dataset - Githubissues

JonasLange commented 7 months ago

From "Feature embeddings from large-scale acoustic bird classifiers enable few-shot transfer learning": Rainforest Connection Kaggle dataset (RFCX): This is the training data from the 2021 Species Audio Detection challenge, consisting of recordings of Puerto Rican birds and frogs. This data set has weak negative labels. Both birds and frogs are present in the class list; to understand model performance on these taxa, we present results on each taxa separately, and all together. The bird species in the RFCX data are present in the training data for both the Perch and BirdNET models, but most of these species have very limited training data. As of this writing, the median number of Xeno-Canto recordings for these thirteen species is just 17, and only two species have more than 50 recordings (the Bananaquit with 579 recordings, and the Black-Whiskered Vireo with 68 recordings). Thus, these are largely low-data species for these models, and the results for this data set indicate the ability of the BirdNET and Perch embeddings to separate species ID for under-trained species.

We should check whether we can integrate the data into GADME.

JonasLange commented 7 months ago

Some notes about the license:

COMPETITION DATA.

"Competition Data" means the data or datasets available from the Competition Website for the purpose of use in the Competition, including any prototype or executable code provided on the Competition Website. The Competition Data will contain private and public test sets. Which data belongs to which set will not be made available to participants.

A. Data Access and Use. You may access and use the Competition Data for non-commercial purposes only, including for participating in the Competition and on Kaggle.com forums, and for academic research and education. The Competition Sponsor reserves the right to disqualify any participant who uses the Competition Data other than as permitted by the Competition Website and these Rules.

B. Data Security. You agree to use reasonable and suitable measures to prevent persons who have not formally agreed to these Rules from gaining access to the Competition Data. You agree not to transmit, duplicate, publish, redistribute or otherwise provide or make available the Competition Data to any party not participating in the Competition. You agree to notify Kaggle immediately upon learning of any possible unauthorized transmission of or unauthorized access to the Competition Data and agree to work with Kaggle to rectify any unauthorized transmission or access.

JonasLange commented 6 months ago

Note that the labels are semi-weak. Detection was done by a cross-correlation algorithm, and human experts then confirmed or rejected the algorithm's predictions. Crucially, there are probably some unlabeled bird calls in the dataset. I did not include negative labels and frog calls as "not-a-bird": negative labels might correspond to other species, and frog calls might overlap with unlabeled bird calls.

Edit: I added the false-positive labels. To mark them, I added a new column called "sample_type".
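
For anyone who wants to separate the two kinds of labels again, here is a minimal sketch of splitting rows on the "sample_type" column. The sample rows, the other column names, and the values "true_positive" and "false_positive" are assumptions made for illustration; the actual values stored in the column may differ.

```python
import csv
import io

# Hypothetical label table. Only the "sample_type" column name comes from
# this thread; the other columns and the values "true_positive" /
# "false_positive" are assumptions for illustration.
EXAMPLE_CSV = """recording_id,species_id,t_min,t_max,sample_type
rec_a,3,1.5,2.75,true_positive
rec_a,7,10.0,11.25,false_positive
rec_b,3,4.0,5.5,true_positive
"""


def split_by_sample_type(csv_text):
    """Group label rows by the value of their sample_type column."""
    groups = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups.setdefault(row["sample_type"], []).append(row)
    return groups


groups = split_by_sample_type(EXAMPLE_CSV)
positives = groups.get("true_positive", [])
negatives = groups.get("false_positive", [])
print(len(positives), len(negatives))
```

Grouping on the column value, rather than hard-coding two lists, means any additional sample types added later would show up as extra keys without changing the function.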

DBD-research-group / BirdSet

Rainforest Dataset #95