DBD-research-group / BirdSet

A benchmark dataset collection for bird sound classification
https://huggingface.co/datasets/DBD-research-group/BirdSet
BSD 3-Clause "New" or "Revised" License

sort metadata after licences #70

Closed lurauch closed 9 months ago

lurauch commented 10 months ago

licence
//creativecommons.org/licenses/by-nc-nd/2.5/ - 67554
//creativecommons.org/licenses/by-nc-nd/3.0/ - 7379
//creativecommons.org/licenses/by-nc-nd/4.0/ - 118984
//creativecommons.org/licenses/by-nc-sa/3.0/ - 68896
//creativecommons.org/licenses/by-nc-sa/3.0/us/ - 1
//creativecommons.org/licenses/by-nc-sa/4.0/ - 415453
//creativecommons.org/licenses/by-nc/4.0/ - 128
//creativecommons.org/licenses/by-sa/3.0/ - 679
//creativecommons.org/licenses/by-sa/4.0/ - 6614
//creativecommons.org/licenses/by/4.0/ - 199
//creativecommons.org/publicdomain/zero/1.0/ - 706
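The counts above can be grouped by license family to see how much data each clause affects. This is a minimal sketch: the counts are copied from the list, while the helper `license_family` and the family names `nd`, `sa` and `other` are made up for illustration.

```python
from collections import Counter

# Recording counts per license URL, copied from the list above.
LICENSE_COUNTS = {
    "//creativecommons.org/licenses/by-nc-nd/2.5/": 67554,
    "//creativecommons.org/licenses/by-nc-nd/3.0/": 7379,
    "//creativecommons.org/licenses/by-nc-nd/4.0/": 118984,
    "//creativecommons.org/licenses/by-nc-sa/3.0/": 68896,
    "//creativecommons.org/licenses/by-nc-sa/3.0/us/": 1,
    "//creativecommons.org/licenses/by-nc-sa/4.0/": 415453,
    "//creativecommons.org/licenses/by-nc/4.0/": 128,
    "//creativecommons.org/licenses/by-sa/3.0/": 679,
    "//creativecommons.org/licenses/by-sa/4.0/": 6614,
    "//creativecommons.org/licenses/by/4.0/": 199,
    "//creativecommons.org/publicdomain/zero/1.0/": 706,
}


def license_family(url: str) -> str:
    """Return 'nd', 'sa' or 'other' depending on the clauses in the license code."""
    # Non-empty path parts, e.g. ["creativecommons.org", "licenses", "by-nc-nd", "2.5"].
    parts = [p for p in url.split("/") if p]
    terms = parts[2].split("-")  # e.g. ["by", "nc", "nd"]
    if "nd" in terms:
        return "nd"
    if "sa" in terms:
        return "sa"
    return "other"


totals = Counter()
for url, count in LICENSE_COUNTS.items():
    totals[license_family(url)] += count

total = sum(totals.values())
nd_share = totals["nd"] / total

print(dict(totals), total, f"ND share: {nd_share:.1%}")
```

With these counts, the ND family comes to 193,917 of 686,593 recordings, roughly 28% of the data.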

From CC, definitions:

Adapted Material means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image.

Problems with publishing Datasets:

Licence compatibility problem: Many of the CC licenses are incompatible with each other (see table). This can only be circumvented by using multiple licenses for different parts of the dataset. Another question is how to license the derivatives (i.e. the classification models we produce) when they are based on data under multiple conflicting licenses.

NoDerivatives (nd): As stated under "Adapted Material", we are not allowed to translate, alter, arrange or transform the original data. I think it is impossible to comply with the NoDerivatives license when uploading our data to Hugging Face.

Share-Alike (sa):

Possible Solutions

Instead of pushing to HF, just publish a script that pulls the original data, converts it and adds it to the local HF repository.

Fair Use under copyright law: When publishing models that were trained on the datasets, there should not be any problems. This is because we made transformative changes (from a dataset to a classification model).
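The script-based option could look roughly like the sketch below. It only covers the "pull" step using the Python standard library. The record fields (`url`, `filename`) and the function names `fetch_url` and `pull_recordings` are hypothetical and not part of the BirdSet code. Conversion and adding the files to a local HF dataset would follow as separate steps.

```python
import urllib.request
from pathlib import Path
from typing import Callable, Iterable


def fetch_url(url: str, dest: Path) -> None:
    """Download a single file to dest using only the standard library."""
    with urllib.request.urlopen(url) as response:
        dest.write_bytes(response.read())


def pull_recordings(
    records: Iterable[dict],
    out_dir: Path,
    fetch: Callable[[str, Path], None] = fetch_url,
) -> list[Path]:
    """Download every record into out_dir, skipping files that already exist.

    Each record is assumed (hypothetically) to have a "url" and a "filename" key.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for record in records:
        dest = out_dir / record["filename"]
        if not dest.exists():  # makes re-runs cheap and resumable
            fetch(record["url"], dest)
        paths.append(dest)
    return paths


# Hypothetical usage (the URL is a placeholder, not a real recording):
# records = [{"url": "https://example.org/rec1.mp3", "filename": "rec1.mp3"}]
# pull_recordings(records, Path("data/raw"))
```

Passing `fetch` as a parameter keeps the download logic swappable, for example to add rate limiting or to test the script without network access.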

We could also ask the authors for permission.

lurauch commented 10 months ago

CGPT on ND

The key point of the NoDerivatives clause is to prevent the creation and distribution of a version of the work that is different in a meaningful or substantial way from the original. The intention is to prevent alterations to the artistic or intellectual content of the work.

In the case of simply changing a file format:

lurauch commented 10 months ago

ND Recordings: 193k SA Recordings: 493k Total: 686k

We would lose 28% of the data. We could contact the recording whales and maybe reduce this quickly?

I'm not sure if they are doing this intentionally.

JonasLange commented 10 months ago

About CGPT on ND:

I think the main issue is not the audio file format. The license does not only cover the audio file, but also the metadata (i.e. annotated species). I think that especially the metadata we provide in the HF-repository constitutes a derivative of the metadata provided on xeno-canto.

About SA Recordings:

I think we can still use the SA recordings if we are clever with licensing. We would need to license each file individually instead of licensing the entire HF repo with a single license. This could be done by adding a license column to the metadata.
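A minimal sketch of per-file licensing via a metadata column, assuming each metadata row is a plain dict. The column names (`filepath`, `license`) and the example rows are made up for illustration and do not reflect the actual BirdSet schema.

```python
from collections import defaultdict

# Hypothetical metadata rows; every file carries its own license.
rows = [
    {"filepath": "rec_001.ogg", "license": "//creativecommons.org/licenses/by-nc-sa/4.0/"},
    {"filepath": "rec_002.ogg", "license": "//creativecommons.org/licenses/by-sa/4.0/"},
    {"filepath": "rec_003.ogg", "license": "//creativecommons.org/licenses/by-nc-sa/4.0/"},
]


def group_by_license(rows: list[dict]) -> dict[str, list[str]]:
    """Map each license URL to the files published under it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["license"]].append(row["filepath"])
    return dict(groups)


def filter_by_license(rows: list[dict], allowed: set[str]) -> list[dict]:
    """Keep only rows whose license is in the allowed set."""
    return [row for row in rows if row["license"] in allowed]


groups = group_by_license(rows)
sa_only = filter_by_license(rows, {"//creativecommons.org/licenses/by-sa/4.0/"})
```

With the license stored per row, downstream users can select the subset whose terms they are able to comply with.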

About Solutions:

If we do not want to split up the licenses but want a single license for the entire HF repository, we should indeed try to get the "whales" to loosen their licenses. On the other hand, 28% is a sizable chunk, but it is not game-breaking. So if we cannot get them to loosen their licenses, it might be better just to drop the ND recordings and save ourselves the trouble.

raphaelschwinger commented 10 months ago

Maybe XenoCanto would be interested in uploading the data to HF? This could also reduce their server troubles.

lurauch commented 10 months ago

SA recordings: I agree, the metadata on hf already provides the license. We are good to go here.

Solutions:

Image

E.g. Peter Boesman has 35k recordings, all under the ND license. Contacting him alone would (if he agrees) reduce our loss by 6%. Writing a short mail that explains our situation is quite easy.

lurauch commented 10 months ago

@raphaelschwinger regarding xc contact: imo they would not do this (exclusively) since they have a different target group. Additionally, adding data by accounts etc. is not feasible on hf.

lurauch commented 10 months ago

@raphaelschwinger @Moritz-Wirth @JonasLange

Image

lurauch commented 10 months ago

@Moritz-Wirth we should also check how many bird species are not available anymore, if we remove the ND birds --> class distribution

Image

No missing birds in the tasks
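A check like the one described above can be sketched as follows. The rows, species labels and column names are hypothetical. The `-nd/` substring test relies on the URL pattern of the CC license links listed at the top of this issue.

```python
from collections import Counter

# Hypothetical metadata rows with a species label and a per-file license.
rows = [
    {"species": "species_a", "license": "//creativecommons.org/licenses/by-nc-nd/4.0/"},
    {"species": "species_a", "license": "//creativecommons.org/licenses/by-nc-sa/4.0/"},
    {"species": "species_b", "license": "//creativecommons.org/licenses/by-nc-nd/2.5/"},
    {"species": "species_c", "license": "//creativecommons.org/licenses/by/4.0/"},
]


def is_nd(license_url: str) -> bool:
    """True if the license code ends in the NoDerivatives clause, e.g. 'by-nc-nd'."""
    return "-nd/" in license_url


def species_lost_without_nd(rows: list[dict]) -> set[str]:
    """Species that only occur in ND recordings and would vanish if those were dropped."""
    all_species = {row["species"] for row in rows}
    kept_species = {row["species"] for row in rows if not is_nd(row["license"])}
    return all_species - kept_species


# Class distribution before and after removing the ND recordings.
before = Counter(row["species"] for row in rows)
after = Counter(row["species"] for row in rows if not is_nd(row["license"]))
lost = species_lost_without_nd(rows)
```

An empty `lost` set would correspond to the "no missing birds" result, while comparing `before` and `after` shows how the class distribution shifts.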