WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

xeno-canto (bird sounds from around the world) #1553

Open sarayourfriend opened 2 years ago

sarayourfriend commented 2 years ago

Provider API Endpoint / Documentation

https://xeno-canto.org/explore/api (although it may need to be avoided, see technical info below)

Provider description

Example search for Sardinian warbler, where most if not all recordings are CC licensed: https://xeno-canto.org/explore?query=Sylvia%20melanocephala

Licenses Provided

CC generally, most appear to be BY-NC-SA, ~though v3 for all of them as far as I can tell~ (this turned out to be wrong: they link to v3 licenses on their terms of service page, but in actuality recordings use a variety of versions). More info on the terms of use page: https://xeno-canto.org/about/terms

Provider API Technical info

From the terms of usage page:

Server Resources

xeno-canto runs a server with specifications appropriate for rather intensive use by many users at the same time. Unfortunately the server cannot usually accomodate indiscriminate automated requests such as mass downloads of pages or files. Such use of the site is (actively) discouraged especially if it deteriorates the user experience or if it interferes with site maintenance. Requests for the transfer of large amounts of data, for any use allowed by the license, are of course welcome at the contact address below.

Therefore, we should not use the API. However, whatever "data dump" process we develop for WordPress/openverse#1608 could also be used to ingest a data dump provided by xeno-canto. I will contact them and see if they're open to the possibility of providing us with one (or if a method for ingesting their catalog already exists).

The checklist below is left incomplete until I've verified with xeno-canto that it's possible to ingest their data somehow.

Checklist to complete before beginning development

Implementation

sarayourfriend commented 2 years ago

Sent the email to xeno-canto. Will update here when I get a response :tada:

AetherUnbound commented 1 year ago

Their API documentation has a ton of useful info, including sample responses and query info. Apparently we can query for all recordings within a given month - that seems like a reasonable paradigm for us and would allow us to run a backfill as well! I'm going to craft some queries to try and understand the range of data that's available.

It should be noted though that they specify:

This API can be used without restrictions. However, intensive use would occasionally degrade general web site performance. We have now implemented a rate limit of 1 request per second. You need to take this into consideration and possibly adapt your application to accommodate this change.

AetherUnbound commented 1 year ago

I've confirmed that we can retrieve (and walk through) all results for a given month using queries similar to the following:

Thus the best way to set this up is probably as a monthly DAG, using the start of the DAG run's window as the month to gather recordings for. Results are paginated, so the pages would be walked through as normal. AFAICT there is no analogue or location for batch_limit.
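As a rough sketch of the page walking described above (not the exact query from this comment), something like the following could work. The endpoint URL, the `year:`/`month:` query tags, and the `numPages`/`recordings` response keys are assumptions to verify against the xeno-canto API documentation; `month_query`, `fetch_page`, and `walk_pages` are hypothetical names. The loop waits between requests to stay within the stated limit of 1 request per second.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed endpoint; confirm against the xeno-canto API documentation.
API_URL = "https://xeno-canto.org/api/2/recordings"


def month_query(year, month):
    """Build a query for one calendar month (tag syntax is an assumption)."""
    return f"year:{year} month:{month}"


def fetch_page(query, page):
    """Fetch one page of results as a dict (performs a real HTTP request)."""
    params = urllib.parse.urlencode({"query": query, "page": page})
    with urllib.request.urlopen(f"{API_URL}?{params}") as response:
        return json.load(response)


def walk_pages(fetch, query, min_interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Yield every recording for a query, one page at a time.

    Waits so that consecutive requests are at least `min_interval`
    seconds apart. Assumes each page reports the total page count
    under "numPages" and its records under "recordings".
    """
    last_request = None
    page = 1
    while True:
        if last_request is not None:
            wait = min_interval - (clock() - last_request)
            if wait > 0:
                sleep(wait)
        last_request = clock()
        data = fetch(query, page)
        yield from data.get("recordings", [])
        if page >= int(data.get("numPages", 1)):
            break
        page += 1
```

A monthly run would then be roughly `for record in walk_pages(fetch_page, month_query(2023, 6)): ...`, and a backfill would repeat that for each earlier month.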

sarayourfriend commented 1 year ago

Would it be better if we could reach out again and work out regular data dumps, or at least confirm with XenoCanto that they're okay with us going through the entire API, even at their 1/second rate limit? The Terms of Use leave room for interpretation but I don't want to burn them :sweat_smile:

It's nice that the API allows us to backfill so coherently though, if that ends up being the solution we go with in the end :rocket:

obulat commented 11 months ago

Could we use this dataset: https://www.gbif.org/dataset/b1047888-ae52-4179-9dd5-5448ea342a24?

sarayourfriend commented 11 months ago

That looks great! It's in Darwin Core too, which would set a standard approach for a number of academic/research level sources (#1332 for example). It will be critical to refer to the Darwin Core (DwC) documentation (https://dwc.tdwg.org/terms/) so we properly store and use the provided metadata.

Please note that the Bird sound dataset shared on GBIF is a subset of the entire collection that is available on https://xeno-canto.org/?gid=1. Only the recordings by recordists who have given their permission to share recording metadata with GBIF are shared here.

Based on the collection statistics on the xeno-canto front page, the GBIF dataset is only missing 10k of the full 80k collection.

I've also checked it out locally, and the CSV includes both the oscilloscope images and the sound recordings. The images are also licensed BY-NC-SA. Should we include them as well and have xeno-canto join Wikimedia Commons as an image and sound provider?
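For reference, a minimal sketch of how the rows could be split by media type if we did ingest both. It assumes the file has a `type` column with values like `Sound` and `StillImage`, modeled on the GBIF multimedia extension; the real column names and delimiter need to be checked against the downloaded file, and `split_by_media_type` is a hypothetical helper.

```python
import csv
from collections import defaultdict


def split_by_media_type(csv_file, type_column="type", delimiter=","):
    """Group rows of an open CSV file by the value in `type_column`.

    The column name and its values are assumptions; rows without
    the column are grouped under the empty string.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file, delimiter=delimiter):
        media_type = (row.get(type_column) or "").strip()
        groups[media_type].append(row)
    return dict(groups)
```

Usage would be along the lines of `with open(path, newline="") as f: groups = split_by_media_type(f)`, after which `groups.get("Sound", [])` and `groups.get("StillImage", [])` could feed the audio and image sides separately.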