Open sarayourfriend opened 2 years ago
Sent the email to xeno-canto. Will update here when I get a response :tada:
Their API documentation has a ton of useful info, including sample responses and query info. Apparently we can query for all recordings within a given month - that seems like a reasonable paradigm for us and would allow us to run a backfill as well! I'm going to craft some queries to try and understand the range of data that's available.
It should be noted though that they specify:
This API can be used without restrictions. However, intensive use would occasionally degrade general web site performance. We have now implemented a rate limit of 1 request per second. You need to take this into consideration and possibly adapt your application to accommodate this change.
I've confirmed that we can retrieve (and walk through) all results for a given month with queries similar to the following:
Thus the best way to set this up is probably as a monthly DAG using the start window as the time to gather recordings. Results are paginated, so the pages would be walked through as normal. AFAICT there is no analogue or location for `batch_limit`.
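As a rough illustration of walking the pages while respecting the 1 request per second limit, here is a minimal sketch. The endpoint path, the `numPages` and `recordings` response fields, and the `year:2020 month:5` query string are assumptions to verify against the xeno-canto API documentation; `walk_pages` and `fetch_page_http` are hypothetical helper names, not existing catalog code.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Iterator

# Assumed endpoint; check the provider's API documentation before relying on it.
API_URL = "https://xeno-canto.org/api/2/recordings"


def fetch_page_http(query: str, page: int) -> dict:
    """Fetch one page of results over HTTP and decode the JSON body."""
    params = urllib.parse.urlencode({"query": query, "page": page})
    with urllib.request.urlopen(f"{API_URL}?{params}", timeout=30) as resp:
        return json.load(resp)


def walk_pages(
    fetch_page: Callable[[str, int], dict],
    query: str,
    min_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[dict]:
    """Yield every recording for `query`, spacing requests `min_interval` seconds apart."""
    page = 1
    last_request = None
    while True:
        # Wait out whatever remains of the interval since the previous request.
        if last_request is not None:
            wait = min_interval - (clock() - last_request)
            if wait > 0:
                sleep(wait)
        last_request = clock()
        data = fetch_page(query, page)
        # "recordings" and "numPages" are assumed response field names.
        yield from data.get("recordings", [])
        if page >= int(data.get("numPages", 1)):
            break
        page += 1
```

A monthly run could then call something like `walk_pages(fetch_page_http, "year:2020 month:5")`, with the query built from the DAG's start window. The fetcher, sleep, and clock are injectable so the paging logic can be exercised without hitting the real API.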
Would it be better if we could reach out again and work out regular data dumps, or at least confirm with XenoCanto that they're okay with us going through the entire API, even at their 1/second rate limit? The Terms of Use leave room for interpretation but I don't want to burn them :sweat_smile:
It's nice that the API allows us to backfill so coherently though, if that ends up being the solution we go with in the end :rocket:
Could we use this dataset: https://www.gbif.org/dataset/b1047888-ae52-4179-9dd5-5448ea342a24?
That looks great! It's in Darwin Core too, which would set a standard approach for a number of academic/research level sources (#1332 for example). Critical to refer to the DWC documentation (https://dwc.tdwg.org/terms/) so we properly store and use the provided metadata.
Please note that the Bird sound dataset shared on GBIF is a subset of the entire collection that is available on https://xeno-canto.org/?gid=1. Only the recordings by recordists who have given their permission to share recording metadata with GBIF are shared here.
Based on the collection statistics on the Xeno canto frontpage, the GBIF dataset is only missing 10k of the full 80k collection.
I've also checked it out locally, and the CSV includes both the oscilloscope images and the sound recordings. The images are also licensed BY-NC-SA. Should we include them as well and have Xeno canto join Wikimedia Commons as an image and sound provider?
Provider API Endpoint / Documentation
https://xeno-canto.org/explore/api (although may need to be avoided, see technical info below)
Provider description
Example search for sardinian warbler, you can see most if not all recordings are CC licensed: https://xeno-canto.org/explore?query=Sylvia%20melanocephala
Licenses Provided
CC generally; most appear to be BY-NC-SA, ~though v3 for all of them as far as I can tell~ (this turned out to be wrong: they link to v3 licenses on their terms of service page, but in actuality recordings use a diversity of versions). More info on the terms of use page: https://xeno-canto.org/about/terms
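Since recordings carry a mix of license versions, each record's license would need to be read individually rather than assumed. A minimal sketch, assuming each record exposes its license as a Creative Commons URL (for example `//creativecommons.org/licenses/by-nc-sa/4.0/`); `parse_cc_license` is a hypothetical helper, not an existing function:

```python
import re
from typing import Optional, Tuple

# Matches the license slug and version in a Creative Commons license URL.
_CC_PATTERN = re.compile(
    r"creativecommons\.org/licenses/(?P<license>[a-z-]+)/(?P<version>\d+\.\d+)"
)


def parse_cc_license(url: str) -> Optional[Tuple[str, str]]:
    """Return (license, version) from a CC license URL, or None if it does not match."""
    match = _CC_PATTERN.search(url.lower())
    if match is None:
        return None
    return match.group("license"), match.group("version")
```

Records whose license cannot be parsed this way would return `None` and could be skipped rather than ingested with a guessed license.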
Provider API Technical info
From the terms of usage page:
Therefore, we should not use the API. But along with developing some kind of "data dump" process for WordPress/openverse#1608, whatever process we use for that could be used to include some kind of data dump provided by xeno-canto. I will contact them and see if they're open to the possibility of providing us with a data dump (or if a method for ingesting their catalog already exists).
The checklist below is left incomplete until I've verified with xeno-canto that it's possible to ingest their data somehow.
Checklist to complete before beginning development
Implementation