gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

Data source in GBIF downloads #3381

Open millerjeremya opened 3 years ago

millerjeremya commented 3 years ago

Data source is very important when analyzing GBIF data. But determining the source from data downloaded from GBIF is currently more difficult than it needs to be, in my opinion. If one downloads a selection of data as a CSV file, there is a datasetKey field and a publishingOrgKey field, but no obvious way to look up which institutions or databases they represent. If one instead chooses to download the data as DarwinCore, the situation is a little better but still quite labor intensive. There is a series of XML text files that appear to correspond to these key field codes. These can be opened and inspected individually in a text editor, and from this one can discover the institution or database that provided the data. It may be that there is a smarter way to view DarwinCore data, but I am not aware of it. For my part, I am interested in classifying the GBIF data that I download into a few categories based on their source. In my experience, the major categories of data in GBIF are natural history collections databases, observation networks, DNA sequence databases, and data extracted from taxonomic literature (i.e., Plazi). I would appreciate efforts to make this quicker and easier to accomplish.

ManonGros commented 3 years ago

Thanks @millerjeremya. This does not really address your point, but you can access the metadata of a dataset by using the datasetKey. The datasetKey is the UUID in the dataset URLs, and it can also be used in the registry API.

For example, the metadata for "50c9509d-22c7-4a22-a47d-8c48425ef4a7" is available

The same goes for the publishingOrgKey, except the URL is a bit different: https://www.gbif.org/publisher/28eb1a3f-1c15-4a95-931a-4af90ecb574d
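
For anyone who wants to do this lookup in bulk rather than key by key, here is a minimal Python sketch of resolving datasetKey values through the registry API. It assumes the registry endpoints `https://api.gbif.org/v1/dataset/{key}` and `https://api.gbif.org/v1/organization/{key}`, and that the JSON responses carry `title`, `type` and `publishingOrganizationKey` fields; please check those field names against the current API documentation. The function names (`registry_url`, `fetch_json`, `describe_sources`) are made up for illustration and are not part of any GBIF client.

```python
import json
from urllib.request import urlopen

# Assumed base URL of the GBIF registry API.
API_ROOT = "https://api.gbif.org/v1"


def registry_url(kind, key):
    """Build the registry API URL for a dataset or organization key."""
    if kind not in ("dataset", "organization"):
        raise ValueError(f"unsupported registry entity: {kind}")
    return f"{API_ROOT}/{kind}/{key}"


def fetch_json(url):
    """Download a URL and parse the response body as JSON."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def describe_sources(dataset_keys, fetch=fetch_json):
    """Map each distinct datasetKey to its dataset title, type and publisher title.

    The field names read from the responses are assumptions about the
    registry API's JSON, not something confirmed in this thread.
    """
    sources = {}
    publisher_titles = {}  # cache so each publisher is requested only once
    for key in set(dataset_keys):
        dataset = fetch(registry_url("dataset", key))
        org_key = dataset.get("publishingOrganizationKey")
        if org_key and org_key not in publisher_titles:
            organization = fetch(registry_url("organization", org_key))
            publisher_titles[org_key] = organization.get("title")
        sources[key] = {
            "dataset": dataset.get("title"),
            "type": dataset.get("type"),
            "publisher": publisher_titles.get(org_key),
        }
    return sources
```

Passing the distinct values of the datasetKey column from a CSV download to `describe_sources` would give one small lookup table to join back onto the occurrence records. This only retrieves names and the dataset type; sorting datasets into categories such as collections, observation networks or literature would still need rules of your own.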