gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

What ALA format does GBIF use for processing? #5504

Open Mesibov opened 1 month ago

Mesibov commented 1 month ago

Example: https://www.gbif.org/dataset/2cd6ba56-b0ee-4565-94d8-4016e25c39ae

The source archive has eml.xml, meta.xml, image.tsv and occurrence.tsv but the image and occurrence TSVs have no headers, all data items (including blanks) are quoted and there are other peculiarities.

When processing the occurrence data, does GBIF use the occurrence.tsv as distributed in the source archive, or does GBIF get the data from ALA in some other format? If the latter, what is that format, and is the otherwise-formatted data available?

ManonGros commented 1 month ago

Hi @Mesibov we use the archives available at the endpoint, the format is defined in the meta.xml file. For example: <core encoding="UTF-8" linesTerminatedBy="\r\n" fieldsTerminatedBy="\t" fieldsEnclosedBy="&quot;" ignoreHeaderLines="0" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">

The meta.xml also defines the headers used for the field mapping.
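The settings quoted above can be read programmatically. A minimal sketch, assuming only the standard Darwin Core text namespace for meta.xml; the function name is my own, and the core's <id> column (when it has no matching <field>) is not handled here:

```python
# Sketch: recover the core file's CSV dialect and the column -> term
# mapping from a Darwin Core Archive's meta.xml, since the ALA TSVs
# themselves carry no header row.
import xml.etree.ElementTree as ET

DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"

def core_dialect_and_headers(meta_xml: str):
    root = ET.fromstring(meta_xml)
    core = root.find(f"{DWC_TEXT_NS}core")
    dialect = {
        # meta.xml stores "\t" as a literal backslash-t; unescape it.
        "delimiter": core.get("fieldsTerminatedBy").encode().decode("unicode_escape"),
        "quotechar": core.get("fieldsEnclosedBy") or None,
        "skip_header_lines": int(core.get("ignoreHeaderLines", "0")),
    }
    # Order fields by their declared column index to rebuild a header row,
    # using the last path segment of each term URI as the column name.
    fields = sorted(core.findall(f"{DWC_TEXT_NS}field"),
                    key=lambda f: int(f.get("index")))
    headers = [f.get("term").rsplit("/", 1)[-1] for f in fields]
    return dialect, headers
```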

Mesibov commented 1 month ago

@ManonGros, many thanks, good to know. Because the TSVs don't have the usual IPT-created structure, I have to do further processing to make the files usable. Please also note some peculiarities in meta.xml: the 182 fields include

I'm also puzzled by the quoting of all fields in a TSV with no embedded tabs (which shouldn't be there in any case). Removing the quotes in this case drops the file size from 12.2 to 8.9 MB. In another ALA dataset I looked at, the quotes added an unneeded 90 MB.

I appreciate that these are ALA issues and that GBIF takes only what it can use from the ALA datasets, but the ALA issues are hazards for end users of the endpoint data.
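The de-quoting step described above is simple to script. A sketch using only the standard csv module; the function name and file paths are placeholders, and it assumes (as noted) that the data contains no embedded tabs:

```python
# Sketch: read a TSV in which every field (even blanks) is wrapped in
# quotes, and rewrite it with the quotes dropped, shrinking the file.
import csv

def dequote_tsv(src_path: str, dst_path: str) -> None:
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t", quotechar='"')
        # QUOTE_NONE is safe here because the fields hold no tabs.
        writer = csv.writer(dst, delimiter="\t", quoting=csv.QUOTE_NONE,
                            quotechar=None, lineterminator="\n")
        writer.writerows(reader)
```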

MattBlissett commented 1 month ago

Hi Bob,

It's not a requirement for term URIs to resolve, but it is convenient when they do.

It's very much recommended for term URIs to have a structure that outlasts whatever website, documentation system etc. is in use, which is why TDWG uses http://rs.tdwg.org/abcd/terms/abcdIdentificationQualifier (with the convenient redirect) rather than the ABCD documentation page. (And why Dublin Core uses purl.org.) If you query rs.tdwg.org as a machine user, you get a different response: curl -LH 'Accept: application/rdf+xml' http://rs.tdwg.org/abcd/terms/abcdIdentificationQualifier.

"taxonRankID" is incorrect, so this isn't a valid Darwin Core Archive (specification).

We generally recommend reading from Darwin Core Archives using a library, although I can see that in your case, doing data-quality analysis, you may well need more flexibility, or have existing workflows that aren't based on record-by-record access.

Mesibov commented 1 month ago

@MattBlissett, many thanks. I only started looking at the ALA archives because one publisher's data had changed somewhere along the pipeline publisher > ALA > GBIF. From the endpoint archive it seemed that the changes were made by ALA, but I don't have access to the files ALA took in from the publisher, so I can't be sure. I can script the changes that convert headerless, quoted ALA TSVs to unquoted TSVs with headers if I need to do this again.
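The header-restoring half of that conversion can be sketched in a few lines. This assumes the header list has already been recovered from meta.xml (e.g. from the field mapping); the function name and column names are illustrative:

```python
# Sketch: prepend a header row (derived from the meta.xml term list)
# to a headerless TSV, copying the data rows through unchanged.
def add_header(src_path: str, dst_path: str, headers: list[str]) -> None:
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\t".join(headers) + "\n")
        for line in src:
            dst.write(line)
```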