gbif / crawler

The crawling pieces - ws, cli, coordinator
Apache License 2.0

BiocaseMetadataSynchroniser: Ignore unknown namespaces #46

Closed snsb-seifert closed 2 years ago

snsb-seifert commented 3 years ago

When updating the endpoints from a Biocase installation, unknown schemas/dialects should be ignored and not harvested.

The code only checks whether the Biocase installation supports abcd_1.2; if it does not, abcd_2.06 is assumed. But there are many other ABCD versions, such as abcd_2.1.

https://github.com/gbif/registry/blob/590c8812801f5bd8fef60e417ae161e0ab1c1327/registry-metasync/src/main/java/org/gbif/registry/metasync/protocols/biocase/BiocaseMetadataSynchroniser.java#L123

For example, https://www.gbif.org/dataset/64dabd3c-4f34-4520-b9dd-d227a0bf1582 is assumed to be abcd_2.06 but is actually abcd_2.1. In abcd_2.1 the XML tag "FileURI" is renamed to "fileURI", so no multimedia objects are found at harvest.
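To illustrate the difference, here is a minimal sketch and not the actual registry code. The class name, method names and the preference order in `SUPPORTED` are assumptions made for this example; the schema labels are the ones used in this issue.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

class AbcdSchemaSelection {

    // Assumption: schemas the harvester can parse, in order of preference.
    static final List<String> SUPPORTED = Arrays.asList("abcd_2.06", "abcd_1.2");

    // Fallback as described in this issue: anything that is not abcd_1.2
    // is treated as abcd_2.06, even if abcd_2.06 is not advertised.
    static String describedFallback(Set<String> advertised) {
        return advertised.contains("abcd_1.2") ? "abcd_1.2" : "abcd_2.06";
    }

    // Stricter alternative: return a schema only if it is both advertised
    // and supported; otherwise return empty so the caller can skip it.
    static Optional<String> strictSelection(Set<String> advertised) {
        return SUPPORTED.stream().filter(advertised::contains).findFirst();
    }

    public static void main(String[] args) {
        Set<String> onlyNewer = new HashSet<>(Arrays.asList("abcd_2.1"));
        System.out.println(describedFallback(onlyNewer));           // abcd_2.06 (a wrong guess)
        System.out.println(strictSelection(onlyNewer).isPresent()); // false
    }
}
```

With the strict variant, a dataset that only advertises abcd_2.1 yields no schema at all, so no endpoint would be created for it.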

ahahn-gbif commented 3 years ago

Note: this is related to a helpdesk conversation ("Data Publisher SNSB - Update") between Feb 2 and 17, discussing the missing images of this dataset. There is no abcd_1.2 version of this dataset; the registered versions are abcd_2.06 and abcd_2.1.

ahahn-gbif commented 3 years ago

Additional diagnosis: "It seems that the crawler tries to harvest the ABCD_2.1 archive using the ABCD_2.06 parser. I don't know exactly if the parser is case sensitive, but as the archives are xml-files it should be case sensitive. The Multimediaobject fileURI tags changed between ABCD_2.06 and ABCD_2.1 "FileURI" vs "fileURI". That would explain that no image links are found. But even using the wrong parser some tags can be parsed and it would look like the harvest was successful, but it is not. The solution would be to reject unknown archive types like ABCD_2.1 and try the other endpoints instead." (17.2.21)

Alternative, short-term fix suggested: de-register the unsupported endpoint.
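The case-sensitivity point in the diagnosis above can be checked with the JDK's DOM parser, where element-name lookups in XML documents are case sensitive. The XML fragment below is a simplified, hypothetical stand-in for an ABCD 2.1 record, not real ABCD structure, and the class and method names are made up for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class TagCaseDemo {

    // Counts elements with exactly the given tag name in the XML string.
    static int countElements(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName(tagName).getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical fragment using the abcd_2.1 spelling "fileURI".
        String abcd21Like =
            "<MultiMediaObject><fileURI>http://example.org/image.jpg</fileURI></MultiMediaObject>";

        System.out.println(countElements(abcd21Like, "FileURI")); // 0: the abcd_2.06 spelling finds nothing
        System.out.println(countElements(abcd21Like, "fileURI")); // 1
    }
}
```

The document still parses without error when the wrong tag name is used, which matches the observation that the harvest looks successful while the image links are silently missing.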

ahahn-gbif commented 2 years ago

The same case occurs for https://registry.gbif.org/installation/603065ae-f762-11e1-a439-00145eb45e9a, with 4 endpoints and, to date, 37 datasets. Even though GBIF-interpretable archives for ABCD2.06 exist, the crawler picks up on the newer version 2.1 that is not handled by GBIF.

Suggested solution: allow for a generic "deny" clause / list for unsupported schemas in the crawling/ingestion workflow to prevent them from being used in the process. With ABCD in particular, we do not offer support for schema versions > ABCD2.06.

The alternative, short-term fix of manually de-registering unsupported endpoints has been used so far, but keeps causing issues (entries are re-generated during endpoint synchronization), and is becoming more unwieldy with added datasets.
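A deny list of the kind suggested above could look roughly like the following sketch. The `Endpoint` class, the `usable` method and the example.org URLs are hypothetical and only illustrate the idea; they do not reflect the registry's or crawler's actual data model.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class DenyListDemo {

    // Hypothetical, minimal endpoint representation for this sketch.
    static final class Endpoint {
        final String url;
        final String schema;

        Endpoint(String url, String schema) {
            this.url = url;
            this.schema = schema;
        }
    }

    // Keeps only endpoints whose schema is not on the deny list.
    static List<Endpoint> usable(List<Endpoint> endpoints, Set<String> deniedSchemas) {
        return endpoints.stream()
            .filter(e -> !deniedSchemas.contains(e.schema))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Endpoint> synced = Arrays.asList(
            new Endpoint("http://example.org/archive_2.06.zip", "abcd_2.06"),
            new Endpoint("http://example.org/archive_2.1.zip", "abcd_2.1"));

        // Assumption: abcd_2.1 is configured as unsupported.
        Set<String> denied = Collections.singleton("abcd_2.1");

        for (Endpoint e : usable(synced, denied)) {
            System.out.println(e.schema + " " + e.url); // only the abcd_2.06 endpoint is printed
        }
    }
}
```

Applying the filter during endpoint synchronization, rather than de-registering by hand, would avoid the problem of denied entries being re-generated on the next sync.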

MattBlissett commented 2 years ago

@ahahn-gbif The fix is on UAT. I have synchronized https://registry.gbif-uat.org/installation/6038e54e-f762-11e1-a439-00145eb45e9a/endpoint and https://registry.gbif-uat.org/installation/603065ae-f762-11e1-a439-00145eb45e9a

I think the remaining issues for ZFMK are related to identifiers.

We'll deploy this to production first thing on Monday.

ahahn-gbif commented 2 years ago

Endpoint synchronization appears to work after the change, thanks - confirmed, closing.

Just for the record: the datasets that appear to have issues with, possibly, duplicate identifiers are these:

ahahn-gbif commented 2 years ago

Sorry, cannot close in this project - please consider it done.