gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Improve agent parser #640

Open muttcg opened 2 years ago

muttcg commented 2 years ago

Also notice that the PID for Reidar Elven is reported as the Wikidata URL, while the Wikidata concept URI would be better, and his ORCID ID would be better still...

... which is on the GBIF Norway work plan to try to fix, through building an annotator into the node information infrastructure.

Originally posted by @dagendresen in https://github.com/gbif/pipelines/issues/590#issuecomment-995561617

rukayaj commented 2 years ago

We don't need the annotator for the Wikidata concept URI bit; I've fixed it so that we are now publishing "http://www.wikidata.org/entity/x" instead of "https://www.wikidata.org/wiki/x". I think we can't check that it's been updated on gbif.org until this IPT bug https://github.com/gbif/ipt/issues/1703 is fixed, but here is a screenshot from the source data preview on the IPT:

[screenshot: source data preview on the IPT, 2021-12-16]
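For illustration, the URL rewrite described here amounts to something like the following sketch (a hypothetical helper, not code from gbif/pipelines):

```java
// Sketch: normalise a Wikidata page URL to its canonical entity URI.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class WikidataUris {
  private static final Pattern WIKI_PAGE =
      Pattern.compile("https?://(?:www\\.)?wikidata\\.org/wiki/(Q\\d+)");

  /** Rewrites https://www.wikidata.org/wiki/Qxxx to http://www.wikidata.org/entity/Qxxx. */
  static String toEntityUri(String url) {
    Matcher m = WIKI_PAGE.matcher(url);
    return m.matches() ? "http://www.wikidata.org/entity/" + m.group(1) : url;
  }

  public static void main(String[] args) {
    System.out.println(toEntityUri("https://www.wikidata.org/wiki/Q11983328"));
    // -> http://www.wikidata.org/entity/Q11983328
  }
}
```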

Currently if you want to add an ORCID for Reidar, it needs to be emailed to us so we can insert it into our collections management system, where it will get published out to the IPT. You could also use Bionomia to attach the ORCID to Reidar Elven's records.

dagendresen commented 2 years ago

I was thinking more of the last potential step, from the Wikidata QID --> ORCID ID :-)

The idea is to catch this inference from the Wikidata entry for Reidar Elven, which is annotated with his ORCID ID, and perhaps add one more step through an annotator to validate the inference and store its provenance (Wikidata information can be changed and can even be deleted) before replacing the QID with the ORCID on the IPT (which I think would be an improvement).
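A minimal sketch of the QID --> ORCID inference step, assuming the public Wikidata Query Service and its ORCID iD property (P496). This is illustrative only, not part of gbif/pipelines, and the QID is borrowed from the example later in this thread; error handling and rate limiting are omitted:

```java
// Sketch: look up an ORCID iD (Wikidata property P496) for a QID via the
// Wikidata Query Service.
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public final class WikidataOrcidLookup {
  public static void main(String[] args) throws Exception {
    String qid = "Q11983328"; // example QID from this thread
    String sparql = "SELECT ?orcid WHERE { wd:" + qid + " wdt:P496 ?orcid }";
    URI uri = URI.create("https://query.wikidata.org/sparql?format=json&query="
        + URLEncoder.encode(sparql, StandardCharsets.UTF_8));
    HttpResponse<String> resp = HttpClient.newHttpClient()
        .send(HttpRequest.newBuilder(uri)
                  // a descriptive User-Agent is polite for the query service
                  .header("User-Agent", "agent-parser-sketch/0.1")
                  .GET().build(),
              HttpResponse.BodyHandlers.ofString());
    // The JSON response carries the ORCID iD, plus enough context (query time,
    // QID, property) to store as provenance for the inference.
    System.out.println(resp.body());
  }
}
```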

dagendresen commented 2 years ago

In my mind, "we" (or a machine we trust) could "flag" some annotations as "approved" and implement a mechanism where the data values provided by the source publisher are improved/changed before being published on the IPT.

The resolver would hold the "approved" data values, and an annotator would hold the provenance evidence for why and where the new data value comes from (and would also help keep the resolver up to date with the vetted data value).
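A rough sketch of what such a provenance-carrying annotation might look like; the type and all field names are assumptions for illustration, not an existing GBIF schema:

```java
// Hypothetical annotation record: the vetted value plus the provenance of the
// inference, kept because the upstream source (e.g. Wikidata) can change.
import java.time.Instant;

record AgentIdAnnotation(
    String originalValue,   // e.g. the Wikidata entity URI from the publisher
    String approvedValue,   // e.g. the ORCID ID the annotator inferred
    String evidenceSource,  // e.g. "wd:Q... wdt:P496, Wikidata Query Service"
    String approvedBy,      // the person or trusted machine that flagged it
    Instant retrievedAt) {} // when the evidence was captured
```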

dshorthouse commented 2 years ago

Although tangential to the ticket, I noticed in the screenshot that there are both Wikidata entity URIs and ORCID ID URIs for what appears to be the same person, separated by pipes. Lars Ove Hansen is identified in the same record as https://orcid.org/0000-0002-6313-0529 | http://www.wikidata.org/entity/Q11983328. This is contrary to the recommended best practice, which states, "Recommended best practice is to provide a single identifier that disambiguates the details of the identifying agent." Here, it seems you've provided two. A naive interpreter of this (e.g. Bionomia or another consumer) might assume that these are two people.
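To make the ambiguity concrete, a consumer that splits the field on the pipe delimiter (the common Darwin Core convention for multi-valued terms) counts two agents here. A minimal sketch, using the values from the screenshot:

```java
// Sketch: why pipe-delimited identifiers are ambiguous for a naive consumer.
import java.util.Arrays;
import java.util.List;

public final class RecordedByIdSplit {
  public static void main(String[] args) {
    String recordedByID =
        "https://orcid.org/0000-0002-6313-0529 | http://www.wikidata.org/entity/Q11983328";
    List<String> agents = Arrays.asList(recordedByID.split("\\s*\\|\\s*"));
    System.out.println(agents.size() + " apparent agents: " + agents);
    // -> 2 apparent agents, although both identifiers refer to the same person.
  }
}
```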

dshorthouse commented 2 years ago

The title of this ticket is "Improve agent parser", but it's unclear to me what counts as an agent & what it is that needs parsing. Is this in reference to a need to parse the strings in recordedBy and identifiedBy, or to refine how content in recordedByID and identifiedByID ought to be separated and then handled/interpreted? If the former, I'd be pleased to collaborate on a shared set of tests & expected outputs for the benefit of developers who create similar code in other languages & wish to know how it stacks up against others' approaches. We could use this generic file as a starting point: https://github.com/bionomia/dwc_agent/blob/master/spec/resources/test_data.txt
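As a sketch of what such a shared conformance harness might look like, assuming a tab-separated input/expected file format (the real test_data.txt layout may differ) and with parse() as a stand-in for whatever agent parser is under test:

```java
// Sketch: run a parser over a file of test cases and count divergences.
import java.nio.file.Files;
import java.nio.file.Path;

public final class AgentParserConformance {
  // Placeholder for the parser under test; a real implementation would split
  // and normalise agent strings from recordedBy / identifiedBy.
  static String parse(String raw) {
    return raw.trim();
  }

  public static void main(String[] args) throws Exception {
    long failures = Files.lines(Path.of("test_data.txt"))
        .map(line -> line.split("\t", 2))          // assumed: input<TAB>expected
        .filter(cols -> cols.length == 2)
        .filter(cols -> !parse(cols[0]).equals(cols[1]))
        .count();
    System.out.println(failures + " cases diverge from the expected output");
  }
}
```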