Open muttcg opened 2 years ago
We don't need the annotator for the Wikidata concept URI bit: I've fixed it so that we now publish "http://www.wikidata.org/entity/x" instead of "https://www.wikidata.org/wiki/x". I don't think we can check that it has been updated on gbif.org until this IPT bug, https://github.com/gbif/ipt/issues/1703, is fixed, but here is a screenshot from the source data preview on the IPT: [screenshot].
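For anyone who needs the same fix on their side, here is a minimal sketch of the rewrite from the Wikidata page URL to the concept URI. The function name to_concept_uri is hypothetical, and this is not the code used in our collections management system; it only shows the string transformation described above.

```python
import re

# Matches Wikidata page URLs such as https://www.wikidata.org/wiki/Q42
_WIKI_PAGE = re.compile(r"^https?://www\.wikidata\.org/wiki/(Q[1-9]\d*)$")


def to_concept_uri(value: str) -> str:
    """Rewrite a Wikidata page URL as a concept URI; return other values unchanged."""
    match = _WIKI_PAGE.match(value.strip())
    if match is None:
        # Not a Wikidata page URL (e.g. an ORCID URI): leave it alone.
        return value
    return f"http://www.wikidata.org/entity/{match.group(1)}"
```

Values that are already concept URIs, or that are not Wikidata URLs at all, pass through untouched, so the function should be safe to apply to a whole column.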
Currently, if you want to add an ORCID for Reidar, it needs to be emailed to us so that we can insert it into our collections management system, from where it will be published to the IPT. You could also use Bionomia to attach the ORCID to Reidar Elven's records.
I was thinking more of the last potential step from the Wikidata URI QID --> ORCID ID :-)
That is, to catch this inference from the Wikidata entry for Reidar Elven, which is annotated with his ORCID iD, and perhaps to add one more step through an annotator whose purpose is to validate the inference and store its provenance (Wikidata information can be changed and can even be deleted) before replacing the QID with the ORCID on the IPT (which I think would be an improvement, if we do it).
In my mind, "we" (or a machine we trust) could "flag" some annotations as "approved" and implement a mechanism where the data values provided from the source publisher are improved/changed before being published on the IPT.
Here the resolver would hold the "approved" data values, and an annotator would hold the provenance evidence for why and where the new data value comes from (and also contribute to keeping the resolver up-to-date with the vetted data value).
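To make the idea concrete, here is a rough sketch of the QID --> ORCID inference with provenance attached. It assumes the entity is available as a dict in the Wikibase JSON shape (claims keyed by property ID) and relies on P496 being the Wikidata property for ORCID iD. The Annotation class, the infer_orcid function and the approved flag are hypothetical names for illustration, not an existing GBIF or IPT API, and the example entity is hand-written rather than fetched from Wikidata.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

ORCID_PROPERTY = "P496"  # Wikidata property "ORCID iD"


@dataclass(frozen=True)
class Annotation:
    """Hypothetical provenance record for one inferred identifier."""
    qid: str
    orcid_uri: str
    source: str          # where the inference came from
    retrieved_at: str    # when the source was read (ISO 8601)
    approved: bool = False  # to be set only after a person or trusted machine vets it


def infer_orcid(entity: dict, retrieved_at: Optional[datetime] = None) -> Optional[Annotation]:
    """Return an unapproved annotation if the entity carries an ORCID claim."""
    for claim in entity.get("claims", {}).get(ORCID_PROPERTY, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "novalue" / "somevalue" statements
        when = retrieved_at or datetime.now(timezone.utc)
        return Annotation(
            qid=entity["id"],
            orcid_uri=f"https://orcid.org/{snak['datavalue']['value']}",
            source=f"http://www.wikidata.org/entity/{entity['id']}",
            retrieved_at=when.isoformat(),
        )
    return None


# Hand-written stand-in for a Wikidata entity, using the identifiers from this thread.
example_entity = {
    "id": "Q11983328",
    "claims": {
        "P496": [
            {"mainsnak": {"snaktype": "value",
                          "datavalue": {"value": "0000-0002-6313-0529", "type": "string"}}}
        ]
    },
}
annotation = infer_orcid(example_entity)
```

Because the annotation records the source and retrieval time, the evidence survives even if the Wikidata statement is later changed or deleted. Under this sketch, the resolver would only serve annotations whose approved flag has been set.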
Although tangential to the ticket, I noticed in the screenshot that there are both Wikidata entity URIs and ORCID URIs for what appears to be the same person, separated by pipes. Lars Ove Hansen is identified in the same record as https://orcid.org/0000-0002-6313-0529 | http://www.wikidata.org/entity/Q11983328. This is contrary to the Darwin Core guidance, which states, "Recommended best practice is to provide a single identifier that disambiguates the details of the identifying agent." Here, it seems you've provided two. A naive interpreter of this (e.g. Bionomia or another consumer) might assume that these are two people.
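A small sketch of what such a naive interpreter might do, assuming it simply splits the field on the pipe separator and treats every resulting identifier as a distinct agent. The function name split_agent_ids is hypothetical and is not taken from Bionomia or any other existing tool.

```python
from typing import List


def split_agent_ids(value: str) -> List[str]:
    """Split a pipe-delimited identifier field into trimmed, non-empty parts."""
    return [part.strip() for part in value.split("|") if part.strip()]


field = "https://orcid.org/0000-0002-6313-0529 | http://www.wikidata.org/entity/Q11983328"
agents = split_agent_ids(field)

# A consumer that equates "one identifier" with "one person" now counts two agents,
# although both identifiers refer to the same person.
agent_count = len(agents)
```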
The title of this ticket is "Improve agent parser", but it's unclear to me what an agent is and what it is that needs parsing. Is this in reference to a need to parse the strings in recordedBy and identifiedBy, or to refine how content in recordedByID and identifiedByID ought to be separated and then handled/interpreted? If the former, I'd be pleased to collaborate on a shared set of tests and expected outputs for the benefit of developers who create similar code in other languages and wish to know how it stacks up against others' approaches. We could use this generic file as a starting point: https://github.com/bionomia/dwc_agent/blob/master/spec/resources/test_data.txt
Also notice that the PID for Reidar Elven is reported as the Wikidata URL, while the Wikidata concept URI would be better, and still his ORCID ID would be even better...
... which is on the GBIF Norway work plan to try to fix, through building an annotator into the node information infrastructure.
Originally posted by @dagendresen in https://github.com/gbif/pipelines/issues/590#issuecomment-995561617