gbif-norway / helpdesk

Please submit your helpdesk request here (or send an email to helpdesk@gbif.no). We will also use this repo for documentation of node helpdesk cases.
GNU General Public License v3.0

Improve identifiedByID and recordedByID publishing #75

Closed rukayaj closed 2 years ago

rukayaj commented 2 years ago

Publish "http://www.wikidata.org/entity/x" instead of "https://www.wikidata.org/wiki/x" . This is obviously just for the datasets publishing identifiedByID and recordedByID.

rukayaj commented 2 years ago

    replace(replace(recordedByID, ',', ' | '), 'https://www.wikidata.org/wiki/', 'http://www.wikidata.org/entity/') "recordedByID",
    replace(replace(identifiedByID, ',', ' | '), 'https://www.wikidata.org/wiki/', 'http://www.wikidata.org/entity/') "identifiedByID",

Done for the datasets where we are publishing recordedByID and identifiedByID:

- Mycology herbarium, Oslo (O) UiO: https://ipt.gbif.no/manage/resource.do?r=o_fungi
- Vascular Plant Herbarium, Oslo (O) UiO: https://ipt.gbif.no/manage/resource?r=o_vascular
- Algae collection, Oslo (O) UiO: https://ipt.gbif.no/manage/resource.do?r=algae_o
- Bryophyte Herbarium, Oslo (O) UiO: https://ipt.gbif.no/manage/resource.do?r=o_bryophytes
- Lichen herbarium, Oslo (O) UiO: https://ipt.gbif.no/manage/resource.do?r=o_lichens
- Entomology, Natural History Museum, University of Oslo: https://ipt.gbif.no/manage/resource.do?r=o_lepidoptera

Note: here is the code if we also need to remove the trailing '|' and the additional spaces:

    TRIM(TRAILING '|' FROM replace(replace(recordedByID, ',', '|'), ' ', '')) "recordedByID",
    TRIM(TRAILING '|' FROM replace(replace(identifiedByID, ',', '|'), ' ', '')) "identifiedByID",
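
For checking the expected output outside the database, the two SQL expressions above can be sketched in Python. This is only an illustration: the function name `normalise_id_list` is hypothetical, and it assumes the source value is a comma-separated string of URLs, as the SQL implies. Unlike the SQL, it drops empty items before joining, so a trailing comma cannot leave a trailing separator.

```python
WIKI_PREFIX = "https://www.wikidata.org/wiki/"
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def normalise_id_list(raw: str) -> str:
    """Hypothetical helper: comma-separated IDs -> ' | '-separated entity URIs."""
    # Split on commas, trim whitespace, and drop empty items so that a
    # trailing comma cannot produce a trailing ' | '.
    items = [item.strip() for item in raw.split(",")]
    items = [item for item in items if item]
    # Rewrite Wikidata page URLs to the entity URI form.
    items = [item.replace(WIKI_PREFIX, ENTITY_PREFIX) for item in items]
    return " | ".join(items)
```

For example, `normalise_id_list("https://www.wikidata.org/wiki/Q937626,")` gives `http://www.wikidata.org/entity/Q937626` with no trailing pipe.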

rukayaj commented 2 years ago

Check that this is all as expected once we can publish recordedByID and identifiedByID from the IPT again (I think there's an IPT bug). This is what it's looking like in the data preview:

[Screenshot 2021-12-16 at 11 38 08]

rukayaj commented 2 years ago

Note - we SHOULD remove trailing |'s, as there should not be any ordering info in these fields. See:

https://github.com/gbif/pipelines/issues/640

@rukayaj re: ordering of list of URIs. We thought carefully about this when the identifiedByID and recordedByID terms were created. The recommended best practices https://dwc.tdwg.org/terms/#dwc:identifiedByID states, "If a list is used, the order of the identifiers on the list should not be assumed to convey any semantics." And so, I'd strip out all the unknowns that seem to be meant to convey semantics (i.e. ordering). None can be assumed.

rukayaj commented 2 years ago

We should also really serve ORCIDs, and fall back on QIDs for people, and not publish both ORCID and QID for one person.

https://github.com/gbif/pipelines/issues/640

Although tangential to the ticket, I noticed in the screenshot that there are both wikidata entity URIs and ORCID ID URIs for what appears to be the same person, separated by pipes. Lars Ove Hansen is identified in the same record as https://orcid.org/0000-0002-6313-0529 | http://www.wikidata.org/entity/Q11983328. This is contrary to the recommended best practice that states, "Recommended best practice is to provide a single identifier that disambiguates the details of the identifying agent." Here, it seems you've provided two. A naive interpreter of this (eg Bionomia or other) might assume that these are two people.

This is going to be tricky as it will involve some changes with the way MUSIT is constructing their data views, so we are reliant on them... Perhaps I should ask them if @MichalTorma and I can get access to that server + the source code.

dagendresen commented 2 years ago

This is going to be tricky as it will involve some changes with the way MUSIT is constructing their data views

Which is part of why an annotation service would be useful ;-) I would be careful about simply taking over the IT development workload for the museum CMS ;-)

rukayaj commented 2 years ago

But the problem is that we are publishing the ORCID as well as the QID for some collectors/identifiers, right? So would the annotation service fix the data BEFORE publication? I assumed data would get annotated after publication, and that the use case would be more to e.g. add in missing QIDs and link QIDs to ORCIDs or similar.

Yes, good point about thereby accidentally volunteering to take over the CMS development... But it would sometimes be useful to be able to do small fixes ourselves instead of needing to wait for them to have time to do it. Hmm!

dagendresen commented 2 years ago

Agree that access to the CMS is useful, and thus being able to fix the data stream much closer to the source :-D

Overall, the museum is itself responsible for publishing its collection datasets at a sufficient level of data quality.

Many data publishers, not only the museum, publish data that I believe has huge potential for crowdsourced improvement.

Occurrence records whose recordedByID reports both the QID and the ORCID iD for the very same person could be caught by a machine and added to the Annotation service, where someone (a person or maybe even another machine) could fix them and approve the fix as a valid data value improvement. Approved data annotations could then flow to the Resolver, and the data stream from the source data publisher's dataset could then maybe be improved with data values from the Resolver...? (Maybe I am overthinking the mechanism?)
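
As a rough illustration of the machine check described above, the following Python sketch flags records whose recordedByID or identifiedByID contains both an ORCID URI and a Wikidata URI. The function names and the dict-based record shape are assumptions made for the example, not part of any existing service. Once the values are pipe-separated, the check cannot tell whether the two identifiers belong to the same person, so it only produces candidates for someone to review.

```python
import re

# ORCID iDs end in a checksum character that may be a digit or 'X'.
ORCID_RE = re.compile(r"https?://orcid\.org/\d{4}-\d{4}-\d{4}-\d{3}[\dX]")
QID_RE = re.compile(r"https?://www\.wikidata\.org/(?:entity|wiki)/Q\d+")


def has_mixed_identifiers(field_value: str) -> bool:
    """True if the value holds at least one ORCID URI and one Wikidata URI."""
    return bool(ORCID_RE.search(field_value)) and bool(QID_RE.search(field_value))


def flag_records(records, fields=("recordedByID", "identifiedByID")):
    """Return (occurrenceID, [field names]) for records worth reviewing.

    `records` is assumed to be an iterable of dicts keyed by Darwin Core
    term names; missing or None values are treated as empty strings.
    """
    flagged = []
    for record in records:
        mixed = [f for f in fields if has_mixed_identifiers(record.get(f) or "")]
        if mixed:
            flagged.append((record.get("occurrenceID"), mixed))
    return flagged
```

The flagged list could then be handed to whatever annotation workflow is eventually chosen; that hand-off is deliberately left out of the sketch.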

dagendresen commented 2 years ago

Apropos, the hope is that the annotations on the Annotator would more often than not be fixed in the source dataset... but for some datasets, the data curator might be deceased (or have changed jobs).

rukayaj commented 2 years ago

We are now publishing records with the correct URL form, e.g. from https://www.gbif.org/occurrence/1701807338 we now have http://www.wikidata.org/entity/Q937626.

Note - we SHOULD remove trailing |'s, as there should not be any ordering info in these fields.

I think this is also fixed.

We should also really serve ORCIDs, and fall back on QIDs for people, and not publish both ORCID and QID for one person.

I need to talk to MUSIT about this. I am pretty sure we previously told them that it would be ok to publish both, but I'll ask if they can swap it to check for ORCID, publish that if possible, and fall back to QID. I will leave this issue open to keep track of that.

rukayaj commented 2 years ago

We should also really serve ORCIDs, and fall back on QIDs for people, and not publish both ORCID and QID for one person.

Eirik asked to talk about this before MUSIT make the change, but they said it should be possible to do and not a problem. Eirik is back from holiday next week.

rukayaj commented 2 years ago

Eirik wants to continue to harvest both QIDs and ORCIDs for his portal from the MUSIT export. I have changed the SQL so that it selects only the ORCID when a person has both an ORCID and a QID (this is a bit of a crazy regex replace; if anyone else can see a better way to do it, please let me know @MichalTorma):

    REPLACE( /* "/wiki" => "/entity" */
      REGEXP_REPLACE( /* Trim ' ' and '|' */
        /* MUSIT export has | to separate same person but diff ids and , to separate different people, e.g. "author1-id1 | author1-id2, author2-id1" */
        REGEXP_REPLACE( /* Replace multiple ,s with official | separator */
          REGEXP_REPLACE( /* "wikidata.org/x | orcid.org/x" => "orcid.org/x" */
            REGEXP_REPLACE( /* "orcid.org/x | wikidata.org/x" => "orcid.org/x" */
              recordedByID,
              '((https://orcid.org/\d{4}-\d{4}-\d{4}-\d{4}) \| https://www.wikidata.org/wiki/Q\d+)',
              '\2 '
            ),
            '(https://www.wikidata.org/wiki/Q\d+ \| (https://orcid.org/\d{4}-\d{4}-\d{4}-\d{4}))',
            '\2 '
          ),
          '[ ,]+',
          ' | '
        ),
        '^[ |]+|[ |]+$',
        ''
      ), 'https://www.wikidata.org/wiki/', 'http://www.wikidata.org/entity/'
    ) "recordedByID"
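
One possible alternative to the nested regex replaces, sketched here in Python rather than SQL, is to parse the export format directly: split on ',' to get people, split each person on '|' to get their IDs, then keep one ID per person. This is an illustration under stated assumptions, not the query that was deployed. The names `pick_identifier` and `publishable_ids` are hypothetical. The ORCID pattern also accepts a final 'X' checksum character, which the SQL pattern above does not. When a person has no ORCID, the sketch keeps only their first ID, whereas the SQL leaves any other combination untouched.

```python
import re

WIKI_PREFIX = "https://www.wikidata.org/wiki/"
ENTITY_PREFIX = "http://www.wikidata.org/entity/"
# Final ORCID character is a checksum: a digit or 'X'.
ORCID_RE = re.compile(r"^https://orcid\.org/\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def pick_identifier(ids):
    """Prefer the first ORCID in the list; otherwise fall back to the first ID."""
    for identifier in ids:
        if ORCID_RE.match(identifier):
            return identifier
    return ids[0]


def publishable_ids(raw: str) -> str:
    """Convert 'p1-id1 | p1-id2, p2-id1' into one ID per person, ' | '-separated."""
    people = []
    for person in raw.split(","):
        # IDs for the same person are separated by '|' in the export.
        ids = [i.strip() for i in person.split("|") if i.strip()]
        if not ids:
            continue  # skip empty segments, e.g. from a trailing comma
        chosen = pick_identifier(ids)
        people.append(chosen.replace(WIKI_PREFIX, ENTITY_PREFIX))
    return " | ".join(people)
```

Because empty segments are skipped before joining, no leading or trailing separators need to be trimmed afterwards.
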

rukayaj commented 2 years ago

This change has been rolled out and it seems to be working ok.