gbif-norway / helpdesk

Please submit your helpdesk request here (or send an email to helpdesk@gbif.no). We will also use this repo for documentation of node helpdesk cases.
GNU General Public License v3.0

Publish ArtsObs IDs so that records can be linked to 'duplicates' in Artskart #74

Open rukayaj opened 2 years ago

rukayaj commented 2 years ago

Following a meeting with Gunnhild and Eirik, they have pointed out that there is a data quality issue in Artskart that we can help to improve:

  1. Someone observes a mushroom in the wild, posts the observation to ArtsObs and sends it (a sample?) to NHM
  2. NHM adds it to MUSIT in the Mycology Herbarium dataset, and adds the ArtsObs observation number to the MUSIT record
  3. We publish it to GBIF. Currently we are not publishing the ArtsObs number, but the suggestion is that we start doing that.
  4. Artskart harvests our data, and ends up showing a 'duplicate' on the map. This is ok, but the next point makes it a problem:
  5. The species name (or coordinates for example) gets updated in MUSIT and not in Artskart - now there is a divergence which could be confusing

Worth noting that 25% of one group of fungi have changed name in the last few years, so it is definitely quite a common occurrence.

Regardless, I think that it would be useful for us to have some kind of link to the Artsobs record. And Artsobs are also published in GBIF.

We discussed the possibility of using resourceRelationship to make this link. It should be possible on our side (just make a new database connection to the table, select the Artsobs id field + the occurrenceID field, and add some information explaining the relationship). If the Artsobs ID isn't a proper identifier, maybe it should go in as a measurementOrFact or dynamicProperties?
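A minimal sketch of what the resourceRelationship rows could look like, assuming the MUSIT view is read into dicts. The column headers are the Darwin Core ResourceRelationship terms; the input key `artsobs_id`, the default relationship label, the remarks text and the example occurrenceIDs in any usage are hypothetical placeholders, not the actual MUSIT field names or agreed wording.

```python
import csv
import io

# Darwin Core ResourceRelationship terms used as column headers.
FIELDS = [
    "resourceID",
    "relatedResourceID",
    "relationshipOfResource",
    "relationshipRemarks",
]


def relationship_rows(records, relationship="same observation event as"):
    """Yield one relationship row per record that carries an ArtsObs ID.

    `records` is an iterable of dicts with the hypothetical keys
    "occurrenceID" and "artsobs_id"; the real column names may differ.
    """
    for rec in records:
        artsobs_id = (rec.get("artsobs_id") or "").strip()
        if not artsobs_id:
            continue  # nothing to link for this specimen
        yield {
            "resourceID": rec["occurrenceID"],
            "relatedResourceID": artsobs_id,
            "relationshipOfResource": relationship,  # placeholder label
            "relationshipRemarks": "Specimen deposited from an ArtsObs observation",
        }


def to_tsv(rows):
    """Serialise the rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Records without an ArtsObs ID are skipped, so the extension file only contains rows where a link actually exists.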

Gunnhild will arrange a meeting with me, Eirik, Knut Anders and herself later on, probably in January, to discuss the best way to solve this.

To do before that: Check that the Artsobs id field (called "ARTSOBS" I think) is actually IN the MUSIT database view - otherwise contact MUSIT about this. Also, I remember reading a GitHub issue about best practices for linking 'duplicate' records; try to find that again.

dagendresen commented 2 years ago

For the specimen record (from MUSIT) I think that the ArtsObsID could be mapped to dwc:recordNumber and the occurrenceID from ArtsObs could be mapped to dwc:associatedOccurrences.

And that (also) using ResourceRelationship is a very good idea.

PS: I also think that the ArtsObs record is a distinct and different Occurrence record (sensu Darwin Core / DwCA model) from the MUSIT specimen record. However, in a better data model, both data records would share the same occurrenceID and only the specimen would be identified by its materialSampleID, rather than (as in the current data model) requiring a new occurrenceID ;-)

rukayaj commented 2 years ago

The ARTOBSNUMMER field is present in the MUSIT export, I have added it to recordNumber for our MUSIT datasets but we still need to discuss with Knut where they want to extract it from. We can also add a resourceRelationship record for them.

dagendresen commented 2 years ago

Possibly use http://rs.tdwg.org/dwc/terms/associatedOccurrences

rukayaj commented 2 years ago

We decided to use https://dwc.tdwg.org/terms/#dwc:associatedOccurrences so that the private record number for collectors still has its own field. We will use 'same as', so the field value is e.g.:

"same as": "https://www.artsobservasjoner.no/Sighting/20730716".
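As a rough sketch, the associatedOccurrences value shown above could be generated from the ArtsObs number like this. The URL pattern and the "same as" label come from the example; the function name and the assumption that ArtsObs numbers are purely numeric are illustrative only.

```python
import json

# URL pattern taken from the example sighting link above.
SIGHTING_URL = "https://www.artsobservasjoner.no/Sighting/{}"


def associated_occurrences(artsobs_number, relation="same as"):
    """Build the associatedOccurrences string for one specimen record.

    Returns an empty string when the number is missing or not numeric
    (assumption: valid ArtsObs numbers consist of digits only).
    """
    number = str(artsobs_number or "").strip()
    if not number.isdigit():
        return ""
    # json.dumps adds the surrounding double quotes to each part.
    return "{}: {}".format(json.dumps(relation), json.dumps(SIGHTING_URL.format(number)))
```

Returning an empty string for non-numeric input keeps malformed legacy values out of the published field rather than producing a broken link.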

I need to update all our MUSIT datasets with this, which I can do tomorrow.

dagendresen commented 2 years ago

The occurrence at https://www.artsobservasjoner.no/Sighting/20730716 looks to be the same as the occurrence https://www.gbif.org/occurrence/1949306038

occurrenceKey = 1949306038 GBIF URL = https://www.gbif.org/occurrence/1949306038 occurrenceID = urn:uuid:a6c68d9a-db53-4095-9785-4ea998f99414
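For reference, a small sketch of how the occurrenceID behind a GBIF occurrenceKey could be looked up through the public GBIF occurrence API. The helper names and the injectable `opener` parameter are assumptions made for illustration; error handling is left out.

```python
import json
from urllib.request import urlopen

# Public GBIF API endpoint for a single occurrence record.
GBIF_API = "https://api.gbif.org/v1/occurrence/{key}"


def occurrence_api_url(key):
    """Return the API URL for a numeric GBIF occurrenceKey."""
    return GBIF_API.format(key=int(key))


def fetch_occurrence_id(key, opener=urlopen):
    """Fetch the record and return its occurrenceID (None if absent).

    `opener` defaults to urllib's urlopen; it is a parameter only so the
    function can be exercised without network access.
    """
    with opener(occurrence_api_url(key)) as response:
        record = json.load(response)
    return record.get("occurrenceID")
```

Calling `fetch_occurrence_id(1949306038)` performs a live request, so the result depends on what GBIF currently serves for that key.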

Would it thus not be more appropriate to report the same occurrenceID = urn:uuid:a6c68d9a-db53-4095-9785-4ea998f99414 also for the MUSIT collection specimen, rather than to report the URL https://www.artsobservasjoner.no/Sighting/20730716 as associatedOccurrences?

Apropos "sameAs" a Human Observation would obviously NOT be the sameAs a Preserved Specimen as the one thing is a BFO occurrent (event) and the other thing is a BFO continuant (even an independent continuant) ;-)

rukayaj commented 2 years ago

Yes we talked about using the artsobs occurrenceID, but it's going to take time to add them manually into MUSIT. It will be a bit tricky to make them report the same occurrenceID as artsobs at any rate, because we're using the dwc triplet format for our MUSIT occurrenceIDs. And we shouldn't really change occurrenceIDs once they are published anyway, right? It could go in the otherCatalogNumbers field, which is also where we publish the uuid occurrenceIDs which are on the specimen sheets.
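A sketch of the "keep the occurrenceID stable, add the other identifier to otherCatalogNumbers" option. It assumes records are plain dicts keyed by Darwin Core term names and uses " | " as the list separator, which is the separator Darwin Core recommends for list-valued fields; the function name is hypothetical.

```python
SEPARATOR = " | "  # Darwin Core's recommended separator for list values


def add_other_catalog_number(record, identifier):
    """Return a copy of `record` with `identifier` in otherCatalogNumbers.

    The published occurrenceID is left untouched; existing entries are
    preserved and the identifier is not added twice.
    """
    raw = record.get("otherCatalogNumbers") or ""
    existing = [value.strip() for value in raw.split("|") if value.strip()]
    if identifier not in existing:
        existing.append(identifier)
    updated = dict(record)
    updated["otherCatalogNumbers"] = SEPARATOR.join(existing)
    return updated
```

Because the function is idempotent, it can be rerun over already updated datasets without duplicating entries.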

> Would it thus not be more appropriate to report the same occurrenceID = urn:uuid:a6c68d9a-db53-4095-9785-4ea998f99414 also for the MUSIT collection specimen -- than to report the URL https://www.artsobservasjoner.no/Sighting/20730716 as associatedOccurrence.

Isn't this the equivalent of doing sameAs, if they're given the same occurrenceIDs? The resolver would certainly confuse them as the same thing.

Maybe "same observation as" would then be a better descriptor?

dagendresen commented 2 years ago

sameAs on a GBIF record would indicate that the entity reported by the record is the sameAs the entity reported by another record? It would not really say anything about whether the occurrenceID name string is the sameAs another occurrenceID name string ... ;-) A Human Observation is by its nature obviously very different from a Preserved Specimen ;-)

I think it would be a great opportunity lost to not report for the specimen the same occurrenceID as declared by Artsobs when we know that the specimen is from the same Occurrence Event!

I think that reusing the same persistent IDENTIFIER most certainly must rank above keeping the occurrenceID strings stable - when such relationships are identified - and if the ArtsObs occurrenceID can be trusted to be a persistent IDENTIFIER.

Apropos, this use case is a very central example for what an annotation system could provide... ;-)

rukayaj commented 2 years ago

Isn't it possible for us to both keep the occurrenceID strings stable and clearly show the relationship between the occurrences using other means? Like otherCatalogNumbers, which (as far as I remember) the GBIF clustering algorithm uses to cluster records together?

PS: A separate debate, but I don't see how an annotation system would change the occurrenceIDs for these records? They could be annotated with the other occurrenceID, but this would not change the data we are publishing. We would still have to go through the process of importing the changes into MUSIT, and changing the data publication workflow so that we can change and republish the relevant records with new occurrenceIDs.

Sorry, I should have invited you to the meeting about this!

dagendresen commented 2 years ago

I think that the GBIF system has a very very serious flaw in that specimens are forced to be published as if they are occurrences! If we treat a GBIF data record as a very denormalized view of many different things - I think that we should much rather not publish any occurrenceID when we have none.

However, I think it is fair to infer that the entity reported by the denormalized GBIF record is what is declared by the basisOfRecord. When the basisOfRecord = PreservedSpecimen the entity described is the specimen. And when the basisOfRecord = HumanObservation the entity described is the species occurrence data point event thingy.

When the basisOfRecord = PreservedSpecimen, I think it is more than fair to infer that the entity reported by the GBIF data record is identified by the materialSampleID reported. And that when the basisOfRecord = HumanObservation the entity is identified by the occurrenceID reported. The nonsense for me starts when occurrenceID is mandatory and required to be unique (!!) for records with basisOfRecord = PreservedSpecimen.
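The inference described above can be sketched as a small lookup. This only illustrates the argument, under the assumption that a record is a dict keyed by Darwin Core term names; it is not how GBIF itself interprets records, and the function name is hypothetical.

```python
def primary_identifier(record):
    """Return (term, value) for the identifier that names the entity.

    Follows the reading argued for above: a PreservedSpecimen is named by
    its materialSampleID, a HumanObservation by its occurrenceID. Any
    other basisOfRecord yields (None, None).
    """
    basis = record.get("basisOfRecord")
    if basis == "PreservedSpecimen":
        return ("materialSampleID", record.get("materialSampleID"))
    if basis == "HumanObservation":
        return ("occurrenceID", record.get("occurrenceID"))
    return (None, None)
```

Under this reading, a specimen and the observation it came from could carry the same occurrenceID without being confused, because the specimen is resolved through its materialSampleID.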

I consider the Darwin Core triplets in the occurrenceID attribute for specimens to be ONLY a placeholder - to meet the GBIF requirement of reporting something in this attribute! If like here, we have a REAL occurrenceID that is nothing less than fantastic!

Vaguely indicating the relationship using string attributes like catalogNumber and otherCatalogNumbers is a very poor cousin of doing the right thing - which is reusing the persistent identifier!

One "other means" to declare the relationship would be to use an annotation system. But first, we need to have an annotation system, and second, there needs to be an uptake for the data annotation approach in our community, and then third the annotation system we use needs to be integrated in solutions the community implements... such as a Digital Extended Specimen.

I do not see an annotation system as collecting user feedback information that should be imported "back" into the source dataset -- at all! Even though this is of course one (limited utility) use case that such an annotation system could meet.

rukayaj commented 2 years ago

I just had a quick chat with Eirik about the possibility of getting the Artsobs uuid style occurrenceIDs imported into MUSIT, he suggested (and I agree with him) that it would be the perfect time to add these new IDs to our records when we swap to the new collections management system, which he thinks will happen next year.

Practically speaking, we both think it's going to be tricky to get MUSIT to add the uuids in now - we would probably have to add them manually, one by one, into the ARTSOBSID field. I am also wary of replacing the non-uuids which are already there if we do it manually, because there could be some copy and paste errors. If you think it's important we do it earlier than that, we can come up with another solution, or we could continue work on the annotator and try to add it from that angle. I personally think it is ok to have the other ID, which at least links nicely with the URL, and practically speaking any (human) looking at the record can see the relationship between them.

But as a step 1 we are going to have to add the other ID to associatedOccurrences anyway, because Artskart want to use that one for making the link between records explicit in their UI, so I'm going to go ahead with that.

> ... second, there needs to be an uptake for the data annotation approach in our community, and then third the annotation system we use needs to be integrated in solutions the community implements...

Yes, I see what you mean and I agree it would be useful to show this link as an annotation. But we can programmatically post this data into our annotation API (which is up and running http://annotate.gbif.no/) right now, the problem is that we don't have much control over your second and third points.

It strikes me that this would all be solved if we could just publish these records as materialSamples or specimens or something instead of occurrences!

dagendresen commented 2 years ago

Sounds like a very good strategy to do this with the move to the new CMS. And also securing the same GBIF occurrenceKey number in such a transition of occurrenceIDs would need some attention :-)

And I also agree that the "other URL" is useful! I am just arguing that the most appropriate occurrenceID for the specimens would be reusing the REAL occurrenceID declared by ArtsObs - when we in this brilliant example have this PID available. I repeat: brilliant use case :-) (It makes me sooo happy to discover that ArtsObs has such PIDs as occurrenceID!)

rukayaj commented 1 year ago

Discussed a little bit in one of the GBIF Node zoom meetings. There is now a basic link between MUSIT specimen records and artsobs sightings:

https://www.gbif.org/occurrence/2332446705

We’ve got specimen records with associatedOccurrences, i.e. "same occurrence as": "https://www.artsobservasjoner.no/Sighting/11985675"
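A consumer such as Artskart would need to read this value back. The sketch below assumes the field holds one or more `"label": "url"` pairs in the format shown above; both helper names are hypothetical and nothing here reflects how Artskart actually parses the field.

```python
import re

# Matches "label": "value" pairs as they appear in the example above.
PAIR = re.compile(r'"([^"]+)"\s*:\s*"([^"]+)"')
SIGHTING = re.compile(r"https://www\.artsobservasjoner\.no/Sighting/(\d+)")


def parse_associated_occurrences(value):
    """Return a list of (relationship label, URL) tuples."""
    return PAIR.findall(value or "")


def sighting_number(url):
    """Extract the ArtsObs sighting number from a sighting URL, else None."""
    match = SIGHTING.fullmatch(url)
    return match.group(1) if match else None
```

Splitting the parsing from the number extraction lets the same parser handle links to sources other than ArtsObs, which simply yield no sighting number.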

From Knut: Typical case for us at UMB marine invertebrates (pre 2020, when we started using musit): an "artsprosjekt" has deposited their material with us, and delivered the list of records to Artsdatabanken, who put it in artskart (->GBIF). We catalog the samples and put them in musit (->GBIF). Some specimens are DNA barcoded, and IBOL shares the data in GBIF. So that's three records of the exact same 1 specimen. In the associatedOccurrences