gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

GBIF Clustering algorithm - compare records within datasets #4565

Open gbif-portal opened 1 year ago

gbif-portal commented 1 year ago


One of our publishers pointed out that some related records (herbarium duplicates) aren't grouped together because they were published in the same dataset. See this example: https://www.gbif.org/occurrence/1144801832 and https://www.gbif.org/occurrence/1144801762.

Would comparing records within datasets be possible?


Github user: @ManonGros
System: Safari 16.1.0 / Mac OS X 10.15.7
Referer: https://www.gbif.org/occurrence/1144801832
Window size: width 1479 - height 803
System health at time of feedback: WARNING
datasetKey: f577c9f3-ae71-4278-b6bf-512ba1dfaa21
publishingOrgKey: 43a26bbf-466a-4335-96cc-01d0656c614a

timrobertson100 commented 1 year ago

I recall the reason I opted for cross-dataset only was due to problems with a few datasets that weren't easy to automatically detect. There are host/parasite datasets (gut worm perhaps?) where every record is a sub-sample of the host, identical to another except for a catalog number, and we ended up with things like 18,000 x 18,000 JOINs that didn't complete.

I think we'll need to design something robust (i.e. something that will also detect future datasets) that omits those kinds of datasets from internal clustering.
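
To make the idea concrete, here is a minimal sketch of one possible guard. It is not how the GBIF clustering pipeline works. It groups a dataset's records by every field except the identifiers and flags the dataset when any group grows past a threshold, since that is the shape of the host/parasite case described above. The function names, the ignored fields and the `max_block_size` default are all assumptions chosen for illustration.

```python
from collections import Counter

# Assumption: these identifier fields are the ones expected to differ between
# otherwise identical sub-samples. The real list would need tuning.
IGNORED_FIELDS = {"catalogNumber", "occurrenceID"}


def blocking_key(record):
    """Return a hashable key built from all fields except the ignored identifiers."""
    return tuple(sorted((k, v) for k, v in record.items() if k not in IGNORED_FIELDS))


def largest_block(records):
    """Size of the largest group of records that share the same blocking key."""
    counts = Counter(blocking_key(r) for r in records)
    return max(counts.values(), default=0)


def candidate_pairs(n):
    """Number of unordered record pairs a block of n records would generate."""
    return n * (n - 1) // 2


def safe_for_internal_clustering(records, max_block_size=50):
    """Hypothetical rule: skip within-dataset comparison when any block is too large.

    max_block_size is an arbitrary illustrative threshold, not a GBIF setting.
    """
    return largest_block(records) <= max_block_size
```

With a rule like this, a pair of herbarium duplicates forms a block of two and passes, while a dataset with thousands of near-identical sub-samples is excluded before any pairwise comparison is attempted. A fixed threshold is only one option; a ratio of largest block to dataset size might generalise better to future datasets.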

MattBlissett commented 1 year ago

https://www.gbif.org/occurrence/1144801832 and https://www.gbif.org/occurrence/1144801833 look like duplicates, with catalogue numbers G-G-175631/1 and G-G-175631/2.

https://data.gbif.ch/gbif-portal/#/?dataDialog=on&GBIFCHID=G-G-175631%2F2&dataTabIndex=0

https://data.gbif.ch/gbif-portal/#/?dataDialog=on&GBIFCHID=G-G-175631%2F1&dataTabIndex=0

http://www.ville-ge.ch/imagezoom/?fif=cjbiip/cjb22/img_126/G00174062.ptif&cvt=jpeg

http://www.ville-ge.ch/imagezoom/?fif=cjbiip/cjb22/img_126/G00174063.ptif&cvt=jpeg

(Edit: deleted as I misread.)

ManonGros commented 7 months ago

Relates to this issue: https://github.com/gbif/pipelines/issues/563