Open gbif-portal opened 1 year ago
As I recall, I opted for cross-dataset only because of problems with a few datasets that weren't easy to detect automatically. There are host/parasite datasets (gut worm perhaps?) where every record is a sub-sample of the host, identical to another except for the catalog number, and we ended up with things like 18,000 x 18,000 JOINs that didn't complete.
I think we'll need to design something robust (i.e. it will detect future datasets too) that omits those kinds of datasets from internal clustering.
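To illustrate the kind of guard this might need, here is a minimal sketch, not the actual GBIF clustering code. It groups records within a dataset by a blocking key and skips any group larger than a cap, so a dataset of thousands of near-identical sub-samples never reaches the pairwise comparison. The field names, the choice of blocking key, and the `MAX_GROUP_SIZE` value are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical cap: groups larger than this are not compared internally.
MAX_GROUP_SIZE = 50


def blocking_key(record):
    # Illustrative key only; the real clustering may block on other fields.
    return (
        record["datasetKey"],
        record["taxonKey"],
        record["eventDate"],
        record["recordedBy"],
    )


def candidate_pairs(records, max_group_size=MAX_GROUP_SIZE):
    """Return (pairs, skipped) for within-dataset comparison.

    pairs   -- list of (id, id) tuples from groups small enough to compare
    skipped -- list of (key, size) for groups that exceeded the cap
    """
    groups = defaultdict(list)
    for record in records:
        groups[blocking_key(record)].append(record)

    pairs = []
    skipped = []
    for key, group in groups.items():
        if len(group) > max_group_size:
            # Avoid the quadratic blow-up (e.g. 18,000 x 18,000) entirely.
            skipped.append((key, len(group)))
            continue
        pairs.extend((a["id"], b["id"]) for a, b in combinations(group, 2))
    return pairs, skipped
```

With a cap like this, two herbarium duplicates in the same dataset would still be paired, while a host/parasite dataset with thousands of records sharing one key would be reported in `skipped` rather than joined. Whether a fixed cap is robust enough for future datasets is an open question.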
https://www.gbif.org/occurrence/1144801832 and https://www.gbif.org/occurrence/1144801833 look like duplicates, with catalogue numbers G-G-175631/1 and G-G-175631/2.
https://data.gbif.ch/gbif-portal/#/?dataDialog=on&GBIFCHID=G-G-175631%2F2&dataTabIndex=0
https://data.gbif.ch/gbif-portal/#/?dataDialog=on&GBIFCHID=G-G-175631%2F1&dataTabIndex=0
http://www.ville-ge.ch/imagezoom/?fif=cjbiip/cjb22/img_126/G00174062.ptif&cvt=jpeg
http://www.ville-ge.ch/imagezoom/?fif=cjbiip/cjb22/img_126/G00174063.ptif&cvt=jpeg
(Edit: deleted as I misread.)
Relates to this issue: https://github.com/gbif/pipelines/issues/563
GBIF Clustering algorithm - compare records within datasets
One of our publishers pointed out that some related records (herbarium duplicata) aren't grouped together because they were published in the same dataset. See this example: https://www.gbif.org/occurrence/1144801832 and https://www.gbif.org/occurrence/1144801762.
Would comparing records within datasets be possible?
Github user: @ManonGros
System: Safari 16.1.0 / Mac OS X 10.15.7
Referer: https://www.gbif.org/occurrence/1144801832
Window size: width 1479 - height 803
System health at time of feedback: WARNING
datasetKey: f577c9f3-ae71-4278-b6bf-512ba1dfaa21
publishingOrgKey: 43a26bbf-466a-4335-96cc-01d0656c614a