gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

allow forced same-dataset occurrence clustering #1056

Open abubelinha opened 5 months ago

abubelinha commented 5 months ago

As discussed in #781, there are situations where repeated copies of the same occurrence within the same dataset should be allowed to be part of a cluster. The most typical example I can see is when copies of the same specimen have received different catalogNumbers within the same collection. Such duplicates can arise in several ways:

  1. This can easily happen when a collection was merged into another, and both might have various old specimens in common (so all of them became part of the same dataset, which is now published in GBIF).
  2. Occasionally, specimens are repeated on purpose within the same collection (e.g. to keep several copies of type specimens).
  3. Also when specimens are exchanged between institutions in different years (if previous packages' lists are not carefully checked, new packages may sometimes contain additional copies of specimens already submitted in previous packages).

Replicated occurrences between different datasets (3) are already being targeted by GBIF clustering algorithms. And I guess when that happens, all copies will be included in the cluster (even if one of those datasets has more than one copy).

But I think it would be useful in cases 1 and 2 for those specimens to also be shown as GBIF-detected clusters, even when no other datasets are involved. I am aware this can be a problem, since "there are many datasets that would just cluster everything (e.g. gut analysis) that brought a technical consideration with cardinalities, and our feasibility of actually calculating these in a timely manner" (quoting @timrobertson100).

To avoid that while still permitting 1 & 2, I suggest that human-curated otherCatalogNumbers relationships be the only condition that can trigger intra-dataset clustering.
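To make the suggestion concrete, here is a rough sketch of the rule I have in mind. It is not taken from the GBIF clustering code: the `Occurrence` class, its field names and both helper functions are hypothetical, and I am assuming otherCatalogNumbers is a list separated by `|`, as Darwin Core recommends. The idea is that two records from the same dataset only become a candidate pair when one of them lists the other's catalogNumber in otherCatalogNumbers. Cross-dataset pairs are simply passed on to whatever rules already exist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    """Hypothetical, simplified record; not the real GBIF data model."""
    gbif_id: str
    dataset_key: str
    catalog_number: str
    other_catalog_numbers: str = ""  # raw value, assumed "A | B | C"


def parse_other_catalog_numbers(raw: str) -> set[str]:
    """Split a pipe-separated list and normalize each entry."""
    return {part.strip().casefold() for part in raw.split("|") if part.strip()}


def linked_by_other_catalog_numbers(a: Occurrence, b: Occurrence) -> bool:
    """True if either record cites the other's catalogNumber."""
    a_cat = a.catalog_number.strip().casefold()
    b_cat = b.catalog_number.strip().casefold()
    return (
        b_cat in parse_other_catalog_numbers(a.other_catalog_numbers)
        or a_cat in parse_other_catalog_numbers(b.other_catalog_numbers)
    )


def is_candidate_pair(a: Occurrence, b: Occurrence) -> bool:
    """Decide whether two records may be compared for clustering at all."""
    if a.gbif_id == b.gbif_id:
        return False  # a record never pairs with itself
    if a.dataset_key != b.dataset_key:
        return True  # cross-dataset: defer to the existing rules
    # Same dataset: only a curated otherCatalogNumbers link qualifies.
    return linked_by_other_catalog_numbers(a, b)
```

Because same-dataset pairs without a curated link are rejected up front, a dataset in which everything resembles everything else (the gut analysis case) would not generate extra comparisons under this rule.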

I hope that is possible (and not too complicated) to implement. Thanks! @abubelinha