gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Clustering: Add associatedSequences #1003

Open timrobertson100 opened 6 months ago

timrobertson100 commented 6 months ago

https://www.gbif.org/occurrence/4010748380 and https://www.gbif.org/occurrence/4449937675 look to be linkable based on their associated sequences.

I expect having the same sequence URL would be sufficient to link without requiring additional dimensions, but we should verify this by checking the usage of the data.

timrobertson100 commented 6 months ago

Some notes:

Ignoring those, a straight join across associatedSequences (i.e. overlooking the fact that it is a multivalued field and needs to be split first) returns 19,032 relationships that span datasets. Of those, 16,936 relationships are not already detected in the clustering.
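
For illustration only, here is a minimal sketch of what the join could look like once the multivalued field is split. It is not the clustering code in this repository. The function names, the `(occurrence_id, dataset_key, associated_sequences)` record shape, and the `|` delimiter are all assumptions made for the example; the delimiter in particular would need checking against how publishers actually populate the field.

```python
from collections import defaultdict
from itertools import combinations


def split_sequences(value, sep="|"):
    """Split a multivalued associatedSequences string into a set of trimmed values.

    The "|" separator is an assumption; real data may use other delimiters.
    """
    if not value:
        return set()
    return {part.strip() for part in value.split(sep) if part.strip()}


def link_by_sequence(records):
    """Return pairs of occurrence ids that share a sequence value across datasets.

    `records` is an iterable of (occurrence_id, dataset_key, associated_sequences)
    tuples (a hypothetical shape chosen for this sketch).
    """
    # Index occurrences by each individual sequence value.
    by_sequence = defaultdict(set)
    for occurrence_id, dataset_key, sequences in records:
        for sequence in split_sequences(sequences):
            by_sequence[sequence].add((occurrence_id, dataset_key))

    # Emit each cross-dataset pair once, ordered by occurrence id.
    links = set()
    for members in by_sequence.values():
        for (id_a, ds_a), (id_b, ds_b) in combinations(sorted(members), 2):
            if ds_a != ds_b:
                links.add((id_a, id_b))
    return links
```

Indexing by the individual sequence value, rather than by the raw field, means two records link when they share any one sequence, even if the rest of their lists differ. Pairs within the same dataset are dropped, matching the "span datasets" condition above.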