gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)

GBIF human-curated clusters based on dwc:otherCatalogueNumbers #781

Open abubelinha opened 2 years ago

abubelinha commented 2 years ago

I come from portal-feedback#3973

I was suggesting a pure dataset+catalogueNumber combination as a way to permit human-curated clusters. As long as we follow strict rules to fill in those dwc field values with data taken from GBIF, I think those links could be trustworthy. (Perhaps the catalogueNumber part could also be standardised when clustering, as suggested in #777.)

i.e., we could use triplets of institutionCode:collectionCode:catalogNumber to fill in values in dwc:otherCatalogNumbers, like this: SANT:SANT:44553

GBIF just needs to find and group the occurrences that match those dwc:otherCatalogNumbers values, putting them in a given cluster together with the occurrence that points to them.

In theory, different institution/collection pairs might share identical codes, although I don't think that will really happen. But if that's an issue we could just replace the first two parts of the triplet (the institutionCode:collectionCode pair) with the dataset_key provided by GBIF, like this: dwc:otherCatalogNumbers = 1c334170-7ed1-11df-8c4a-0800200c9a66:44553

Of course that means data curators must first check GBIF in order to know the dataset_key values for the collections they want to link. But that is trivial using the API, and only needed once per collection (as I understand it, these GBIF dataset keys are pretty stable).
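For example, such a one-time lookup could use the public registry search API. A minimal sketch with the `requests` library (the search term is illustrative; confirm the right dataset manually):

```python
import requests

# One-time lookup of the datasetKey for a collection we exchange with.
# The search term "SANT" is illustrative only.
resp = requests.get(
    "https://api.gbif.org/v1/dataset/search",
    params={"q": "SANT", "type": "OCCURRENCE"},
)
resp.raise_for_status()
for result in resp.json()["results"]:
    print(result["key"], "-", result["title"])
```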

Once we know them, it's very easy to populate our dwc:otherCatalogNumbers with all the info we have about specimens we have exchanged with other institutions (or taken on loan from them, if our GBIF dataset is related to a scientific paper which relies on specimens examined from different collections).

These clusters would work no matter how those different institutions have filled in locations ("Spain" vs "España"), collectors ("T.Robertson", "Robertson, T.") or even taxonomic names (which may have been corrected in their databases, or in my data paper ... so all datasets in the cluster can benefit from knowing about each other).

I must say all this is just an idea, because my institution is not filling in dwc:otherCatalogNumbers yet, precisely because of that uncertainty about the best way to construct the values (triplets? pairs?). Look at the comments in the dwc:otherCatalogNumbers definition:

> Definition: A list (concatenated and separated) of previous or alternate fully qualified catalog numbers or other human-used identifiers for the same Occurrence, whether in the current or any other data set or collection.
> Notes: Recommended best practice is to separate the values in a list with space vertical bar space ( | ).
> Examples: FMNH:Mammal:1234 NPS YELLO6778 | MBG 33424

As you can see, the 1st example is a triplet separated by :, whereas the 2nd and 3rd are pairs separated by spaces. Not to mention that the 1st and 2nd are not separated by | (the suggested separator). That's not a good example of a standardized way of doing it.

As soon as GBIF clarifies a way to fill this in that permits reliable clusters, we would indeed start doing it. The good thing is that we can concatenate as many values as needed (no need to substitute current ones). Perhaps GBIF could accept dataset_key+catalogNumber as an additional pair for that field? Clusters might then be based only on values which use that kind of encoding. I am assuming that a GBIF dataset_key can easily be recognized by regex, and that we use a known separator (+ or whatever) before the catalogNumber part.
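Since GBIF dataset keys are standard UUIDs, the regex assumption holds. A sketch of how such a pair could be recognized (the ":" separator between key and catalogNumber is only one possible choice, exactly the point left open above):

```python
import re

# A GBIF dataset_key is a standard UUID, so it is easy to spot with a regex.
# Using ":" between the key and the catalogNumber is an illustrative choice.
DATASET_KEY_PAIR = re.compile(
    r"^([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}):(.+)$"
)

match = DATASET_KEY_PAIR.match("1c334170-7ed1-11df-8c4a-0800200c9a66:44553")
if match:
    dataset_key, catalog_number = match.groups()
    print(dataset_key, catalog_number)
```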

Thanks a lot!

abubelinha commented 2 years ago

BTW, is this the correct GitHub/GBIF place to raise this question? I have seen clustering-related issues in both portal-feedback and pipelines.

timrobertson100 commented 2 years ago

Thanks @abubelinha - this makes a lot of sense (and this repository is a good place to record it)

The rules as described in the blog refer to identifier overlap, and otherCatalogNumbers is included in that overlap; currently, though, other things (e.g. same species and location) must also align to trigger a match. This was simply a defensive approach to avoid too many false matches for those using e.g. running integers for IDs.

I'll run some queries shortly and try to summarize how it is used and how well we could rely on `otherCatalogNumbers`. I think it would be preferable to use the IC:CC:CN approach rather than the GBIF dataset key, to avoid this being too GBIF-specific.

timrobertson100 commented 2 years ago

This post serves to document some basic data research on the potential for otherCatalogNumbers to inform clusters.

Using the following tables as a basis:

```sql
-- All preserved specimens, with the identifier and taxon fields needed below
CREATE TABLE tim.cat STORED AS PARQUET AS
SELECT gbifid, datasetkey, institutionCode, collectionCode, catalogNumber, otherCatalogNumbers, kingdom, scientificName, acceptedScientificName, taxonKey, acceptedTaxonKey
FROM prod_h.occurrence
WHERE basisOfRecord='PRESERVED_SPECIMEN';

-- One row per declared otherCatalogNumber value
CREATE TABLE tim.other STORED AS PARQUET AS
SELECT gbifid, datasetkey, otherCatalogNumber
FROM tim.cat LATERAL VIEW explode(otherCatalogNumbers) o AS otherCatalogNumber
WHERE size(otherCatalogNumbers)>0;
```

There are 30,073,790 records from 1003 datasets listing 30,337,534 otherCatalogNumbers:

```sql
SELECT count(DISTINCT gbifID), count(DISTINCT datasetKey), count(*) FROM tim.other;
```

Looking at the format of the identifiers, we anticipate some delimiting of institutionCode, collectionCode and catalogNumber:

```sql
-- Count how many times the delimiter appears within each value
-- (this counts ':'; the equivalent was run for '-')
SELECT t.colonCount, count(*) AS freq
FROM
  (SELECT LENGTH(regexp_replace(otherCatalogNumber,'[^:]','')) AS colonCount
  FROM tim.other) t
GROUP BY t.colonCount ORDER BY t.colonCount;
```
| Delimiter | Delimiter count | Number of records | Notes |
|---|---|---|---|
| : | 0 | 19009829 | Expected |
| : | 1 | 3585636 | institutionCode : catalogNumber |
| : | 2 | 4426228 | institutionCode : collectionCode : catalogNumber |
| : | 3 | 93736 | |
| : | 4 | 1656576 | e.g. LSID, urn:catalog:RBINS:IG:17225 (48 datasets using this format) |
| : | 5 | 1561015 | LSID with version (and ; delimited identifiers - publisher has been notified) |
| : | 6 | 34 | |
| : | 8 | 4400 | |
| : | 9 | 1 | |
| : | 10 | 28 | |
| : | 12 | 21 | |
| : | 14 | 29 | |
| : | 20 | 1 | |
| - | 0 | 30160859 | Expected |
| - | 1 | 170952 | institutionCode - catalogNumber |
| - | 2 | 4438 | institutionCode - collectionCode - catalogNumber |
| - | 3 | 1201 | |
| - | 4 | 22 | |
| - | 5 | 59 | |
| - | 7 | 2 | |
| - | 8 | 1 | |

Next up, I'll look at the performance of some of the joins.

timrobertson100 commented 2 years ago

(This needs checking, but I'm leaving it here as I pause this for a couple of days.)

Assuming the catalogNumber itself is tokenized (e.g. AB12 or MAMM:AB12) we can get a sense of how many records might link if we materialize the ways catalogNumbers are commonly used:

```sql
-- For every specimen, materialize the common ways its codes may have been
-- concatenated, then join those against the declared otherCatalogNumbers
SELECT count(DISTINCT m.gbifID) AS records, count(DISTINCT m.datasetKey) AS datasets
FROM tim.other o
  JOIN (
    SELECT t.gbifID, t.datasetKey, code FROM
      (SELECT
      gbifID, datasetKey,
      array(
        concat_ws(':', institutionCode, catalogNumber),
        concat_ws(':', institutionCode, collectionCode, catalogNumber),
        concat_ws('-', institutionCode, catalogNumber),
        concat_ws('-', institutionCode, collectionCode, catalogNumber),
        concat_ws(':', 'urn:catalog', institutionCode, collectionCode, catalogNumber)
      ) AS codes
    FROM tim.cat) t LATERAL VIEW explode(codes) x AS code) m
  ON m.code=o.otherCatalogNumber
-- cross-dataset links only, each pair counted once
WHERE o.datasetKey != m.datasetKey AND o.gbifID<m.gbifID;
```

> 3,439,777 records across 402 datasets

timrobertson100 commented 2 years ago

I've been exploring algorithms here, and want to leave a note on why this gets difficult. The algorithm I currently have ignores the delimiters (`,`, `:`, `/` etc.) but matches otherCatalogNumber on record 1 with IC:CN and IC:CC:CN on record 2.

UCLA W57-41 and UCLA : W57-41 are a nice example of where otherCatalogNumber alone can be matched to seemingly good effect(?), detecting a good link to a reidentified record.

However, TNHC2799 and TNHC:2799 illustrate where this can go wrong (I think).
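To make the failure mode concrete, here is a toy version of a delimiter-insensitive comparison (my own sketch, not the actual pipeline code): it links the UCLA pair correctly, but links the TNHC pair just as confidently.

```python
import re

def normalize(code: str) -> str:
    # Toy stand-in for the delimiter-insensitive match described above:
    # drop everything except letters and digits, then uppercase.
    return re.sub(r"[^A-Za-z0-9]", "", code).upper()

# The good case: the explicit cross-reference survives normalization.
assert normalize("UCLA W57-41") == normalize("UCLA : W57-41")

# The bad case: a bare catalogNumber that happens to start with the letters
# "TNHC" collides with a genuine IC:CN pair from an unrelated record.
assert normalize("TNHC2799") == normalize("TNHC:2799")
print("both pairs match under delimiter-insensitive comparison")
```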

There is so much variation in how codes are constructed that I think we need more than catalogNumber alone to make the match. Perhaps a locality, or a date or a person in addition to the code overlap?

Alternatively, we could explore what a stricter match using only the IC:CC:CN version (no IC:CN) might yield. One would presume a triplet overlap of codes was a strong signal of a relationship - perhaps adding a Kingdom scope (although presumably there are cases that would cross kingdoms related to e.g. host relationships)

ManonGros commented 2 years ago

@timrobertson100 Yes I think you are right, the catalogue number overlap alone isn't enough to infer the link.

If possible at all, I think excluding the matches that have different recordedBy would help clean up some false positives, like we already do for the matches that have conflicting dates or locations.

Otherwise, having an identifier overlap plus the same recorder or date is a good idea. The locality might be really difficult to compare, but perhaps the stateProvince when filled in? Would that be possible?

abubelinha commented 2 years ago

Sorry, but I feel I asked for something a little bit different from what you are trying to do.

Probably my misunderstanding of how the clustering works led me to ask for a feature which is not easy to implement here.

These clusters look to me as an example of computer intelligence working to find out groups of occurrences that have a high probability of being duplicates of each other.

But I want to stress the "human-curated" words in the title. What I was asking for is to give humans (database curators) a chance to "force" items into a cluster, no matter what the computers say about it. Let curators do the work of finding out whether two GBIF occurrences are the same or not, and give them a way of recording it, even if this may lead to some wrong links (well, let's see how many of them finally are wrong).

That's the reason I suggested to use dataset_key.

I think it would be preferable to use the IC:CC:CN approach rather than the GBIF dataset key to avoid this being too GBIF-specific.

I agree @timrobertson100, this is absolutely too GBIF-specific. But who cares? This clustering functionality is a GBIF-specific test service. And as you are labeling clusters according to which matching algorithm was used to produce them, why not have one label which says "human-forced cluster" (trust it at your own risk)?

As a human curator, taking care of generating good links is my own problem: I should take note of the correct GBIF dataset_keys of interest to me, and also note the way each dataset formats its catalogNumbers when publishing to GBIF. That way my DwC-generating script can know to prepend a "herpetology-" string to all otherCatalogNumbers related to a certain dataset, add a "-1" at the end of some others, and so on.

So, even if you develop an algorithm capable of creating clusters by using catalogNumber in combination with other fields ... I keep my question about the other possibility, at the risk of it being labeled off-topic here (no idea if this is technically compatible with your current clustering approach).

Thanks again for all your hard work on this.

timrobertson100 commented 2 years ago

Thanks @abubelinha. Just confirming you aren't being misunderstood - the investigation really has been about whether we could issue guidelines on use of otherCatalogNumber as a publisher-declared link. There really isn't any intelligence going on, just an exploration of how people have populated that field so far, and whether guidelines that use the institution, collection and catalog codes/numbers would be possible now.

One worry with using datasetKey is that it is likely less stable than codes over time. People do repackage their datasets from time to time, where the collection code may - or may not - be more stable.

Another idea we could consider is a publisher providing links to other records using relatedResourceID, which would allow you to strongly link directly to another record in GBIF. That would be at the mercy of GBIF record ID stability, but we now have tombstone pages (example) and have implemented checks to improve record ID stability that make this more attractive. Have you any thoughts on that approach please?

ManonGros commented 2 years ago

there is also the associatedOccurrences field

abubelinha commented 2 years ago

Thanks for the info @timrobertson100 & @ManonGros .

At first glance, I have the impression that using them might be much more difficult for the average user, because the use case I propose is like this:

A collection curator knows that some specimens are duplicates of a specimen in another, remote collection (the foreign IC+CC+CN is printed on the local label and stored in the local database). The number of these remote collections grows with time, but it tends to stay more or less limited (a given collection tends to exchange with the same institutions as in previous years). The number of linkable specimens grows much more quickly (each of these exchanges could contain dozens or hundreds of duplicate numbered specimens, all coming from the same collection).

Now I am just guessing how linking them could be done:

Correct me if I am wrong. For using relatedResourceID and associatedOccurrences (option A), a db curator needs to:

  1. Find all those related occurrences in GBIF, and parse them to grab the needed linkable field value (which may or may not be constructed from the catalogNumber).
  2. Store those field values locally, either in the duplicate specimen record or in an intermediate table. Easily many thousands of records to store.
  3. Use them to populate either relatedResourceID or associatedOccurrences when producing the next IPT DwC mapping.

Instead (option B), by trusting IC+CC+CN (or dataset_key+CN), the steps are simpler (a sketch follows the list):

  1. Just check (once a year or so) how a remote collection is formatting its catalogNumber fields, plus the static value of their IC+CC (or dataset_key, whatever you decide is better).
  2. Store that formatting info (for each remote collection of interest) somewhere (the bigger the institution, the more exchanges with other institutions ... but this table might vary from a handful to some hundreds of records, not much more).
  3. As my local DB already knows many remote catalogNumber values associated with each local occurrence, it is just a matter of formatting them (according to step 2) in order to fill in dwc:otherCatalogNumbers when publishing to the IPT.
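A minimal sketch of what steps B2-B3 could look like in a publisher script (the rules table, collection ids and code prefixes are hypothetical, invented for illustration):

```python
# Step B2: per-remote-collection formatting rules, reviewed once a year or so.
# The keys and the IC:CC prefixes below are invented placeholders.
FORMAT_RULES = {
    "remote-herbarium": lambda cn: f"SANT:SANT:{cn}",
    "remote-herpetology": lambda cn: f"XYZ:Herp:herpetology-{cn}",
}

def other_catalog_numbers(links):
    """Step B3: links is a list of (remote_collection, remote_catalog_number)
    pairs already stored in the local database for one occurrence."""
    values = [FORMAT_RULES[collection](cn) for collection, cn in links]
    return " | ".join(values)  # the DwC-recommended " | " separator

print(other_catalog_numbers([("remote-herbarium", "44553"),
                             ("remote-herpetology", "2799")]))
# -> SANT:SANT:44553 | XYZ:Herp:herpetology-2799
```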

A1 is much more difficult and time-consuming than B1, I think. Also, for doing the A1 searches, we would need to try combinations of IC+CC+CN anyway (or dataset_key+CN). And how can we be sure that steps A1-A2 do not need to be repeated? (The next time we republish to our IPT, a certain percentage of those foreign occurrence IDs might have changed ... see below.)

One worry with using datasetKey is that it is likely less stable than codes over time. People do repackage their datasets from time to time, where the collection code may - or may not - be more stable.

I don't get what you mean by that repackaging. Is that something that would change the datasetKey more easily than the identifiers of each occurrence? Anyway, my use case already assumes that steps B1-B2 should be repeated from time to time, so those datasetKey changes shouldn't be an issue.

And yes, it's great that GBIF occurrenceIDs tend to be more and more stable (thanks for the links!).

But in the end, that's just statistics: let's say in 2020 there were changes in 5% of occurrenceIDs, and in 2021 it was only 2%. There will always be a small chance of change for any occurrenceID, I think. Wouldn't that imply that steps A1-A2 need to be repeated as well from time to time?

It would be good to know the opinion of other db curators. Maybe I am wrong in my assumptions about what would be easier to do in other institutions.

abubelinha commented 2 years ago

Sorry for being too verbose. After reading myself again, I re-edited some of my previous explanations for more context and clarity.

timrobertson100 commented 2 years ago

Thanks @abubelinha

I don't get what you mean with that repackaging

By dataset repackaging, I mean the situation that sometimes people replace their datasets completely after e.g. choosing to merge or split collections or how they share them.

Instead (option B), by trusting IC+CC+CN (or dataset_key+CN):

Option B is certainly the preferred approach.

I've tweaked what I ran before to be more strict to link when otherCatalogNumber matches the combination of IC + CC + CN (ignoring how they are concatenated using /:,_ etc, whitespace and casing).

This yielded only 1139 additional records over the current system (which is good), and looking at a selection I can see they appear reasonable (e.g. this and this co-collected Fungi of different species (same branch perhaps) linked explicitly by the recorder).

@ManonGros: I think it would therefore be reasonable to add a rule that allowed a link to be made for specimens only having otherCatalogNumber overlap with the combination of all three of the IC, CC, CN.

This would be standards-compliant and would provide @abubelinha with what he really seeks - the ability to assert a link to other records using the codes on the label. Formatting of those codes would be the responsibility of the publisher making the links, and we can document that it needs to be a combination of all 3 parts, encouraging people to update GRSciColl and clean their data.

I've attached the new relationships - Does this seem reasonable to you please @ManonGros?

@abubelinha - it would mean you populate otherCatalogNumbers with a code in the format IC:CC:CN

abubelinha commented 2 years ago

Thanks a lot @timrobertson100. We will try this asap. I have a couple of questions though:

I had not realized until now that GBIF only compares occurrences from different datasets when clustering. I wonder if this could be bypassed JUST for the particular case of otherCatalogNumber-linked specimens.

I mean, (Q1) if I use otherCatalogNumber to point to an occurrence within the same dataset, would it be possible to cluster them as well?

Not sure how useful this could become, but at first glance it would be very interesting to know about these clusters: when using specimens from another institution (or before requesting a loan), knowing that some of them are repetitions of the same specimen would help a lot to exclude them from the study/request. In other cases, it could be useful to know about them all for inclusion in the study (i.e., if one of them is being used for typifying a new taxonomic name, this would help detecting and labelling them all).

Also, this could reveal more links between institutions which would otherwise remain invisible. As an elaborated example, imagine 3 collections (datasets A, B, C) with preserved and numbered specimens.

Over the years, collection A has received yearly exchanged specimens from B and C. Based on info on the A labels, curators have created otherCatalogNumber links like these:

| institutionCode | collectionCode | catalogNumber | otherCatalogNumber | notes |
|---|---|---|---|---|
| A | A | 111 | **B:B:5555** | link created from info in labels received from B |
| A | A | 222 | | specimen received from B or C, but info in labels did not reveal any links |
| A | A | 333 | **C:C:6666** | link created from info in labels received from C |
| A | A | 444 | *A:A:111 \| A:A:222 \| A:A:333* | links created much later ... when I realized these were all duplicates of the same collection |

When organisation A began creating links, they just paid attention to the numbers on labels of duplicate specimens received from B and C, which led to links between organisations (in bold). This info alone would generate only two clusters, with two elements each.

But other yearly exchanges received from B and C did not contain their numbers on the labels (so they were stored at A without any info permitting links to the original B & C specimens).

Later on, when curating my A dataset, I discovered by my own means some internal coincidences between A specimens (i.e. collector names & numbers, location descriptions, dates or whatever).
This led me to generate internal links (in italics above). If GBIF takes these internal links into account, this would generate a new cluster.

Looking at clusters 1, 2 and 3 altogether, it is obvious that this is all just one big cluster of six elements.

So, assuming B and C curators are not generating this kind of links, and the GBIF clustering system is not smart enough to catch these relationships (i.e., collector and place names typed differently in each database) ... paying attention to A's internal links would make it possible to link B:B:5555 and C:C:6666 (so transferring knowledge between these two institutions as well).
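In graph terms, each link is an edge and a cluster is a connected component, so merging clusters 1-3 is just a transitive closure. A small sketch over the example's identifiers:

```python
from collections import defaultdict

# Minimal union-find over the links in the example table above.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Inter-institution links (bold in the table) plus A's internal links (italics).
edges = [
    ("A:A:111", "B:B:5555"),
    ("A:A:333", "C:C:6666"),
    ("A:A:444", "A:A:111"),
    ("A:A:444", "A:A:222"),
    ("A:A:444", "A:A:333"),
]
for a, b in edges:
    union(a, b)

clusters = defaultdict(set)
for node in list(parent):
    clusters[find(node)].add(node)
print(list(clusters.values()))  # a single component with all six specimens
```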


Apart from my interest in internal links, this leads me to a different, related question. (Q2) Does GBIF plan to somehow reanalyse occurrences present in clusters originated in different ways, and check whether that could lead to bigger clusters?

Knowing that, GBIF could create:

Thanks

ManonGros commented 2 years ago

I've attached the new relationships - Does this seem reasonable to you please @ManonGros?

Thanks @timrobertson100, I checked some more and it really looks good!

timrobertson100 commented 2 years ago

Thanks @ManonGros - I will tidy up the code and merge it in the coming days.

We will try this asap

We haven't implemented it yet @abubelinha, so please hold off on testing for now. You can prepare data knowing this is coming, though.

to point to an occurrence within the same dataset, would it be possible to cluster them as well?

Yes, but let's track that in a separate issue, please

timrobertson100 commented 2 years ago

This has just been run in production for the first time.

Two things remain to close this:

  1. Merge the code (a couple of minor changes sit on a different computer, so I'll do that when next in the office)
  2. Update the blog with the details on how to assert the link @ManonGros

abubelinha commented 2 years ago

I wonder if it makes any sense to use gbif-uat.org as a testing environment for otherCatalogNumbers links. I suppose that would be useless, since clusters are only run on the production portal. Correct?

BTW: how frequently are clusters generated? Is this planned/documented somewhere? (So we can plan to update datasets before those dates.)

Thanks!

timrobertson100 commented 2 years ago

Thanks, @abubelinha. I'm afraid our UAT environment is not equipped to do clustering (it's very resource intensive).

Clustering at the moment runs periodically, but not on a schedule while it is an experimental feature. We expect it will be run ~weekly in the future, possibly more frequently.

ManonGros commented 1 year ago

@timrobertson100 the blogpost/documentation should now be up to date

abubelinha commented 1 year ago

Thanks, @abubelinha. I'm afraid our UAT environment is not equipped to do clustering (it's very resource intensive).

This is a bit off-topic, but not that much: I am interested in using the UAT environment a lot for testing. And in this human-curated clusters context, I am also interested in somehow linking the items in those clusters myself (so they stay linked even if the GBIF clustering system ceases to exist or stops working).

By linking myself I mean actually introducing a set of clickable references in each occurrence, so a user who is viewing that occurrence can click and navigate to the other occurrence URLs (and these links should also be available through the API and downloaded data). What would be the recommended way of doing this when creating our dataset and mapping our data in the IPT? (Maybe I need to use a certain DwC extension.)

So my question could be addressed in different ways:

  1. Are there any DwC fields where I can actually feed HTML content which works in both API and portal interfaces? (just for manually creating HTML links)
  2. Better: is there any DwC field specifically suitable for this (so when I publish the IPT dataset, GBIF interprets the data and creates links pointing to the other occurrences), so I don't have to create HTML links myself?
  3. Best, but probably impossible: I guess approaches 1 and 2 would mean I have to previously collect the gbifIDs of my target occurrences in order to create these links. But I'd prefer just doing as @timrobertson100 said above: "populate otherCatalogNumbers with a code in the format IC:CC:CN" ... and as long as the triplet is unique, GBIF can interpret this info to find the gbifID and link the occurrences in the user/API interfaces.

I know this probably deserves a new issue, but I doubt nobody else has suggested such a thing before (I searched GitHub issues but still couldn't find it). If this is not a duplicate question, where should I open it? pipelines? ipt?

For now I post the idea here, as this is doubly interesting for me in the current clustering issue scenario.

Thanks a lot @timrobertson100 @ManonGros

abubelinha commented 6 months ago

Hi again @ManonGros @timrobertson100

A side question related to this issue.
Using the occurrence search API with the is_in_cluster=true option, today I discovered ~1500 items of my institution forming part of clusters.

But when I tried to get the information about the related occurrences, 5 of those clusters had disappeared (empty JSON).

But if you remove the experimental/related part of the URLs, you can still see the "isInCluster": true value.

I suppose the related elements changed in some way so that the cluster conditions are not met anymore, but the isInCluster value takes a while to be updated. Right?

That's fine. But is there any way to check the past status of a cluster? I am interested in tracking those changes to detect whether our dataset info should be changed accordingly. Looking at this past comment, it seems to be possible:

(it) was never in this cluster which I can confirm by looking at the backend database that holds the links.

Is there any way I can do it myself? (Recover the info about which occurrences were previously in a now-disappeared cluster.) I mean, by using whatever JSON API call or other type of query I can launch from Python.

Thanks a lot in advance

timrobertson100 commented 6 months ago

I'm very sorry for the slow reply @abubelinha

I suppose the related elements changed in some way so cluster conditions are not meet anymore, but the isInCluster value takes a while in being updated. Right?

That's correct. It's a little convoluted but we run a batch job daily (normally) to relate similar records and store links. When a dataset is reprocessed, it looks in the "links" table and determines if the record has been linked before. If so, isInCluster is set to true in the search index (We're aware this has lots of room for improvement). I'll trigger a reprocessing of your dataset now to update the search index.

We don't hold the history of this table, so it's not easy to determine what those links may once have been. I suspect(?) that either the target institution or collection codes you have used may have changed for the records you curated, or there were records in there that clustered algorithmically (e.g. to an observation or so) which have changed - possibly due to recent changes to eventDate processing to better support date ranges.

I know this doesn't help find the broken links but hopefully it explains what has happened.

abubelinha commented 6 months ago

Thanks for explaining the reasons @timrobertson100

We are interested in exploring clusters related to our datasets and how those relations change over time. I'd like to track the related occurrences in my own table, like this (sorted by columns 1, 2, 3 ASC):

| date | followed_gbifId | related_gbifId |
|---|---|---|
| 2024.03.01 | 12345 | 888888 |
| 2024.03.01 | 12345 | 987654 |
| 2024.03.01 | 32123 | 777777 |
| 2024.04.01 | 12345 | 888888 |
| 2024.04.01 | 32123 | 777777 |
| ... | ... | ... |

So we can query this table to select those occurrences which were related at some point but are not anymore (i.e. 12345 & 987654) ... and check the reasons (if the taxonomic identification has changed in 987654, we might want to review our own identification in 12345).

I would do the following (monthly or so; a sketch follows the list):

  1. Use the occurrence search API with the is_in_cluster=true option
  2. Loop over all those occurrences and get the related gbifIds with /occurrence/{our_gbifId}/experimental/related
  3. Extract and insert into the above table just the paired gbifIds and the current date
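A minimal sketch of that monthly job (the field names in the experimental `related` response are assumed from inspection and may change; verify against a live response):

```python
import datetime
import requests

API = "https://api.gbif.org/v1"
DATASET_KEY = "1c334170-7ed1-11df-8c4a-0800200c9a66"  # example key from earlier

def cluster_pairs(dataset_key):
    """Yield (date, followed_gbifId, related_gbifId) rows for the table above."""
    today = datetime.date.today().isoformat()
    offset = 0
    while True:
        # Step 1: page through our occurrences currently flagged as clustered.
        page = requests.get(f"{API}/occurrence/search", params={
            "datasetKey": dataset_key,
            "isInCluster": "true",  # the portal URL spells this is_in_cluster
            "limit": 300,
            "offset": offset,
        }).json()
        for occ in page["results"]:
            gbif_id = occ["key"]
            # Step 2: the experimental related-occurrences call.
            related = requests.get(
                f"{API}/occurrence/{gbif_id}/experimental/related"
            ).json()
            # Step 3: keep only the paired ids and the current date.
            # "relatedOccurrences" -> "occurrence" -> "gbifID" are assumed names.
            for rel in related.get("relatedOccurrences", []):
                yield (today, gbif_id, rel["occurrence"]["gbifID"])
        if page["endOfRecords"]:
            break
        offset += 300

for row in cluster_pairs(DATASET_KEY):
    print(row)
```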

Step 2 means running thousands of API queries and downloading much more info than I need. Is there a more direct way to get the same info using the experimental api-sql-downloads?
I am not sure if the is_in_cluster or related-occurrences info is available there. Also, I don't know if the field syntax changes somehow compared to the regular API (i.e. datasetKey vs dataset_key).

Being experimental, would our regular download api authentication credentials be enough to test it, or do we need to ask helpdesk for permission?

Thanks!

timrobertson100 commented 6 months ago

Thanks again

I suggest a preliminary step to force a refresh of your dataset and ensure the "is_in_cluster" flag is up to date.

You would do this by 1) recreating your DwC-A, or simply changing the timestamp on the file on your webserver (on Linux you'd do a touch my-file.dwca, but I recall you are on Windows?), and then 2) issuing an empty authenticated POST to https://api.gbif.org/v1/dataset/1c334170-7ed1-11df-8c4a-0800200c9a66/crawl. GBIF crawlers will notice the timestamp is newer, and that will force a full reprocessing of your dataset.
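A sketch of step 2 in Python (the endpoint is the one quoted above; credentials are placeholders for your GBIF.org account):

```python
import requests

# Step 2: empty authenticated POST asking GBIF to re-crawl the dataset.
# Replace the placeholders with your GBIF.org credentials.
resp = requests.post(
    "https://api.gbif.org/v1/dataset/1c334170-7ed1-11df-8c4a-0800200c9a66/crawl",
    auth=("your_gbif_username", "your_password"),
)
resp.raise_for_status()  # any 2xx status means the crawl request was accepted
```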

Give some time for that process to complete (e.g. 1hr) and then proceed with your 1, 2, 3 steps.

I'm afraid the SQL API would only allow you to achieve 1) at the moment, and given it's not a huge dataset, I think your approach is likely the better one. The SQL API would allow something like SELECT gbifID FROM occurrence WHERE datasetKey=... AND isInCluster IS TRUE (I haven't verified the syntax). You'd still need to issue the 1500 individual calls, I'm afraid, for the time being.

As an aside, you can see from your ingestion history that we notice that the dataset rarely changes. I don't know if this is as you would expect?

We'll help in any way we can

abubelinha commented 6 months ago

Thanks again @timrobertson100 for the explanations.

I guess you think we republish our dataset(s) frequently, and that this is why I want to catch possible new clusters and detect those which have changed. Indeed, but the first part of that assumption is not yet true (as you noticed from the ingestion history: that's right).

So why am I bothering with this right now? I was interested in catching other datasets' updates which might cause cluster changes. Despite our datasets not being frequently updated, some of the related occurrences may change anyway (for example, when we send duplicate specimens to another institution and they update their datasets, which may generate new related occurrences). Also, some previously-related occurrences may change unexpectedly, as I reported above: 5 out of 1500 clusters "disappeared" without any changes on our side. Those are my main concern.

That's why I planned to do this monthly.

Of course, when we republish our datasets I'll wait for crawling before trying to catch cluster changes and update my table, as you suggest. I expect the 5/1500 proportion to be higher in that case.

So I understand there is no copy of the "cluster-relations tables" which we can query from the SQL API. I was interested in that approach with another SELECT in mind, based not on a datasetKey but on a long list of gbifIDs.

That way I could track not only our own datasets' occurrences, but also those which were once upon a time in the right column of my table above (but are not anymore): if they were once in the same cluster, there is a big chance that they should still be, and I still want to track their taxonomic identification changes. A good example is when some of the related items change their georeference accuracy:

As for the coordinates difference, I am afraid all occurrences in the cluster have been rounded to 0.01 Lat/Lon degrees precision. The cluster-excluded observation hasn't been rounded, which explains the discrepancy.

Thanks - I had suspected something like that had happened and your record actually had the more accurate georeference. I chose the limits of 200m and 2km to accommodate 3 and 4 decimal place rounding globally, but here we are at 2 decimal places.

timrobertson100 commented 6 months ago

Excellent, thanks.

So I understand there is no copy of the "cluster-relations tables" which we can query from the SQL API.

That's right, at least today, and in honesty I don't imagine it'll be there in the short term.

As things develop, it would be really interesting for us to learn how much useful information you gain from the clusters - e.g. a new identification you can apply or so. It's one of those things we imagine might be happening, but never really know...

abubelinha commented 6 months ago

We are testing the SQL API and I am a bit confused about how to filter by isInCluster field values. It looks like text case doesn't matter: the query validation system converts isInCluster to occurrence.isincluster. But I am not sure about the data type: the documentation says both gbifid and isincluster are string data types. Is that a mistake? (I would expect numeric and boolean):

| Column name | Data type | Nullable | Definition |
|---|---|---|---|
| gbifid | String | No | Unique GBIF key for the occurrence. We aim to keep these keys stable, but this is not possible in every case. |
| isincluster | String | Yes | Experimental. Whether the occurrence belongs to a machine-calculated cluster of probable duplicate occurrences. |

That said, yesterday I tried several ways, but none of them seems to be working (I waited a few hours, and they either failed or didn't end). I have now cancelled and re-run them.

I am not sure whether using TRUE/true (for both booleans and strings) should make a difference (but all of them had previously passed the SQL validation system).

MattBlissett commented 6 months ago

Hi,

Thanks for pointing out the problem with the documentation.

It seems "IS TRUE" isn't working correctly, but the other two forms should work. "= true" is the more correct form; "= 'true'" works by converting the 'true' string to a boolean.

Our cluster was busy yesterday processing the updated eBird dataset, and there's currently a bug giving SQL downloads a low priority. These downloads will often be slower than the older predicate-based downloads, but I expect the two that are currently running to complete.
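For reference, a minimal sketch of submitting the working "= true" form through the SQL download API (the request shape follows the experimental SQL download documentation; credentials and the dataset key are placeholders, so verify before relying on it):

```python
import requests

# Submit an SQL download request using the "= true" form confirmed above.
body = {
    "sendNotification": True,
    "format": "SQL_TSV_ZIP",
    "sql": (
        "SELECT gbifid FROM occurrence "
        "WHERE datasetkey = '1c334170-7ed1-11df-8c4a-0800200c9a66' "
        "AND isincluster = true"
    ),
}
resp = requests.post(
    "https://api.gbif.org/v1/occurrence/download/request",
    json=body,
    auth=("your_gbif_username", "your_password"),
)
print(resp.text)  # returns a download key to poll at /occurrence/download/{key}
```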

abubelinha commented 6 months ago

Thanks for explaining @MattBlissett

Could any of you confirm whether the UAT-non-clustering situation has changed?

Thanks, @abubelinha. I'm afraid our UAT environment is not equipped to do clustering (it's very resource intensive).

I think the UAT-occurrence data is showing clusters now: https://www.gbif-uat.org/occurrence/3083663793/cluster

BTW, regarding SQL downloads, I'd suggest always including the UAT system by default (when helpdesk gives access to interested users), so we avoid having those requests run on the production site just for testing purposes.

timrobertson100 commented 6 months ago

Could any of you confirm whether https://github.com/gbif/pipelines/issues/781#issuecomment-1293431220 has changed?

It has - we do now cluster in UAT, but note that UAT is now much smaller in data volume than it once was so it will detect few matches and give results for whatever is in the test environment at that time. Please be aware that clustering is run daily in UAT, but that might be reduced at busy times.

abubelinha commented 6 months ago

Thanks!

Clustering in UAT is great, so we can test the usage of otherCatalogNumbers to force human-curated clustering.

Is own-dataset clustering already possible using that otherCatalogNumbers approach?

timrobertson100 commented 6 months ago

Is https://github.com/gbif/pipelines/issues/781#issuecomment-1257766417 already possible using that otherCatalognumber way?

Sorry, it is not at the moment

abubelinha commented 6 months ago

OK. I hope you didn't finally decide it was not possible:

to point to an occurrence within the same dataset, would it be possible to cluster them as well?

Yes, but let's track that in a separate issue, please

I suspect this was finally forgotten. Should I open a new issue, or had you already created one about that? EDIT: I opened https://github.com/gbif/pipelines/issues/1056 just in case.


As for the currently working inter-dataset clustering, I see other issues that don't mention spaces before and after the vertical bars, but I understand they should be used to separate multiple catalogue number triplets, as per the dwc:otherCatalogNumbers documentation:

'institutionCode:collectionCode:catalogNumber | institutionCode:collectionCode:catalogNumber | ...'

EDIT: as you suggested, I would use ":" as the internal separator character between the 3 triplet parts.

But what if we find some providers already using ":" inside any of those 3 parts?

  1. institutionCode: XYZ
  2. collectionCode: Herbarium:Algae <---
  3. catalogNumber: 642:A <---

Should we perhaps surround those parts with double quotes, to indicate that all the text inside belongs to that part of the triplet?
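Purely to illustrate the quoting idea (this is NOT an adopted GBIF or DwC convention, just a sketch of the question above), CSV-style quoting would keep such values parseable:

```python
import csv
import io

def make_triplet(ic, cc, cn):
    # CSV-style quoting with ":" as the delimiter: any part containing ":"
    # gets wrapped in double quotes. Illustrative only.
    buf = io.StringIO()
    csv.writer(buf, delimiter=":", quoting=csv.QUOTE_MINIMAL).writerow([ic, cc, cn])
    return buf.getvalue().strip()

print(make_triplet("XYZ", "Herbarium:Algae", "642:A"))
# -> XYZ:"Herbarium:Algae":"642:A"
```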

I'd suggest updating the blog to show some examples for tricky situations, so that most people who provide this information start to use this field in a consistent way.

Thanks!