abubelinha opened 2 years ago
BTW, is this the correct GitHub/GBIF place to raise this question? I have seen clustering-related issues in both portal-feedback and pipelines.
Thanks @abubelinha - this makes a lot of sense (and this repository is a good place to record it)
The rules as described in the blog refer to identifier overlap, and otherCatalogNumber is included in that overlap. Currently, however, it requires other things (e.g. same species and location) to also align in order to trigger a match - this was simply a defensive approach to avoid too many false matches for those using e.g. running integers for IDs.
I'll run some queries shortly and try to summarize how it is used and how well we could rely on `otherCatalogNumber`. I think it would be preferable to use the IC:CC:CN approach rather than the GBIF dataset key to avoid this being too GBIF-specific.
This post serves to document some basic data research on the potential for otherCatalogNumbers to inform clusters.
Using the following tables as a basis:
-- Specimen records, with the code fields and taxonomy needed for matching
CREATE TABLE tim.cat STORED AS PARQUET AS
SELECT gbifid, datasetkey, institutionCode, collectionCode, catalogNumber, otherCatalogNumbers, kingdom, scientificName, acceptedScientificName, taxonKey, acceptedTaxonKey
FROM prod_h.occurrence
WHERE basisOfRecord='PRESERVED_SPECIMEN';

-- One row per (record, otherCatalogNumber) pair
CREATE TABLE tim.other STORED AS PARQUET AS
SELECT gbifid, datasetkey, otherCatalogNumber
FROM tim.cat LATERAL VIEW explode(otherCatalogNumbers) o AS otherCatalogNumber
WHERE size(otherCatalogNumbers)>0;
There are 30,073,790 records from 1003 datasets listing 30,337,534 otherCatalogNumbers.
SELECT count(DISTINCT gbifID), count(DISTINCT datasetKey), count(*) from tim.other;
Looking at the format of the identifiers, we anticipate some delimitation of institutionCode, collectionCode, catalogNumber.
-- Distribution of the number of colons per otherCatalogNumber value
SELECT t.colonCount, count(*) AS freq
FROM
  (SELECT LENGTH(regexp_replace(otherCatalogNumber,'[^:]','')) AS colonCount
   FROM tim.other) t
GROUP BY t.colonCount ORDER BY t.colonCount;
Delimiter | Delimiters per value | Number of records | Notes |
---|---|---|---|
: | 0 | 19009829 | Expected |
: | 1 | 3585636 | institutionCode : catalogNumber |
: | 2 | 4426228 | institutionCode : collectionCode : catalogNumber |
: | 3 | 93736 | |
: | 4 | 1656576 | e.g. LSID, urn:catalog:RBINS:IG:17225 (48 datasets using this format) |
: | 5 | 1561015 | LSID with version (and ; delimited identifiers - publisher has been notified) |
: | 6 | 34 | |
: | 8 | 4400 | |
: | 9 | 1 | |
: | 10 | 28 | |
: | 12 | 21 | |
: | 14 | 29 | |
: | 20 | 1 | |
- | 0 | 30160859 | Expected |
- | 1 | 170952 | institutionCode - catalogNumber |
- | 2 | 4438 | institutionCode - collectionCode - catalogNumber |
- | 3 | 1201 | |
- | 4 | 22 | |
- | 5 | 59 | |
- | 7 | 2 | |
- | 8 | 1 |
Next up, I'll look at the performance of some of the joins.
(this needs checking, but leaving it here as I pause this for a couple of days)
Assuming the catalogNumber itself is tokenized (e.g. AB12 or MAMM:AB12) we can get a sense of how many records might link if we materialize the ways catalogNumbers are commonly used:
-- Count records whose otherCatalogNumber matches a materialized code
-- from another dataset, for the common concatenation styles
SELECT count(DISTINCT m.gbifID) AS records, count(DISTINCT m.datasetKey) AS datasets
FROM tim.other o
JOIN (
  SELECT t.gbifID, t.datasetKey, code FROM
    (SELECT
      gbifID, datasetKey,
      array(
        concat_ws(':', institutionCode, catalogNumber),
        concat_ws(':', institutionCode, collectionCode, catalogNumber),
        concat_ws('-', institutionCode, catalogNumber),
        concat_ws('-', institutionCode, collectionCode, catalogNumber),
        concat_ws(':', 'urn:catalog', institutionCode, collectionCode, catalogNumber)
      ) AS codes
    FROM tim.cat) t LATERAL VIEW explode(codes) x AS code) m
ON m.code=o.otherCatalogNumber
WHERE o.datasetKey != m.datasetKey AND o.gbifID<m.gbifID;
> 3,439,777 records across 402 datasets
I've been exploring algorithms here, and want to leave a note on why this gets difficult. The algorithm I currently have ignores the delimiters (,:/ etc.) but matches otherCatalogNumber on record 1 with IC:CN and IC:CC:CN on record 2.
UCLA W57-41 and UCLA : W57-41 is a nice example of where otherCatalogNumber alone can be matched to seemingly good effect(?), detecting a good link of a reidentified record.
However, TNHC2799 and TNHC:2799 illustrate where this can go wrong (I think).
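To make the failure mode concrete, here is a minimal sketch of such a delimiter-insensitive comparison (my own illustration, not the production code):

```python
import re

def normalize(code: str) -> str:
    # Strip delimiters (,:/- etc.), whitespace and casing so that
    # "UCLA : W57-41" and "UCLA W57-41" compare equal.
    return re.sub(r"[^A-Za-z0-9]", "", code).upper()

# Good link: the same specimen, with differently delimited codes.
assert normalize("UCLA W57-41") == normalize("UCLA : W57-41")

# Risky link: a bare catalogNumber "TNHC2799" collapses to the same key
# as the institution-qualified "TNHC:2799", so any record whose code
# merely happens to embed those characters would also collide.
assert normalize("TNHC2799") == normalize("TNHC:2799")
```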
There is so much variation in how codes are constructed that I think we need more than catalogNumber alone to make the match. Perhaps a locality, or a date, or a person in addition to the code overlap?
Alternatively, we could explore what a stricter match using only the IC:CC:CN version (no IC:CN) might yield. One would presume a triplet overlap of codes is a strong signal of a relationship - perhaps adding a kingdom scope (although presumably there are cases that would cross kingdoms related to e.g. host relationships).
@timrobertson100 Yes I think you are right, the catalogue number overlap alone isn't enough to infer the link.
If possible at all, I think excluding the matches that have different recordedBy would help clean up some false positives, like we already do for matches that have conflicting dates or locations.
Otherwise, requiring an identifier overlap plus the same recorder or date is a good idea. The locality might be really difficult to compare, but perhaps the state province when filled? Would that be possible?
Sorry, but I feel I asked for something a little bit different from what you are trying to do.
Probably my misunderstanding of how the clustering works led me to ask for a feature which is not easy to implement here.
These clusters look to me like an example of computer intelligence working to find groups of occurrences that have a high probability of being duplicates of each other.
But I want to stress the "human-curated" words in the title. What I was asking for is giving humans (database curators) a chance to "force" items into a cluster, no matter what computers say about it. Let curators do the work of finding out whether two GBIF occurrences are the same or not, and let them have a way of doing it, even if this may lead to some wrong links (well, let's see how many of them are finally wrong).
That's the reason I suggested using dataset_key.
I think it would be preferable to use the IC:CC:CN approach rather than the GBIF dataset key to avoid this being too GBIF-specific.
I agree @timrobertson100, this is absolutely too GBIF-specific. But who cares? This clustering functionality is a GBIF-specific test service. And as you are labelling clusters according to which matching algorithm was used to produce them, why not have one label which prompts "human-forced-cluster" (trust it at your own risk)?
As a human curator, taking care of generating good links is my own problem: I should take note of the correct GBIF dataset_keys of my interest, and also note the way each dataset formats its catalogNumbers when publishing to GBIF.
So my DwC-generating script can take care of prepending a "herpetology-" string to all otherCatalogNumber values related to a certain dataset, adding a "-1" at the end of some others, and so on.
So, even if you develop an algorithm capable of creating clusters by using catalogNumber in combination with other fields ... I keep my question about the other possibility, at the risk of this being labeled off-topic here (no idea if this is technically compatible with your current clustering idea).
Thanks again for all your hard work on this.
Thanks @abubelinha. Just confirming you aren't being misunderstood - the investigation really has been about whether we could issue guidelines on the use of otherCatalogNumber as a publisher-declared link. There really isn't any intelligence going on, just an exploration of how people have populated that field so far, and whether guidelines that use the institution, collection and catalog codes/numbers would be possible now.
One worry with using datasetKey is that it is likely less stable than codes over time. People do repackage their datasets from time to time, where the collection code may - or may not - be more stable.
Another idea we could consider is a publisher providing links to other records using relatedResourceID, which would allow you to strongly link directly to another record in GBIF. That would be at the mercy of GBIF record ID stability, but we now have tombstone pages (example) and have implemented checks to improve record ID stability that make this more attractive. Have you any thoughts on that approach, please?
There is also the associatedOccurrences field.
Thanks for the info @timrobertson100 & @ManonGros .
Which identifier should we use in relatedResourceID and associatedOccurrences for this purpose? occurrenceID, key, gbifID ... I guess occurrenceID, but I am not sure.
At first glance, I have the impression that using them might be much more difficult for the average user, because the use case I propose is like this:
A collection curator knows that some specimens are duplicates of specimens in another remote collection (the foreign IC+CC+CN is printed on the local label and stored in the local database). The number of these remote collections grows with time, but it tends to stay more or less limited (a given collection tends to make frequent exchanges with the same institutions as in previous years). The number of linkable specimens grows much more quickly (each of these exchanges could contain dozens or hundreds of duplicate numbered specimens, all coming from the same collection).
Now I am just guessing how linking them could be done:
Correct me if I am wrong. For using relatedResourceID and associatedOccurrences (option A), the db-curator needs to:

1. Search GBIF for each foreign specimen to find its identifier (searching by e.g. catalogNumber).
2. Map those identifiers to relatedResourceID or associatedOccurrences when producing the next IPT DwC mapping.

Instead (option B), by trusting IC+CC+CN (or dataset_key+CN):

1. Store the foreign codes in our database catalogNumber fields, plus a static value for their IC+CC (or dataset_key, whatever you decide is better).
2. Concatenate them into dwc:otherCatalogNumbers when publishing to IPT.

A1 is much more difficult and time consuming than B1, I think. Also, for doing A1 searches, we will need to try combinations of IC+CC+CN anyway (or dataset_key+CN). Also, how can we be sure that A1-2 do not need to be repeated? (the next time we republish to our IPT, a certain percentage of those foreign occurrence IDs might have changed ... see below).
One worry with using datasetKey is that it is likely less stable than codes over time. People do repackage their datasets from time to time, where the collection code may - or may not - be more stable.
I don't get what you mean by that repackaging. Is that something that would change the datasetKey more easily than the identifiers of each occurrence?
Anyway, my use case already assumes that steps B1-2 should be repeated from time to time, so those datasetKey changes shouldn't be an issue.
And yes, it's great that GBIF occurrenceIDs tend to be more and more stable (thanks for the links!).
But in the end, that's just statistics: let's say in 2020 there were changes in 5% of occurrenceIDs, and in 2021 it was only 2%. There will always be a small chance of change for any occurrenceID, I think. Wouldn't that imply that steps A1-2 need to be repeated as well from time to time?
It would be good to know the opinion of other db curators. Maybe I am wrong in my assumptions about what would be easier to do in other institutions.
Sorry for being too verbose. After reading myself again, I re-edited some of my previous explanations for more context and clarity.
Thanks @abubelinha
I don't get what you mean by that repackaging
By dataset repackaging, I mean the situation that sometimes people replace their datasets completely after e.g. choosing to merge or split collections or how they share them.
Instead (option B), by trusting IC+CC+CN (or dataset_key+CN):
Option B is certainly the preferred approach.
I've tweaked what I ran before to be stricter, linking only when otherCatalogNumber matches the combination of IC + CC + CN (ignoring how they are concatenated using /:,_ etc., whitespace and casing).
This yielded only 1139 additional records over the current system (which is good), and looking at a selection I can see they appear reasonable (e.g. this and this: co-collected Fungi of different species (same branch perhaps), linked explicitly by the recorder).
@ManonGros: I think it would therefore be reasonable to add a rule that allows a link to be made for specimens having only otherCatalogNumber overlap, provided it is the combination of all three of the IC, CC, CN.
This would be standards-compliant and would provide @abubelinha with what he really seeks - the ability to assert a link to other records using the codes on the label. Formatting of those codes would be the responsibility of the publisher making the links, and we can document that it needs to be a combination of all 3 parts, encouraging people to update GRSciColl and clean their data.
I've attached the new relationships - Does this seem reasonable to you please @ManonGros?
@abubelinha - it would mean you populate otherCatalogNumbers with a code in the format IC:CC:CN.
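As I understand the stricter rule, a minimal sketch would be (the function names are mine, not the pipeline's):

```python
import re

def strict_key(institution: str, collection: str, catalog: str) -> str:
    # Build one comparison key from the full triplet; delimiters,
    # whitespace and casing are irrelevant, but all three parts must exist.
    parts = (institution, collection, catalog)
    if not all(p and p.strip() for p in parts):
        raise ValueError("IC, CC and CN are all required for a strict match")
    return "".join(re.sub(r"[^A-Za-z0-9]", "", p).upper() for p in parts)

def matches(other_catalog_number: str, ic: str, cc: str, cn: str) -> bool:
    # An otherCatalogNumbers entry such as "FMNH:Mammal:1234" links only
    # if it collapses to the same key as the target record's triplet.
    normalized = re.sub(r"[^A-Za-z0-9]", "", other_catalog_number).upper()
    return normalized == strict_key(ic, cc, cn)

assert matches("FMNH:Mammal:1234", "FMNH", "Mammal", "1234")
assert not matches("FMNH:1234", "FMNH", "Mammal", "1234")  # IC:CN alone no longer links
```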
Thanks a lot @timrobertson100. We will try this asap. I have a couple of questions though:
I had not realized till now that GBIF only compares occurrences from different datasets when clustering.
I wonder if this could be bypassed JUST for the particular case of otherCatalogNumber-linked specimens.
I mean, (Q1) if I use otherCatalogNumber to point to an occurrence within the same dataset, would it be possible to cluster them as well?
Not sure how useful this could become, but at first glance it would be very interesting to know about these clusters: when using specimens from another institution (or before requesting a loan), knowing that some of them are repetitions of the same one would help a lot to exclude them from the study/request. In other cases, it could be useful to know about them all for inclusion in the study (i.e., if one of them is being used for typifying a new taxonomic name, this would help detect and label them all).
Also, this could reveal more links between institutions which would otherwise remain invisible. As an elaborated example, imagine 3 collections (datasets A, B, C) with preserved and numbered specimens.
Over the years, collection A has received yearly exchanged specimens from B and C.
Based on info in the A labels, curators have created otherCatalogNumber links like these:
institutionCode | collectionCode | catalogNumber | otherCatalogNumber | notes |
---|---|---|---|---|
A | A | 111 | **B:B:5555** | link created from info in labels received from B |
A | A | 222 | | specimen received from B or C, but info in labels did not reveal any links |
A | A | 333 | **C:C:6666** | link created from info in labels received from C |
A | A | 444 | *A:A:111 \| A:A:222 \| A:A:333* | links created much later ... when I realized these were all duplicates of the same collection |
When organisation A began creating links, they just paid attention to numbers on labels of duplicate specimens received from B and C, which led to links between organisations (in bold). This info alone would generate only two clusters with two elements each.
But other yearly exchanges received from B and C did not contain their numbers on the labels (so they were stored at A without any info permitting links to the original B & C specimens).
Later on, when curating my A dataset, I discovered by my own means some internal coincidences between A specimens (i.e. collector names & numbers, location descriptions, dates or whatever).
This led me to generate internal links (in italics).
If GBIF takes into account these internal links, this would generate a new cluster:
Looking at clusters 1, 2, 3 altogether, it is obvious that this is all just one big cluster of six elements:
So, assuming B and C curators are not generating this kind of links, and the GBIF clustering system is not smart enough to catch these relationships (i.e., collector and place names typed differently in each database) ... paying attention to A's internal links would make it possible to link B:B:5555 and C:C:6666 (so transferring knowledge between these two institutions as well).
Apart from my interest in internal links, this leads me to a different related question. (Q2) Does GBIF plan to somehow reanalyse occurrences present in clusters originated in different ways, and check whether that could lead to bigger clusters?
Knowing that, GBIF could create:
Thanks
I've attached the new relationships - Does this seem reasonable to you please @ManonGros?
Thanks @timrobertson100, I checked some more and it really looks good!
Thanks @ManonGros - I will tidy up the code and merge it in the coming days.
We will try this asap
We haven't implemented it yet @abubelinha, so please hold off on testing for now. You can prepare data knowing this is coming, though.
to point to an occurrence within the same dataset, would it be possible to cluster them as well?
Yes, but let's track that in a separate issue, please
This has just been run in production for the first time.
Two things remain to close this:
I wonder if it makes any sense using gbif-uat.org as a testing environment for otherCatalogNumbers links.
That would be useless since clusters are only run on the production portal. Correct?
BTW, how frequently are clusters generated? Is this planned/documented somewhere? (so we can plan to update datasets before those dates)
Thanks!
Thanks, @abubelinha. I'm afraid our UAT environment is not equipped to do clustering (it's very resource intensive).
Clustering at the moment runs periodically, but not on a schedule while it is an experimental feature. We expect it will be run ~weekly in the future, possibly more frequently.
@timrobertson100 the blogpost/documentation should now be up to date
Thanks, @abubelinha. I'm afraid our UAT environment is not equipped to do clustering (it's very resource intensive).
This is a bit off-topic but not that much: I am interested in using the UAT environment a lot for testing. And in this human-curated clusters context, I am also interested in somehow linking the items in those clusters myself (so they are linked even if the GBIF clustering system does not exist or stops working).
By linking myself I mean actually introducing a set of clickable references in each occurrence, so a user who is viewing that occurrence can click and navigate to the other occurrence URLs (and these links should also be available through the API and downloaded data). What would be the recommended way of doing this when creating our dataset and mapping our data in the IPT? (maybe I need to use a certain DwC extension)
So my question could be addressed in different ways: e.g. we "populate otherCatalogNumbers with a code in the format IC:CC:CN" ... and as far as the triplet is unique, GBIF can interpret this info to find the gbifID and link the occurrences in user/API interfaces.
I know this probably deserves a new issue, but I doubt nobody else has suggested such a thing before (I searched github issues but still couldn't find it). If this is not a duplicate question, where should I open it? pipelines? ipt?
For now I post the idea here, as this is doubly interesting for me in the current clustering issue scenario:
Thanks a lot @timrobertson100 @ManonGros
Hi again @ManonGros @timrobertson100
A side question related to this issue.
Using the occurrence search API with the is_in_cluster=true option, today I discovered ~1500 items of my institution forming part of clusters.
But when I tried to get the information about the related occurrences, 5 of those clusters had disappeared (empty JSON):
But if you remove the experimental/related part of the URLs, you can still see the "isInCluster": true value.
I suppose the related elements changed in some way so the cluster conditions are not met anymore, but the isInCluster value takes a while to be updated. Right?
That's fine. But is there any way to check the past status of a cluster? I am interested in tracking those changes to detect whether our dataset info should be changed accordingly. Looking at this past comment, it seems to be possible:
(it) was never in this cluster which I can confirm by looking at the backend database that holds the links.
Is there any way I can do it myself? (recover the info about which occurrences were previously in a now-disappeared cluster). I mean, by using whatever JSON API call or other type of query I can launch from Python.
Thanks a lot in advance
I'm very sorry for the slow reply @abubelinha
I suppose the related elements changed in some way so the cluster conditions are not met anymore, but the isInCluster value takes a while to be updated. Right?
That's correct. It's a little convoluted, but we run a batch job daily (normally) to relate similar records and store links.
When a dataset is reprocessed, it looks in the "links" table and determines if the record has been linked before. If so, isInCluster is set to true in the search index (we're aware this has lots of room for improvement). I'll trigger a reprocessing of your dataset now to update the search index.
We don't hold history of this table so it's not easy to determine what those links may have once been.
I suspect(?) that either the target institution or collection codes you have used may have changed for records you curated, or there were records in there that clustered algorithmically (e.g. to an observation or so) which have changed - possibly due to recent changes to eventDate processing to better support date ranges.
I know this doesn't help find the broken links but hopefully it explains what has happened.
Thanks for explaining the reasons @timrobertson100
We are interested in exploring clusters related to our datasets and how those relations change over time. I'd like to track the related occurrences in my own table, like this (sorted by columns 1, 2, 3 ASC):
date | followed_gbifId | related_gbifId |
---|---|---|
2024.03.01 | 12345 | 888888 |
2024.03.01 | 12345 | 987654 |
2024.03.01 | 32123 | 777777 |
2024.04.01 | 12345 | 888888 |
2024.04.01 | 32123 | 777777 |
... | ... | ... |
So we can query this table to select those occurrences which were related at some point but are not anymore (i.e. 12345 & 987654) ... and check the reasons (if the taxonomic identification has changed in 987654, we might want to review our own identification in 12345).
I would do the following (monthly or so):

1. Get the list of our gbifIDs currently flagged with is_in_cluster=true.
2. For each of them, query /occurrence/{our_gbifId}/experimental/related.
3. Append the (date, followed_gbifId, related_gbifId) rows to my table (see the sketch below).
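A sketch of those three steps against the public API (paging and error handling omitted; the shape of the experimental endpoint's response, in particular the relatedOccurrences field names, is my assumption and worth verifying):

```python
import datetime
import requests

API = "https://api.gbif.org/v1"
DATASET_KEY = "1c334170-7ed1-11df-8c4a-0800200c9a66"  # dataset key used elsewhere in this thread
today = datetime.date.today().isoformat()
rows = []

# Step 1: our occurrences currently flagged as clustered (first page only here).
search = requests.get(f"{API}/occurrence/search",
                      params={"datasetKey": DATASET_KEY,
                              "isInCluster": "true", "limit": 300}).json()
gbif_ids = [rec["key"] for rec in search["results"]]

# Step 2: the related occurrences for each of them (experimental endpoint).
for gbif_id in gbif_ids:
    related = requests.get(f"{API}/occurrence/{gbif_id}/experimental/related").json()
    for rel in related.get("relatedOccurrences", []):
        occ = rel.get("occurrence", {})
        # Field name for the related record's ID should be checked against a live response.
        rows.append((today, gbif_id, occ.get("gbifId") or occ.get("key")))

# Step 3: append today's snapshot to the tracking table (kept in memory here).
print(f"{len(rows)} relations recorded on {today}")
```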
Step 2 means running thousands of API queries and downloading much more info than I need.
Is there a more direct way to get the same info using the experimental API SQL downloads?
I am not sure whether the is_in_cluster or related-occurrences info is available there.
Also, I don't know if the field syntax changes somehow compared to the regular API (i.e. datasetKey vs dataset_key).
Being experimental, would our regular download API authentication credentials be enough to test it, or do we need to ask helpdesk for permission?
Thanks!
Thanks again
I suggest a step beforehand to force a refresh of your dataset and ensure the "is_in_cluster" flag is up to date.
You would do this by 1) recreating your DwC-A or simply changing the timestamp on the file on your webserver (on Linux you'd do a touch my-file.dwca, but I recall you are on Windows?) and then 2) issuing an empty authenticated POST to https://api.gbif.org/v1/dataset/1c334170-7ed1-11df-8c4a-0800200c9a66/crawl. GBIF crawlers will notice the timestamp is newer and that will force a full reprocessing of your dataset.
Give some time for that process to complete (e.g. 1hr) and then proceed with your 1, 2, 3 steps.
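For reference, the empty authenticated POST in step 2 could be issued from Python like this (assuming HTTP Basic auth with a GBIF account that has rights over the dataset; credentials are placeholders):

```python
import requests

# Empty POST to the crawl endpoint; GBIF then schedules a recrawl of the dataset.
resp = requests.post(
    "https://api.gbif.org/v1/dataset/1c334170-7ed1-11df-8c4a-0800200c9a66/crawl",
    auth=("your_gbif_username", "your_password"),
)
resp.raise_for_status()  # any 2xx response means the crawl was accepted
```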
I'm afraid the SQL API would only allow you to achieve 1) at the moment, and given it's not a huge dataset, I think your approach is likely the better one. The SQL API would allow something like SELECT gbifID FROM occurrence WHERE datasetKey=... AND isInCluster IS TRUE (I haven't verified the syntax). You'd still need to issue the 1500 individual calls, I'm afraid, for the time being.
As an aside, you can see from your ingestion history that we notice that the dataset rarely changes. I don't know if this is as you would expect?
We'll help in any way we can
Thanks again @timrobertson100 for the explanations.
I guess you assume we republish our dataset(s) frequently, and that I want to catch possible new clusters and detect those which have changed. Yes indeed, but the first part of that assumption is not yet true (as you noticed from the ingestion history: that's right).
So why am I bothering with this right now? I was interested in catching other datasets' updates which might cause cluster changes. Despite our datasets not being frequently updated, some of the related occurrences may change anyway (for example when we send duplicate specimens to another institution and they update their datasets, which may generate new related occurrences). Also, some previously-related occurrences may change unexpectedly, as I reported above: 5 out of 1500 clusters did not "disappear" due to any changes on our side. Those are my main concern.
That's why I planned to do this monthly.
Of course, when we republish our datasets I'll wait for crawling before trying to catch cluster changes and update my table, as you suggest. I expect the 5/1500 proportion to be higher in that case.
So I understand there is no copy of the "cluster-relations tables" which we can query from the SQL API.
I was interested in that approach with another SELECT in mind, based not on a datasetKey but on a long list of gbifIDs.
That way I could track not only our own datasets' occurrences, but also those which were once upon a time in the right column of my table above (but not anymore): if they were once in the same cluster, there is a big chance that they should still be, and I still want to track their taxonomic identification changes. A good example is when some of the related items change their georeference accuracy:
As for the coordinates difference, I am afraid all occurrences in the cluster have been rounded to 0.01 lat/lon degrees precision. The cluster-excluded observation hasn't been rounded, which explains the discrepancy.
Thanks - I had suspected something like that had happened and your record actually had the more accurate georeference. I chose the limits of 200m and 2km to accommodate 3 and 4 decimal place rounding globally, but here we are at 2 decimal places.
Excellent, thanks.
So I understand there is no copy of the "cluster-relations tables" which we can query from the SQL API.
That's right, at least today, and in honesty I don't imagine it'll be there in the short term.
As things develop, it would be really interesting for us to learn how much useful information you gain from the clusters - e.g. a new identification you can apply or so. It's one of those things we imagine might be happening, but never really know...
We are testing the SQL API and I am a bit confused about how to filter by isInCluster field values.
Looks like text case doesn't matter: the query validation system converts isInCluster to occurrence.isincluster.
But I am not sure about the data type: the documentation says both gbifid and isincluster are string data types. Is that a mistake? (I would expect numeric and boolean):
Column name | Data type | Nullable | Definition |
---|---|---|---|
gbifid | String | No | Unique GBIF key for the occurrence.We aim to keep these keys stable, but this is not possible in every case. |
isincluster | String | Yes | Experimental. Whether the occurrence belongs to a machine-calculated cluster of probable duplicate occurrences. |
That said, yesterday I tried several ways, but none of them seems to be working (I waited a few hours, and they either failed or didn't end). I have now cancelled and re-run them again:
I am not sure whether using TRUE/true (for both booleans and strings) should make a difference (but all of them had previously passed the SQL validation system).
Hi,
Thanks for pointing out the problem with the documentation.
It seems "IS TRUE" isn't working correctly, but the other two forms should work. = true
is more correct, the = 'true'
one is converting the 'true'
string to a boolean.
Our cluster was busy yesterday processing the updated eBird dataset, and there's currently a bug giving SQL downloads a low priority. These downloads will often be slower than the older predicate-based downloads, but I expect the two that are currently running to complete.
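For reference, a sketch of submitting such a query through the SQL download API from Python (the endpoint and payload shape are as I understand the SQL download documentation; the credentials, notification address and dataset key are placeholders):

```python
import requests

payload = {
    "sendNotification": True,
    "notificationAddresses": ["curator@example.org"],  # placeholder address
    "format": "SQL_TSV_ZIP",
    # The '= true' form, which should work; 'IS TRUE' currently does not.
    "sql": ("SELECT gbifid FROM occurrence "
            "WHERE datasetkey = '1c334170-7ed1-11df-8c4a-0800200c9a66' "
            "AND isincluster = true"),
}
resp = requests.post("https://api.gbif.org/v1/occurrence/download/request",
                     json=payload, auth=("your_gbif_username", "your_password"))
print(resp.text)  # the download key, which you then poll until the file is ready
```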
Thanks for explaining @MattBlissett
Could any of you confirm whether the UAT-non-clustering situation has changed?
Thanks, @abubelinha. I'm afraid our UAT environment is not equipped to do clustering (it's very resource intensive).
I think the UAT-occurrence data is showing clusters now: https://www.gbif-uat.org/occurrence/3083663793/cluster
BTW, regarding SQL downloads, I'd suggest always including the UAT system by default (when helpdesk gives access to interested users), so we avoid having those requests run on the production site just for testing purposes.
Could any of you confirm whether https://github.com/gbif/pipelines/issues/781#issuecomment-1293431220 has changed?
It has - we do now cluster in UAT, but note that UAT is now much smaller in data volume than it once was so it will detect few matches and give results for whatever is in the test environment at that time. Please be aware that clustering is run daily in UAT, but that might be reduced at busy times.
Thanks!
Clustering in UAT is great, so we can test the usage of otherCatalogNumbers to force human-curated clustering.
Is own-dataset clustering already possible using that otherCatalogNumbers approach?
Is https://github.com/gbif/pipelines/issues/781#issuecomment-1257766417 already possible using that otherCatalognumber way?
Sorry, it is not at the moment
OK. I hope you didn't finally decide it was not possible:
to point to an occurrence within the same dataset, would it be possible to cluster them as well?
Yes, but let's track that in a separate issue, please
I suspect this was finally forgotten. Should I open a new issue, or had you already created one about that? EDIT: I opened https://github.com/gbif/pipelines/issues/1056 just in case.
As for the currently working inter-dataset clustering, I see other issues that don't mention spaces before and after vertical bars, but I understand they should be used to separate multiple catalogue-number triplets, as per the DwC otherCatalogNumbers documentation:
'institutionCode:collectionCode:catalogNumber | institutionCode:collectionCode:catalogNumber | ...'
EDIT: as you suggested, I would use ":" as the internal separator character between the 3 triplet parts.
But what if we find some providers already using ":" inside any of those 3 parts? For example: institutionCode XYZ, collectionCode Herbarium:Algae (contains ":"), catalogNumber 642:A (contains ":").
Should we perhaps surround those parts with double quotes to mark that all the text inside belongs to that part of the triplet?
XYZ:"Herbarium:Algae":"642:A" | USCBG:"Mus:alg":3824.b | ...
I'd suggest updating the blog and showing some examples for tricky situations, so most people who provide this information start using this field in a consistent way.
Thanks !
I come from portal-feedback#3973
I was suggesting a pure dataset+catalogueNumber combination as a way to permit human-curated clusters. As long as we follow strict rules to fill in those DwC field values with data taken from GBIF, I think those links could be trustable (perhaps the catalogueNumber part may also be standardised when clustering, as suggested in #777).
i.e., we could use triplets of institutionCode:collectionCode:catalogNumber to fill in values in dwc:otherCatalogNumbers, like this: SANT:SANT:44553
GBIF just needs to find and group the occurrences which match their dwc:otherCatalogNumbers values in order to put them in a given cluster (together with the occurrence pointed to by those values).
In theory, it might happen that different institution/collection pairs share identical codes, although I don't think it will really happen. But if that's an issue, we could just substitute the first two parts of the triplet (the institutionCode:collectionCode pair) with the dataset_key provided by GBIF. Like this: dwc:otherCatalogNumbers = 1c334170-7ed1-11df-8c4a-0800200c9a66:44553
Of course that means data curators must first check GBIF in order to know the dataset_key values for the collections they want to link. But that is trivial using the API, and only needed once per collection (as I understand these GBIF dataset keys are pretty stable). Once we know them, it's very easy to populate our dwc:otherCatalogNumbers with all the info we have about specimens we have exchanged with other institutions (or taken on loan from them, if our GBIF dataset is related to a scientific paper which relies on specimens examined from different collections).
These clusters would work no matter how those different institutions have filled in locations ("Spain" vs "España"), collectors ("T.Robertson", "Robertson, T.") or even taxonomic names (which may have been corrected in their databases, or in my data paper ... so all datasets in the cluster can benefit from knowing about each other).
I must say all this is just an idea, because my institution is not filling in dwc:otherCatalogNumbers yet - precisely because of that uncertainty about the best way to construct them (triplets? pairs?). Look at the examples in the dwc:otherCatalogNumbers definition: FMNH:Mammal:1234, NPS YELLO6778 | MBG 33424
As you can see, the 1st example is a triplet separated by ":", whereas the 2nd and 3rd are pairs separated by spaces. Not to mention that the 1st and 2nd are not separated by "|" (the suggested separator). That's not a good example of a standardized way of doing it.
As soon as GBIF clarifies a way to fill this in that permits reliable clusters, we would indeed start doing it. The good thing is that we can concatenate as many values as needed (no need to substitute current ones). Perhaps GBIF could accept using dataset_key+catalogNumber as an additional pair for that field? So clusters might be based only on values which use that kind of codification. I am assuming that a GBIF dataset_key may be easily recognized by regex, and that we use a known separator ("+" or whatever) before the catalogNumber part.
Thanks a lot @abubelinha