ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

4463 references like urn:catalog:CSULB:Aves:989 found, but expected reference like urn:catalog:CSULB:Bird:989 instead (Aves -> Bird) #5639

Closed jhpoelen closed 1 year ago

jhpoelen commented 1 year ago

Hi!

During an automated GloBI review, our friendly bots failed to resolve some references related to (suspected) collection CSULB:Bird .

For some reason, prefix CSULB:Aves was seen over 4k times, instead of expected CSULB:Bird .

I've attached a short list of examples below:

curl https://depot.globalbioticinteractions.org/reviews/globalbioticinteractions/vertnet/review.tsv\
 | grep "CSULB:Aves"\
 | cut -f6\
 | head

yielded:

found unresolved reference [urn:catalog:CSULB:Aves:1]
found unresolved reference [urn:catalog:CSULB:Aves:10]
found unresolved reference [urn:catalog:CSULB:Aves:1000]
found unresolved reference [urn:catalog:CSULB:Aves:1001]
found unresolved reference [urn:catalog:CSULB:Aves:1002]
found unresolved reference [urn:catalog:CSULB:Aves:1003]
found unresolved reference [urn:catalog:CSULB:Aves:1004]
found unresolved reference [urn:catalog:CSULB:Aves:1005]
found unresolved reference [urn:catalog:CSULB:Aves:1006]
found unresolved reference [urn:catalog:CSULB:Aves:1007]

for full list, see: https://depot.globalbioticinteractions.org/reviews/globalbioticinteractions/vertnet/review.tsv

and search for CSULB:Aves

Please let me know if you need more information to resolve this suspected typo.

jhpoelen commented 1 year ago

If you'd like, I can make a little rule saying that CSULB:Aves should be interpreted as CSULB:Bird, but then others might not be able to resolve the reference in the original data.
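A minimal sketch of what such a consumer-side rule could look like, assuming identifiers arrive one per line on stdin. The function name `normalize_csulb_urn` and the `EXAMPLE` collection are hypothetical and only for illustration; this is not how GloBI actually implements its rules.

```shell
# Hypothetical helper: rewrite the legacy collection code in a catalog URN.
# Reads identifiers on stdin, one per line; lines that do not start with
# the CSULB:Aves prefix pass through unchanged.
normalize_csulb_urn() {
  sed 's/^urn:catalog:CSULB:Aves:/urn:catalog:CSULB:Bird:/'
}

# Usage: the first identifier is rewritten, the second (made up) is left alone.
printf '%s\n' 'urn:catalog:CSULB:Aves:989' 'urn:catalog:EXAMPLE:Bird:12' \
  | normalize_csulb_urn
```

Because the rewrite happens only on the consumer side, the published data still carries `CSULB:Aves`, which is the drawback noted above.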

dustymc commented 1 year ago

@jhpoelen could you provide a link to an example? I can't find anything like this in Arctos and I'm not sure how to interpret the identifiers in your file.

jhpoelen commented 1 year ago

For sure:

here's an example extracted from

preston track "http://ipt.vertnet.org:8080/ipt/rss.do"\
 | preston dwc-stream\
 | grep "CSULB:Aves"\
 | head -n2\
 | jq .
{
  "http://www.w3.org/ns/prov#wasDerivedFrom": "line:zip:hash://sha256/4cbef2cdf2b7b96371b7727bf8bfc4f454addfa6c5c25a4c51f7f90cf0ad2522!/resourcerelationship.txt!/L2",
  "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": "http://rs.tdwg.org/dwc/terms/ResourceRelationship",
  "http://rs.tdwg.org/dwc/text/coreid": "http://arctos.database.museum/guid/CSULB:Bird:1?seid=5523838",
  "http://rs.tdwg.org/dwc/terms/resourceID": "http://arctos.database.museum/guid/CSULB:Bird:1?seid=5523838",
  "http://rs.tdwg.org/dwc/terms/relatedResourceID": "urn:catalog:CSULB:Aves:1",
  "http://rs.tdwg.org/dwc/terms/relationshipOfResource": "SameAs"
}
{
  "http://www.w3.org/ns/prov#wasDerivedFrom": "line:zip:hash://sha256/4cbef2cdf2b7b96371b7727bf8bfc4f454addfa6c5c25a4c51f7f90cf0ad2522!/resourcerelationship.txt!/L3",
  "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": "http://rs.tdwg.org/dwc/terms/ResourceRelationship",
  "http://rs.tdwg.org/dwc/text/coreid": "http://arctos.database.museum/guid/CSULB:Bird:3?seid=5524160",
  "http://rs.tdwg.org/dwc/terms/resourceID": "http://arctos.database.museum/guid/CSULB:Bird:3?seid=5524160",
  "http://rs.tdwg.org/dwc/terms/relatedResourceID": "urn:catalog:CSULB:Aves:3",
  "http://rs.tdwg.org/dwc/terms/relationshipOfResource": "SameAs"
}

Is this expected?

jhpoelen commented 1 year ago

where

preston alias\
 | grep "hash://sha256/4cbef2cdf2b7b96371b7727bf8bfc4f454addfa6c5c25a4c51"
<http://ipt.vertnet.org:8080/ipt/archive.do?r=csulb_bird> <http://purl.org/pav/hasVersion> <hash://sha256/4cbef2cdf2b7b96371b7727bf8bfc4f454addfa6c5c25a4c51f7f90cf0ad2522> <urn:uuid:e9c8fd3f-6c39-43a4-9d00-8ac18ae341a4> .

dustymc commented 1 year ago

OH, thanks!

I'm relatively sure that's part of a larger mess, maybe @Jegelewicz can further illuminate.

I'd like to tell you that Arctos identifiers either resolve or don't and trying to magick resolvable IDs out of things that look like they might be coerced into becoming something more would be silly and pointless, but that's not entirely true: https://github.com/ArctosDB/arctos/discussions/5310. I think in this case that is true, however - that URN is just a URN, not a URL and not anything from Arctos.

jhpoelen commented 1 year ago

@dustymc thanks for having a peek at this.

Now, I am curious about how these sameAs relationships came about, and how I should interpret them.

dbloom commented 1 year ago

@jhpoelen CSULB Birds was published some years ago independently of Arctos. At that time, we generated a DwC Triplet for occurrenceID. Aves was the collectionCode back then. Since then CSULB has become a member of Arctos. Arctos uses "bird" instead of "aves", plus they generate their own occurrenceIDs. When this collection was republished through Arctos we published with the Resource Relationship Extension to keep the breadcrumbs for data users so that they would know that CSULB:Bird:xxxxxxxx was once CSULB:Aves:xxxxxxxxxx. That is why you get the results you see and where the sameAs comes from. If you are getting double the number of records you might expect, then it may be that whatever resource you are using does not process the RR Extension or has not removed the old records with the old occurrenceIDs.
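As a rough illustration of processing the RR Extension, the sketch below looks up the current identifier for a legacy one by following SameAs rows. It assumes the relationships are available as tab-separated rows in the order resourceID, relatedResourceID, relationshipOfResource; that column order and both function names are assumptions for illustration, not the actual layout of resourcerelationship.txt. The sample rows reuse the two records quoted earlier in this thread.

```shell
# Sample rows modeled on the two records quoted earlier in this thread.
# Assumed (hypothetical) column order:
#   resourceID <TAB> relatedResourceID <TAB> relationshipOfResource
sample_relationships() {
  printf '%s\t%s\t%s\n' \
    'http://arctos.database.museum/guid/CSULB:Bird:1?seid=5523838' 'urn:catalog:CSULB:Aves:1' 'SameAs' \
    'http://arctos.database.museum/guid/CSULB:Bird:3?seid=5524160' 'urn:catalog:CSULB:Aves:3' 'SameAs'
}

# Print the current identifier for the legacy identifier given as $1,
# considering SameAs rows only. Exits non-zero when nothing matches.
resolve_legacy_id() {
  awk -F '\t' -v legacy="$1" '
    tolower($3) == "sameas" && $2 == legacy { print $1; found = 1; exit }
    END { exit !found }
  '
}

# Usage: map the old Aves URN to the Arctos record it now points to.
sample_relationships | resolve_legacy_id 'urn:catalog:CSULB:Aves:3'
```

A consumer that applies a lookup like this before counting records would treat the old and new identifiers as one record rather than two.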

jhpoelen commented 1 year ago

@dbloom thanks for sharing the history behind the CSULB:Birds vs. CSULB:Aves . Neat that you've encoded a change of address from the old to the new, and that you've left breadcrumbs for future generations.

So now, I'll be thinking about the following question in the days to come: What is the meaning of an identifier if the context in which it existed (in this case: in your memory) is implicit?

dustymc commented 1 year ago

done?

jhpoelen commented 1 year ago

@dustymc thanks for sharing the context around the CSULB:Bird vs CSULB:Aves URNs. Just opened a related issue https://github.com/globalbioticinteractions/globalbioticinteractions/issues/927 .