CatalogueOfLife / data

Repository for COL content

Mangled distributions from WoRMS #483

Open dhobern opened 1 year ago

dhobern commented 1 year ago

See the distributions here: https://www.catalogueoflife.org/data/taxon/5W2WT

mrgid:marineregions.org , mrgid:mrgid , mrgid:48182 , mrgid:marineregions.org , mrgid:mrgid , mrgid:48140 , mrgid:marineregions.org , mrgid:mrgid , mrgid:48142 , mrgid:marineregions.org , mrgid:mrgid , mrgid:48151 , mrgid:marineregions.org , mrgid:mrgid , mrgid:48214 , mrgid:marineregions.org , mrgid:mrgid , mrgid:48213 , mrgid:marineregions.org , mrgid:mrgid , mrgid:48115 , mrgid:marineregions.org , mrgid:mrgid , mrgid:48122 , mrgid:marineregions.org , mrgid:mrgid , mrgid:48457 , mrgid:marineregions.org , mrgid:mrgid , mrgid:48366

This seems to be a set of triples with subjects, predicates and objects all concatenated into a single comma-separated string. We should work with WoRMS to get these flowing through in a better format, probably as just the objects: country names or ISO codes should be plausible in this case, but at the very least URIs like http://marineregions.org/mrgid/48182.
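As a rough illustration of what "just the objects" could look like, here is a minimal sketch that pulls the numeric MRGIDs back out of the mangled display string and rebuilds one URI per area. The function name `recover_mrgid_uris` is hypothetical, and the sketch assumes that only the `mrgid:<digits>` tokens carry information. It is not how COL or WoRMS process the data.

```python
import re

# Assumption: in the mangled string, only tokens of the form "mrgid:<digits>"
# are meaningful; "mrgid:marineregions.org" and "mrgid:mrgid" are URL fragments.
MRGID_TOKEN = re.compile(r"mrgid:(\d+)")


def recover_mrgid_uris(mangled: str) -> list[str]:
    """Return one marineregions.org URI per numeric MRGID, in source order."""
    return [
        f"http://marineregions.org/mrgid/{mrgid}"
        for mrgid in MRGID_TOKEN.findall(mangled)
    ]


# First two areas from the distribution string quoted above.
sample = (
    "mrgid:marineregions.org , mrgid:mrgid , mrgid:48182 , "
    "mrgid:marineregions.org , mrgid:mrgid , mrgid:48140"
)
uris = recover_mrgid_uris(sample)
```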

yroskov commented 1 year ago

Dear @bart-v (@gdower), could you please have a look at this?

mdoering commented 1 year ago

A ColDP distribution should consist of an identifier (areaID), a controlled gazetteer value that gives the identifier its context, and optionally a human-readable label (area). See the discussion at https://github.com/CatalogueOfLife/coldp/issues/40. The ColDP docs don't seem to line up with this; I'll update them.

For example in the case above one record with:

gazetteer=mrgid
areaID=48142 

The values currently provided by WoRMS are not that far off; they just use URLs for areaID, which appears to be the problem here. The URL is interpreted as a concatenation of multiple values:

https://www.checklistbank.org/dataset/1130/verbatim/308476

col:area = Ireland
col:areaID = http://marineregions.org/mrgid/48213
col:taxonID = urn:lsid:marinespecies.org:taxname:875546
col:gazetteer = mrgid

mdoering commented 1 year ago

@bart-v the distribution looks fine to me, I'll make sure we interpret the URL value as a single value. Showing the area name is a bit more difficult though - we will need to track the entire MRGID enumeration in the backend like we do for TDWG, ISO and other codes.
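To make "interpret the URL value as a single value" concrete, here is a minimal sketch of one way to normalise an areaID that may arrive either as a bare MRGID or as a marineregions.org URL. The names `Area`, `MRGID_URL` and `parse_area_id` are hypothetical and the URL pattern is an assumption based on the example record above. This is not the actual ChecklistBank backend code.

```python
import re
from typing import NamedTuple


class Area(NamedTuple):
    gazetteer: str
    area_id: str


# Assumption: MRGID URLs look like http://marineregions.org/mrgid/<digits>,
# optionally with https, a "www." prefix, or a trailing slash.
MRGID_URL = re.compile(r"^https?://(?:www\.)?marineregions\.org/mrgid/(\d+)/?$")


def parse_area_id(gazetteer: str, raw: str) -> Area:
    """Treat the whole raw value as one identifier; never split it on ':' or '/'."""
    value = raw.strip()
    match = MRGID_URL.match(value)
    if gazetteer.lower() == "mrgid" and match:
        # Reduce the URL form to the bare numeric MRGID.
        return Area("mrgid", match.group(1))
    return Area(gazetteer.lower(), value)


# The verbatim record quoted above, and the bare form suggested earlier.
from_url = parse_area_id("mrgid", "http://marineregions.org/mrgid/48213")
from_bare = parse_area_id("mrgid", "48142")
```

Both forms end up as a single (gazetteer, areaID) pair, so the display layer has exactly one value to resolve to an area name.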

mdoering commented 1 year ago

@bart-v looking at MRGID it appears it actually refers to other standards like TDWG in this case: https://marineregions.org/gazetteer.php?p=details&id=48213

Relations: Part of Ireland (TDWG - level 3)
Has preferred alternative Ireland (Nation) [view hierarchy]

Is MRGID not a standard on its own but rather a managed collection of other standards as linked data for placenames? I am confused that there is a preferred alternative given.

mdoering commented 1 year ago

Reopening as the issue is addressed in the backend code, but not in the data. Potentially all WoRMS sources should be reimported and resynced now.

bart-v commented 1 year ago

This issue has been there for quite a while, but I always assumed what we send is just OK. Great to see that confirmed & thanks for looking into this now.

MarineRegions (MRGID) is indeed a mixture of multiple existing standards, plus a multitude of entries that are not part of any other standard at all: obviously and especially marine place names.

The "preferred alternative" is just there to indicate the preferred MRGID within the standard, not a link to an external standard.

Thus, I think COL should consider this as a separate standard, especially since we assign proper PIDs to it.

mdoering commented 1 year ago

Thanks @bart-v, could you explain a little more how MarineRegions works? If I understand correctly it is assembled from all these sources here: https://marineregions.org/sources.php

When/how often does it change? Are there distinct releases with versions? I seem to be able to download only individual sources, not the entire MarineRegions. Is it available somewhere, e.g. to look up a region name from an MRGID?

bart-v commented 1 year ago

MarineRegions is updated constantly, just like WoRMS. What is mentioned under the sources page is just a subset of the entries, i.e. the bulk, but not all of them.

For machines, we have multiple ways to access the data, including a Linked Data Event Streams (LDES) feed: https://www.marineregions.org/gazetteer.php?p=webservices

mdoering commented 1 year ago

Great. To seed a system before using LDES, would you use REST, or does LDES provide a simple way to access a "snapshot"? I haven't used LDES before; it looks useful.

bart-v commented 1 year ago

LDES will sync anything that is new for the client. If there is nothing, it will sync everything. So no REST needed.
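The behaviour described here can be pictured with a toy in-memory model. This is only a sketch of the idea (the client remembers what it has seen, so an empty client receives everything); it is not the LDES protocol or any real client library, and the `sync` function and the dict-based feed are assumptions made for illustration.

```python
def sync(feed: list[dict], seen: set[str]) -> list[dict]:
    """Return the feed members the client has not seen yet and record them."""
    new_members = [member for member in feed if member["id"] not in seen]
    seen.update(member["id"] for member in new_members)
    return new_members


# Toy feed using MRGIDs from this issue as member identifiers.
feed = [{"id": "48213"}, {"id": "48182"}]
client_state: set[str] = set()

first = sync(feed, client_state)   # empty client: receives every member
feed.append({"id": "48140"})       # a new member is published later
second = sync(feed, client_state)  # only the new member is delivered
```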