Closed: rdmpage closed this issue 2 years ago
referenceID or nameReferenceID in any table should resolve to Reference.ID. Sounds like there is something wrong in the export then.
For example from https://api.checklistbank.org/dataset/9817/taxon/7ZPP8:
{
  "id":"7ZPP8",
  "name":{
    "id":"49a39406-b04f-4538-9c15-9fbbf1b19da1",
    "scientificName":"Maladera (Maladera) sinica",
    "authorship":"(Hope, 1842)",
    "rank":"species",
    "publishedInId":"9a6f79af-3f19-4dc4-b4fc-d557b52d55b2"
  },
  "status":"accepted",
  "parentId":"7ZLR6",
  "referenceIds":[
    "9a6f79af-3f19-4dc4-b4fc-d557b52d55b2"
  ]
}
That resolves to https://api.checklistbank.org/dataset/9817/reference/9a6f79af-3f19-4dc4-b4fc-d557b52d55b2. I will look into why the archive has different data...
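For anyone who wants to follow these links programmatically, here is a minimal sketch that turns the reference IDs in a taxon record into API URLs. It only uses the URL pattern and field names visible in the example above; the function name `reference_urls` is made up for illustration, and nothing here calls the API itself.

```python
import json

API = "https://api.checklistbank.org"


def reference_urls(dataset_key, taxon):
    """Return API URLs for every reference ID mentioned in a taxon record."""
    ids = list(taxon.get("referenceIds", []))
    # The name may carry its own reference in publishedInId.
    published_in = taxon.get("name", {}).get("publishedInId")
    if published_in and published_in not in ids:
        ids.append(published_in)
    return [f"{API}/dataset/{dataset_key}/reference/{ref_id}" for ref_id in ids]


# Trimmed copy of the taxon record quoted in this thread.
taxon = json.loads("""
{
  "id": "7ZPP8",
  "name": {"publishedInId": "9a6f79af-3f19-4dc4-b4fc-d557b52d55b2"},
  "referenceIds": ["9a6f79af-3f19-4dc4-b4fc-d557b52d55b2"]
}
""")

urls = reference_urls(9817, taxon)
print(urls)
```

Because the name reference and the taxon reference are the same here, the record yields a single URL.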
Oh, I may have spoken too soon! I looked at the first few rows of the dump, saw integer ids and thought there were no UUIDs. But
grep "9a6f79af-3f19-4dc4-b4fc-d557b52d55b2" Reference.tsv
returns the reference above. My bad, assumption is the mother of all f***ups.
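Rather than eyeballing the first few rows, the whole dump can be checked in one pass. The sketch below is an assumption-laden illustration: it assumes the ID column in Reference.tsv is headed `col:ID` (by analogy with `col:referenceID` in NameUsage.tsv), that a cell may hold several comma-separated IDs, and it runs on made-up sample rows rather than the real files.

```python
import csv
import io

REF_COLUMNS = ("col:referenceID", "col:nameReferenceID")


def unresolved_reference_ids(name_usage_file, reference_file, id_column="col:ID"):
    """Return reference IDs used in NameUsage that are absent from Reference."""
    references = csv.DictReader(reference_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    known = {row[id_column] for row in references}
    missing = set()
    usages = csv.DictReader(name_usage_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in usages:
        for column in REF_COLUMNS:
            # Assumption: a cell may contain several comma-separated IDs.
            for ref_id in (row.get(column) or "").split(","):
                ref_id = ref_id.strip()
                if ref_id and ref_id not in known:
                    missing.add(ref_id)
    return missing


# Made-up sample rows: integer IDs first, a UUID further down.
references = io.StringIO(
    "col:ID\tcol:citation\n"
    "1\tplaceholder citation\n"
    "2\tplaceholder citation\n"
    "9a6f79af-3f19-4dc4-b4fc-d557b52d55b2\tplaceholder citation\n"
)
name_usages = io.StringIO(
    "col:ID\tcol:referenceID\tcol:nameReferenceID\n"
    "7ZPP8\t9a6f79af-3f19-4dc4-b4fc-d557b52d55b2\t9a6f79af-3f19-4dc4-b4fc-d557b52d55b2\n"
    "usage-x\t2\tnot-in-reference-file\n"
)

missing = unresolved_reference_ids(name_usages, references)
print(missing)
```

With the real archive, the two `io.StringIO` objects would be replaced by the opened NameUsage.tsv and Reference.tsv files; an empty result means every reference ID resolves.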
FYI I have a reference parser that is inspired by https://anystyle.io and which parses https://api.checklistbank.org/dataset/9817/reference/9a6f79af-3f19-4dc4-b4fc-d557b52d55b2 in more detail. The tool is at https://citation-parser.herokuapp.com; I've trained it on some standard sources and some taxonomic references as well. It's not perfect, and I hope to test it on CoL at some point.
This reference (and many others) is also in Wikidata; at some point it might be nice to add Wikidata ids to CoL references (assuming the reference ids are stable).
Is your parser using the anystyle code, just with different training?
No, it's using a PHP rewrite of Perl code based on ParsCit, which uses a different CRF engine (I couldn't get the one anystyle uses to compile on my Mac). Plus I couldn't figure out how to run anystyle as a web service. My code is here: citation-parsing.
However, my training data originally came from anystyle, so I use their XML format for training. I also have a simple editor for making additional training data, which I use to mark up references that fail to parse correctly. It's super crude but might be of interest: http://citation-parser.herokuapp.com/editor.html. Although I haven't tested it, I'm assuming my expanded training data should work with anystyle.
Final question, are the reference ids stable? Do they persist across releases of CoL? Just wondering whether I should attempt to link to external identifiers, which would only make sense if CoL identifiers persisted.
No, unfortunately they are not stable at this stage. Whenever you see UUIDs these are unstable identifiers...
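That rule of thumb is easy to apply when processing the dump. A minimal sketch, using only Python's standard `uuid` module; the helper name `looks_like_uuid` is made up, and treating every UUID-shaped ID as unstable is just the heuristic stated above, not a guarantee from the API.

```python
import uuid


def looks_like_uuid(identifier):
    """True if the string is a UUID in canonical hyphenated form."""
    try:
        # Comparing with str() rejects non-canonical spellings (no hyphens, braces).
        return str(uuid.UUID(identifier)) == identifier.lower()
    except ValueError:
        return False


for identifier in ("9a6f79af-3f19-4dc4-b4fc-d557b52d55b2", "7ZPP8"):
    status = "unstable (UUID)" if looks_like_uuid(identifier) else "not a UUID"
    print(identifier, "->", status)
```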
I thought this might be the case :(
On 3 May 2022, at 16:06, Markus Döring wrote:
No, unfortunately they are not stable at this stage. Whenever you see UUIDs these are unstable identifiers...
I definitely want to make those stable, along with the name IDs. But it is not on my immediate priority list, I am afraid, and it is not something I can do in a few hours.
I understand, and as it stands I can generate most of the RDF I need from the existing dump. One thing I can't do is generate name relationships such as basionym. The NameRelationship file has pairs of UUIDs (I'm presuming these are internal identifiers for names) but these don't seem to be in the NameUsage table. Unless I'm missing something there's no way to make the link between a taxon, its names, and the relationships of those names given the current data dump?
Sorry to be a pain, I just have this fantasy of being able to recreate the CoL interface using RDF (partly as a way to check that the RDF I'm generating makes sense).
The CLB model distinguishes between Names and NameUsages, e.g. Taxon or Synonym. Both have their own identifiers, and so far we have only managed the usage IDs to ensure they are stable. Hence you still get volatile name identifiers as UUIDs. As with references, names are on my list to get stable ids, but I can't say when that will happen.
https://api.checklistbank.org/dataset/9817/taxon/7ZPP8
The ColDP archive should also have a nameID, but that seems to be missing. I will add that soon....
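Once a name ID is present in the usage table, linking usages to name relationships becomes a two-step lookup. The sketch below is hypothetical throughout: the column names `col:nameID`, `col:relatedNameID` and `col:type` are assumptions about the archive layout (the `col:nameID` column in NameUsage.tsv does not exist yet), and the rows are made-up sample data.

```python
import csv
import io
from collections import defaultdict


def relations_by_usage(name_usage_file, name_relation_file):
    """Map each usage ID to (relation type, related usage IDs) via shared name IDs."""
    usages_by_name = defaultdict(list)
    usages = csv.DictReader(name_usage_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in usages:
        usages_by_name[row["col:nameID"]].append(row["col:ID"])

    result = defaultdict(list)
    relations = csv.DictReader(name_relation_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in relations:
        related = usages_by_name.get(row["col:relatedNameID"], [])
        for usage_id in usages_by_name.get(row["col:nameID"], []):
            result[usage_id].append((row["col:type"], list(related)))
    return dict(result)


# Made-up rows; real name IDs would be UUIDs.
name_usages = io.StringIO(
    "col:ID\tcol:nameID\n"
    "usage-1\tname-a\n"
    "usage-2\tname-b\n"
)
name_relations = io.StringIO(
    "col:nameID\tcol:relatedNameID\tcol:type\n"
    "name-a\tname-b\tbasionym\n"
)

links = relations_by_usage(name_usages, name_relations)
print(links)
```

The lists allow for one name being used by several usages, which is why the join goes through name IDs rather than assuming a one-to-one mapping.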
Following on from https://github.com/CatalogueOfLife/portal/issues/185 I'm contemplating generating JSON-LD from the CoL data dump (rather than scrape JSON-LD from every page :wink: ).
To do this I want to link references to names so that I can recreate schema:isBasedOn, but there doesn't seem to be a link between NameUsage.tsv and Reference.tsv. NameUsage.tsv has columns col:referenceID and col:nameReferenceID, which have UUIDs, but these don't seem to occur anywhere else. Reference.tsv has local integer ids but not these UUIDs. Unless I'm missing something, this means there is no way to link names to literature from the ColDP Archive download? As it stands I can't generate JSON-LD from the data dump 😞
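To make the goal concrete, here is a rough sketch of the kind of JSON-LD this link would enable. The choice of `TaxonName` and `CreativeWork` as schema.org types, and the use of the ChecklistBank API URL as the reference `@id`, are assumptions for illustration only and not what the CoL portal necessarily emits.

```python
import json


def name_jsonld(scientific_name, reference_url):
    """Build a minimal JSON-LD document linking a name to its reference."""
    return {
        "@context": "https://schema.org",
        "@type": "TaxonName",  # assumed type
        "name": scientific_name,
        # The link that needs a resolvable reference ID in the dump.
        "isBasedOn": {"@type": "CreativeWork", "@id": reference_url},
    }


doc = name_jsonld(
    "Maladera (Maladera) sinica",
    "https://api.checklistbank.org/dataset/9817/reference/"
    "9a6f79af-3f19-4dc4-b4fc-d557b52d55b2",
)
print(json.dumps(doc, indent=2))
```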