CatalogueOfLife / general

The Catalogue of Life

Lack of col:nameReferenceID in Reference.tsv means I can't link names and references #97

Closed rdmpage closed 2 years ago

rdmpage commented 2 years ago

Following on from https://github.com/CatalogueOfLife/portal/issues/185 I'm contemplating generating JSON-LD from the CoL data dump (rather than scrape JSON-LD from every page :wink: ).

To do this I want to link references to names so that I can recreate schema:isBasedOn, but there doesn't seem to be a link between NameUsage.tsv and Reference.tsv. NameUsage.tsv has columns col:referenceID and col:nameReferenceID which hold UUIDs, but these don't seem to occur anywhere else. Reference.tsv has local integer ids but not these UUIDs. Unless I'm missing something, this means there is no way to link names to literature from the ColDP Archive download?

As it stands I can't generate JSON-LD from the data dump 😞

mdoering commented 2 years ago

referenceID or nameReferenceID in any table should resolve to Reference.ID. Sounds like there is something wrong in the export then.

For example from https://api.checklistbank.org/dataset/9817/taxon/7ZPP8:

{
  "id": "7ZPP8",
  "name": {
    "id": "49a39406-b04f-4538-9c15-9fbbf1b19da1",
    "scientificName": "Maladera (Maladera) sinica",
    "authorship": "(Hope, 1842)",
    "rank": "species",
    "publishedInId": "9a6f79af-3f19-4dc4-b4fc-d557b52d55b2"
  },
  "status": "accepted",
  "parentId": "7ZLR6",
  "referenceIds": [
    "9a6f79af-3f19-4dc4-b4fc-d557b52d55b2"
  ]
}

That resolves to https://api.checklistbank.org/dataset/9817/reference/9a6f79af-3f19-4dc4-b4fc-d557b52d55b2. I will look into why the archive has different data...
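To make the resolution step concrete, here is a minimal sketch that collects the reference ids from a taxon record like the one above and builds the matching reference URLs. The URL pattern is copied from the links in this thread; the function name `reference_urls` is made up for illustration, and nothing here calls the API.

```python
import json

API_ROOT = "https://api.checklistbank.org"


def reference_urls(taxon: dict, dataset_key: int) -> list:
    """Collect reference ids from a taxon record and build one URL per id.

    Hypothetical helper: it only assumes the two fields visible in the
    example record (name.publishedInId and referenceIds).
    """
    ids = []
    published_in = taxon.get("name", {}).get("publishedInId")
    if published_in:
        ids.append(published_in)
    for ref_id in taxon.get("referenceIds", []):
        if ref_id not in ids:  # avoid listing the same reference twice
            ids.append(ref_id)
    return [f"{API_ROOT}/dataset/{dataset_key}/reference/{ref_id}" for ref_id in ids]


# Trimmed copy of the record quoted in this thread.
taxon = json.loads("""
{
  "id": "7ZPP8",
  "name": {
    "id": "49a39406-b04f-4538-9c15-9fbbf1b19da1",
    "publishedInId": "9a6f79af-3f19-4dc4-b4fc-d557b52d55b2"
  },
  "referenceIds": ["9a6f79af-3f19-4dc4-b4fc-d557b52d55b2"]
}
""")

urls = reference_urls(taxon, 9817)
print(urls)
```

Because publishedInId and the single entry in referenceIds are the same UUID here, only one URL comes back.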

rdmpage commented 2 years ago

Oh, I may have spoken too soon! I looked at the first few rows of the dump, saw integer ids and thought there were no UUIDs. But

grep "9a6f79af-3f19-4dc4-b4fc-d557b52d55b2" Reference.tsv 

returns the reference above. My bad, assumption is the mother of all f***ups.
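A less error-prone check than eyeballing the first rows is to index every Reference id and then look up each name's reference id against that index. The sketch below does this under some assumptions: headers carry a `col:` prefix (as in the column names mentioned above), Reference.tsv has an ID column, and the files are plain tab-separated text. The sample rows are made up apart from the ids quoted in this thread.

```python
import csv
import io


def read_tsv(handle):
    """Yield one dict per row, with any 'col:' prefix stripped from header names."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [h.removeprefix("col:") for h in next(reader)]
    for row in reader:
        yield dict(zip(header, row))


def link_names_to_references(name_usage, reference):
    """Return (links, unresolved) as lists of (usage id, reference id) pairs."""
    # Index the whole reference file, not just its first rows.
    refs = {row["ID"]: row for row in read_tsv(reference)}
    links, unresolved = [], []
    for usage in read_tsv(name_usage):
        ref_id = usage.get("nameReferenceID", "")
        if not ref_id:
            continue  # this usage has no publishing reference recorded
        target = links if ref_id in refs else unresolved
        target.append((usage["ID"], ref_id))
    return links, unresolved


# Made-up sample data; only the ids come from the thread.
REFERENCE = (
    "col:ID\tcol:citation\n"
    "1\tplaceholder citation A\n"
    "9a6f79af-3f19-4dc4-b4fc-d557b52d55b2\tplaceholder citation B\n"
)
NAME_USAGE = (
    "col:ID\tcol:nameReferenceID\tcol:scientificName\n"
    "7ZPP8\t9a6f79af-3f19-4dc4-b4fc-d557b52d55b2\tMaladera (Maladera) sinica\n"
    "7ZLR6\t\tplaceholder parent name\n"
)

links, unresolved = link_names_to_references(
    io.StringIO(NAME_USAGE), io.StringIO(REFERENCE)
)
print(links, unresolved)
```

With real files, pass `open("NameUsage.tsv", encoding="utf-8", newline="")` and the equivalent for Reference.tsv instead of the `io.StringIO` objects. A non-empty `unresolved` list would point to a genuine export problem.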

rdmpage commented 2 years ago

FYI I have a reference parser that is inspired by https://anystyle.io and which parses https://api.checklistbank.org/dataset/9817/reference/9a6f79af-3f19-4dc4-b4fc-d557b52d55b2 in more detail. The tool is at https://citation-parser.herokuapp.com, I've trained it on some standard sources and some taxonomic references as well. Not perfect, I hope to test it on CoL at some point.

This (and many other references) are also in Wikidata, at some point it might be nice to add Wikidata ids to CoL references (assuming the reference ids are stable).

mdoering commented 2 years ago

Is your parser using the anystyle code, just with different training data?

rdmpage commented 2 years ago

No, it's using a PHP rewrite of Perl code based on ParsCit, which uses a different CRF engine (I couldn't get the one anystyle uses to compile on my Mac). Plus I couldn't figure out how to run anystyle as a web service. My code is here: citation-parsing.

However, my original training data came from anystyle, so I use their XML format for training. I also have a simple editor for making additional training data, which I use to mark up references that fail to parse correctly. Super crude but might be of interest: http://citation-parser.herokuapp.com/editor.html. Although I haven't tested it, I'm assuming my expanded training data should work with anystyle.

rdmpage commented 2 years ago

Final question, are the reference ids stable? Do they persist across releases of CoL? Just wondering whether I should attempt to link to external identifiers, which would only make sense if CoL identifiers persisted.

mdoering commented 2 years ago

No, unfortunately they are not stable at this stage. Whenever you see UUIDs these are unstable identifiers...

rdmpage commented 2 years ago

I thought this might be the case :(

On 3 May 2022, at 16:06, Markus Döring @.***> wrote:

No, unfortunately they are not stable at this stage. Whenever you see UUIDs these are unstable identifiers...


mdoering commented 2 years ago

I definitely want to make that a stable one, and the name IDs too. But I'm afraid it is not on my immediate priority list, and it's nothing I can do in a few hours.

rdmpage commented 2 years ago

I understand, and as it stands I can generate most of the RDF I need from the existing dump. One thing I can't do is generate name relationships such as basionym. The NameRelationship file has pairs of UUIDs (I'm presuming these are internal identifiers for names) but these don't seem to be in the NameUsage table. Unless I'm missing something there's no way to make the link between a taxon, its names, and the relationships of those names given the current data dump?

Sorry to be a pain, I just have this fantasy of being able to recreate the CoL interface using RDF (partly as a way to check that the RDF I'm generating makes sense).

mdoering commented 2 years ago

The CLB model differentiates between Names and NameUsages, e.g. Taxon or Synonym. Both have their own identifiers, and so far we have only managed the usage IDs to ensure they are stable. Hence you still get volatile name identifiers as UUIDs. Similar to references, names are on my list to get stable ids, but I can't say when that will happen.

https://api.checklistbank.org/dataset/9817/taxon/7ZPP8

The ColDP archive should also have a nameID, but that seems to be missing. I will add that soon....
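Once NameUsage.tsv carries a nameID column, resolving relationships to usages could look roughly like the sketch below. This is an illustration under assumptions, not the actual export layout: the NameRelationship columns are assumed to be nameID, relatedNameID and type, headers are assumed to carry a `col:` prefix, and all sample rows are made up apart from the ids quoted above.

```python
import csv
import io


def read_tsv(handle):
    """Yield one dict per row, with any 'col:' prefix stripped from header names."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [h.removeprefix("col:") for h in next(reader)]
    for row in reader:
        yield dict(zip(header, row))


def relationships_by_usage(name_usage, name_relationship):
    """Translate name-to-name relationships into (usage id, type, usage id) triples."""
    usages = list(read_tsv(name_usage))
    if not usages or "nameID" not in usages[0]:
        # This is the situation described in the thread: no column to join on.
        raise KeyError("NameUsage has no nameID column; relationships cannot be resolved")
    # One name can back several usages, so keep a list per name id.
    usage_ids_by_name = {}
    for usage in usages:
        usage_ids_by_name.setdefault(usage["nameID"], []).append(usage["ID"])
    resolved = []
    for rel in read_tsv(name_relationship):
        for source in usage_ids_by_name.get(rel["nameID"], []):
            for target in usage_ids_by_name.get(rel["relatedNameID"], []):
                resolved.append((source, rel["type"], target))
    return resolved


# Made-up sample data; "USAGE-B" and "name-b" are placeholders.
NAME_USAGE = (
    "col:ID\tcol:nameID\n"
    "7ZPP8\t49a39406-b04f-4538-9c15-9fbbf1b19da1\n"
    "USAGE-B\tname-b\n"
)
NAME_RELATIONSHIP = (
    "col:nameID\tcol:relatedNameID\tcol:type\n"
    "49a39406-b04f-4538-9c15-9fbbf1b19da1\tname-b\tbasionym\n"
)

resolved = relationships_by_usage(
    io.StringIO(NAME_USAGE), io.StringIO(NAME_RELATIONSHIP)
)
print(resolved)
```

Since the name UUIDs are not stable across releases, any such join only holds within a single archive download.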