Closed koenedaele closed 1 year ago
See OnroerendErfgoed/atramhasis#475
Hi there! This is quite problematic, because in SKOS and/or RDF we are trying to move away from local textual identifiers. Hence, adding a dct:identifier or something similar is a no-go. Even if we did, what would they mean in the long run and outside of Atramhasis? And how can you tell that the object of dct:identifier is numeric if the datatype is missing? This issue is currently blocking.
In researching this, I noticed we wrote some of these things almost ten years ago, so I'm not sure I always remember why we did what we did. We did indeed make compromises so that the skosprovider interface could handle Python dicts, CSV files, relational databases and RDF files.
As much as I love URIs, for certain local operations a simple ID works very well and makes sense to devs. A lot of the thesauri I know, even the AAT, e.g., do export their local identifiers as well. The skosprovider interface requires every concept to have a unique ID, but this can be any string, even the URI. So the skosprovider_rdf interface, when importing an RDF file, checks for dct:identifier and dc:identifier and, if those are not present, falls back to the URI. So using the skosprovider_rdf Python class outside of Atramhasis with an RDF file that does not contain any identifiers should work as you expect.
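To make the fallback concrete, here is a minimal sketch of the lookup order described above. It is not the skosprovider_rdf code: `pick_local_id` and the `properties` dict are hypothetical stand-ins for the real graph lookup, and trying dct:identifier before dc:identifier is an assumption.

```python
# Predicate URIs for dct:identifier and dc:identifier.
DCT_IDENTIFIER = "http://purl.org/dc/terms/identifier"
DC_IDENTIFIER = "http://purl.org/dc/elements/1.1/identifier"


def pick_local_id(subject_uri, properties):
    """Return a local id for one subject, falling back to its URI.

    `properties` is a hypothetical dict mapping predicate URIs to literal
    values for the subject; the real provider reads these from an RDF graph.
    """
    # Assumed order: dct:identifier first, then dc:identifier.
    for predicate in (DCT_IDENTIFIER, DC_IDENTIFIER):
        value = properties.get(predicate)
        if value:
            return str(value)
    # No identifier present: the URI itself serves as the unique id.
    return subject_uri


with_id = pick_local_id("http://example.org/concepts/7", {DCT_IDENTIFIER: "7"})
without_id = pick_local_id("http://example.org/concepts/oak", {})
```

With this sketch, a concept without any identifier triple simply keeps its URI as its id, which is why a plain RDF file can still satisfy the "every concept has a unique ID" requirement.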
Importing is done by reading something with whatever provider supports the file or service format, mapping it to the skosprovider interface and then using the skosprovider_rdf.utils.rdf_c_dumper on a concept or collection. The dumping function has no idea if that concept or collection was loaded from RDF, CSV, SQL or something else. Which is mostly a good thing, but here it's making life a bit more complicated.
Still, from looking at the code it's probably not as bad as it looks. There are a few things to tackle:
(This actually seems like part skosprovider_sqlalchemy issue and part Atramhasis issue, so I might move the discussion back to that repo. Changing the concept_id field to a String is a major backwards-compatibility break though, so we need to be careful with it.)
In researching this, I noticed we wrote some of these things almost ten years ago, so I'm not sure I always remember why we did what we did. We did indeed make compromises so that the skosprovider interface could handle Python dicts, CSV files, relational databases and RDF files.
That's perfectly understandable of course, just trying to get an overview of what they are.
As much as I love URIs, for certain local operations a simple ID works very well and makes sense to devs. A lot of the thesauri I know, even the AAT, e.g., do export their local identifiers as well.
Sure, they are definitely useful in practice. Having to depend on them is another thing though (but that's not the case here anyway).
The skosprovider interface requires every concept to have a unique ID, but this can be any string, even the URI. So the skosprovider_rdf interface, when importing an RDF file, checks for dct:identifier and dc:identifier and, if those are not present, falls back to the URI. So using the skosprovider_rdf Python class outside of Atramhasis with an RDF file that does not contain any identifiers should work as you expect.
Importing is done by reading something with whatever provider supports the file or service format, mapping it to the skosprovider interface and then using the skosprovider_rdf.utils.rdf_c_dumper on a concept or collection. The dumping function has no idea if that concept or collection was loaded from RDF, CSV, SQL or something else. Which is mostly a good thing, but here it's making life a bit more complicated.
Still, from looking at the code it's probably not as bad as it looks. There are a few things to tackle:
- Internal storage. As you noticed, skosprovider_sqlalchemy only supports Int for the concept_id field. This would need to be changed to String. The actual db relations between tables already use a proper autogenerated id, so changing this would not break anything. I'm actually wondering why we went with Int for the concept.concept_id field. It does seem like that could have been String from the beginning. I have a vague recollection of there not even being two fields in the first prototypes, so maybe we ran out of time to go all the way. Perhaps it had something to do with sorting, since sorting numeric IDs stored as text produces results users generally don't like. Still, sorting by ID is not that common for this app. So, step 1 would be changing concept.concept_id from Int to String. An alternative would be to have a new field, concept.concept_id_str or so, for textual IDs, but that would require a lot of testing to determine which concept_id to use.
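The sorting concern mentioned in this item is easy to reproduce. The following is only an illustrative sketch: `natural_key` is a hypothetical helper, not something skosprovider_sqlalchemy provides, and it shows one way to order string IDs so that numeric ones still sort by value.

```python
import re


def natural_key(concept_id):
    """Build a sort key that compares digit runs by numeric value.

    Hypothetical helper for illustration only.
    """
    # re.split with a capturing group puts the digit runs at odd indices.
    parts = re.split(r"([0-9]+)", concept_id)
    key = []
    for index, part in enumerate(parts):
        if part == "":
            continue
        if index % 2 == 1:
            key.append((0, int(part)))  # digit run: compare as a number
        else:
            key.append((1, part))  # other text: compare as a string
    return key


ids = ["10", "9", "2", "oak"]
plain_order = sorted(ids)  # lexicographic order, as with a text column
natural_order = sorted(ids, key=natural_key)
```

Here `plain_order` places "10" before "2" and "9", which is the result users tend to dislike, while `natural_order` keeps the numeric IDs in ascending order and puts the textual ID last.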
So this was mainly what was not clear to me. But it seems it's easily fixed by switching the datatype then! The line https://github.com/OnroerendErfgoed/skosprovider_sqlalchemy/blob/develop/skosprovider_sqlalchemy/utils.py#L49 is just not compatible with the logic in _get_id_for_subject(), which implicitly forces you to add a local identifier if you want to use Atramhasis or any other application of skosprovider_sqlalchemy.
- UI. Currently, the ID is non-editable in Atramhasis. It gets created by the application. We could create random textual IDs or GUIDs, but some people certainly like to choose for themselves (such as in atramhasis#475, importing conceptscheme information when reading from an RDF file). So there might be 3 common ID generation strategies: ascending numeric, GUID, manual text. Every provider would need to be configured to know how to generate IDs. I agree that local IDs are not required in RDF, but most people seem to prefer creating URIs by adding a local ID to a base URI that makes it unique. That might not be best practice, but it happens a lot, so we can't forbid it. It's how the standard uri_generator works as well (template string+concept_id).
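The three strategies named in this item could be sketched roughly as follows. Everything here is hypothetical: `numeric_ids`, `guid_ids`, `manual_ids` and `make_uri` are illustrative names, not Atramhasis or skosprovider functions, and the `%s` template is just one possible way to express "template string+concept_id".

```python
import itertools
import uuid


def numeric_ids(start=1):
    """Ascending numeric strategy: returns a callable yielding "1", "2", ..."""
    counter = itertools.count(start)
    return lambda: str(next(counter))


def guid_ids():
    """GUID strategy: returns a callable yielding random UUID strings."""
    return lambda: str(uuid.uuid4())


def manual_ids(chosen):
    """Manual strategy: hands out ids the user typed in, in order."""
    iterator = iter(chosen)
    return lambda: next(iterator)


def make_uri(template, concept_id):
    """Combine a template string and a local id into a URI (assumed format)."""
    return template % concept_id


next_numeric = numeric_ids()
first_id = next_numeric()
second_id = next_numeric()
next_manual = manual_ids(["oak", "beech"])
manual_uri = make_uri("http://example.org/concepts/%s", next_manual())
```

Whichever strategy a provider is configured with, the result is a string, so all three only work once the stored concept_id is no longer restricted to integers.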
That's actually fine. In fact, we do that too (sometimes) and would like to keep the ability to do that. Problems arise when software considers the string part alone as the real identifier and not the URI. The fact that it's part of a URI forces it to be globally unique.
We could go for fully manual editing of URIs, but I don't think any regular user wants to be typing hostnames over and over.
I agree, and I don't think that should be default behaviour. The UI, however, should be able to handle URIs that were created outside of Atramhasis without minting new ones or creating derivatives. And occasionally you need to manually rewrite the automatically generated URI path for some practical reason.
Even in a vanilla SKOS file I think the URI will always contain some form of local ID (even if it's a GUID). And once you want the user to be able to add new concepts to the scheme, the application does need to know how to construct URIs for that scheme. So, step 2 would be allowing a system admin to choose how to generate a local identifier (number, GUID, manual entry).
Preferably yes, but I'm fine with the default numbering for newly created concepts (URIs are opaque anyway).
- Gluing it all together. There are probably some more minor things to think about, like importing data that combines ID generation strategies and how this all fits together.
Let me know if we can help!
(This actually seems like part skosprovider_sqlalchemy issue and part Atramhasis issue, so I might move the discussion back to that repo. Changing the concept_id field to a String is a major backwards-compatibility break though, so we need to be careful with it.)
Okay, let's continue the UI part there. I'll open a new one here on changing from int to string.
The UI discussion has been moved to Atramhasis, and the database change is now tracked in #87. Closing this one.
skosprovider_sqlalchemy expects all identifiers for concepts or collections to be numeric. Trying to import a provider with non-numeric IDs anyway leads to a ValueError. Look into importing non-numeric IDs or better error handling.
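As a rough illustration of the "better error handling" option, the sketch below assumes the ValueError comes from casting the id with int() during import. `NonNumericIdError` and `to_numeric_id` are hypothetical names, not part of skosprovider_sqlalchemy.

```python
class NonNumericIdError(ValueError):
    """Hypothetical error raised when a concept id cannot be stored as an integer."""


def to_numeric_id(raw_id):
    """Convert an id to int, or fail with a message that names the offending id.

    Assumes the import currently does a bare int() cast on the id.
    """
    try:
        return int(raw_id)
    except (TypeError, ValueError) as exc:
        raise NonNumericIdError(
            f"Concept id {raw_id!r} is not numeric and cannot be imported "
            "while concept_id is an integer column."
        ) from exc


numeric = to_numeric_id("42")

try:
    to_numeric_id("oak")
    message = None
except NonNumericIdError as error:
    message = str(error)
```

Because the hypothetical error subclasses ValueError, existing callers that already catch ValueError would keep working, while the message now says which id caused the failure.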