Creating providers - backend

koenedaele commented 1 year ago

This is a major update that will allow creating new providers without editing code or configuration as part of #724

We will split provider creation from conceptscheme editing so we can have a technical admin that can configure providers while regular users can edit conceptscheme metadata. This will also allow separating internal providers (with an editable conceptscheme) from external providers (such as Getty where the conceptscheme is uneditable)

Reading of providers will be handled in #790:

GET /providers: Read all configured providers
GET /providers/{id}: Read a single configured provider

Tasks:

Create a providers table in the DB
Create CRUD endpoints for providers:
- POST /providers: Register a new SQLAlchemyProvider, and create a conceptscheme for it as well unless a conceptscheme_id is provided.
- PUT /providers/{id}: Update a configured provider
- DELETE /providers/{id}: Delete a configured provider. Main question is, does this delete the conceptscheme and all concepts as well? Makes it a very dangerous operation.
Update https://github.com/OnroerendErfgoed/atramhasis/blob/develop/atramhasis/skos/__init__.py#L22 to read the configured providers from the DB

Creating a provider, immediately creates a conceptscheme as well. Changing the conceptscheme_id once it's been set is impossible.

Payload for a creating a SQLAlchemyprovider:

{
   'id': 'ERFGOEDTYPES', // string, required on edit, if not set on create will be equal to the id of the created conceptscheme, passes to provider.metadata.id
   'conceptscheme_uri': 'https://id.erfgoed.net/thesauri/conceptschemes' // string, required on create, will be used to create the conceptscheme, will be ignored on edit and is immutable once created (unless edited in database for people who really know what they're doing)
   'uri_pattern': 'https://id.erfgoed.net/thesauri/erfgoedtypes/%s', // string, required, needs to contain one %s, will be passed to a UriPatternGenerator at provider.uri_generator, should be not be changed once concepts have been created unless you really know what you're doing, perhaps add a warning in the UI
   'default_language': 'nl-BE', // string, not required, if not present or None don't pass to provider, if present must be exactly one IANA language string, passes to provider.metadata.default_language
   'subject': [], // list of strings, not required, if not present or None don't pass to provider, if present must be a list of strings, only valid value at the moment is 'hidden'
   'force_display_language: None, // string, not required, if not present or None don't pass to provider, if present must be exactly one IANA language string, , passes to provider.metadata.atramhasis.force_display_language,
   'id_generation_strategy: 'numeric, // string, not required, if not present or None don't pass to provider, if present must be one of numeric, guid or manual, passes to provider.metadata.atramhasis.id_generation_strategy,
}

In the DB this will require a table provider with the fields:

id, string, PK
conceptscheme_id, integer, required, FK to conceptscheme.id
uri_pattern, string, required
default_language: string, not required
subject: either a list serialised as string or something like an array, not required
force_display_language: string, not required
id_generation_strategy: string, not required or required and defaults to numeric

An alternative would be to have a table with an id field and put some or all of the config in a JSON field to make it easy to extend the configuration. Might work well since we don't really need to query for providers having language X or strategy Y. Maybe we'd query for providers with subject hidden, but I don't think so. And the number of providers in one instance is generally small, so reading them one by one is not a big deal. So, open to suggestions. If we do add the dataset attribute back in (see further down) I do expect that to be a JSON field.

So, to create a new provider, the smallest POST possible is:

{
  'conceptscheme_uri': https://id.erfgoed.net/thesauri/conceptschemes',
  'uri_pattern': 'https://id.erfgoed.net/thesauri/erfgoedtypes/%s'
}

In this case a new provider will be created, the id of the provider will be set to the DB id of the created conceptscheme (cast to a string), a urigenerator will be initialised and a conceptscheme will be created with the provided uri.

A more common way to create a provider would be:

{
  'id': 'ERFGOEDTYPES',
  'conceptscheme_uri': https://id.erfgoed.net/thesauri/conceptschemes',
  'uri_pattern': 'https://id.erfgoed.net/thesauri/erfgoedtypes/%s'
}

In this case a provider will be created with id 'ERFGOEDTYPES', a urigenerator will be initialised and a conceptscheme will be created with the provided uri.

Initialising the providers on a request will have to call a function that reads providers from the DB. As opposed to earlier deisgn, manual config of SQLAlchemyProviders will no longer be possible. Manual config is still necessary for external providers. So, we still expect the user to have a create_registry function, but for most users the default version will never need to be edited and will look something like:

def create_registry(request):
    try:
        registry = Registry(instance_scope='threaded_thread')

        registry = register_providers_from_db(registry, request)

        return registry
    except Exception as e:
        // handle errors

Someone who wants to add Getty would have something like:

def create_registry(request):
    try:
        registry = Registry(instance_scope='threaded_thread')

        registry = register_providers_from_db(registry, request)

        # use 'subject': ['external'] for read only external providers
        # (only available in REST service)

        getty_session = CacheControl(requests.Session(), heuristic=ExpiresAfter(weeks=1))

        aat = AATProvider(
            {'id': 'AAT', 'subject': ['external']},
            session=getty_session
        )
        registry.register_provider(aat)

        return registry
    except Exception as e:
        // handle errors

One drawback of our new implementation is that working with void datasets is no longer supported. However, this is an experimental feature that was never advertised to end users, so we're not really breaking anything. If someone absolutely wants to retain BC, they can still choose not to use the _register_providers_fromdb method to fall back to the old way. Or somehow add those old datasets somewhere after initialisation. Version 2.0.0 will deprecate the void.ttl support. A later version will try to integrate something like DCAT. So, for this feature it's easier to ignore it.

We will also need an upgrade path for version 1.0.0 users, something like a script that creates a provider in the database for every existing conceptscheme.

Wim-De-Clercq commented 1 year ago

My current proposal for the model would look like this:

class Provider(Base):
    id = Column(String, primary_key=True)
    conceptscheme_id = Column(
        Integer,
        ForeignKey(ConceptScheme.id),
        nullable=False,
    )  # metadata.conceptscheme_id or metadata.id
    metadata = Column(JSON, nullable=False)
    # + other things not in metadata like uri pattern.

    # For every known field which is taken out of metadata, define a hybrid_property
    @hybrid_property
    def default_language(self):
        return self.metadata.get("default_language")

    @default_language.setter
    def default_language(self, value):
        self.metadata["default_language"] = value

    # Same for expand_strategy, subject, atramhasis.force_display_language, atramhasis.id_generation_strategy

I would add metadata to the JSON, all fields which are already part of the JSON object will be removed from this metadata. For example, if the metadata contains {"default_language": "nl", "extra": "test"} then the provider JSON would contain

{
  "default_language": "nl",
  ...
  "metadata": {
    "extra": "test"
  }
}

If people were to send default_language in metadata, this would raise a bad request.

1 more thing which is not backwards compatible is the urigenerator, which could be anything and will now be exclusively a UriPatternGenerator

koenedaele commented 1 year ago

I know about the uripatterngenerator. Don't think we've ever see a use for another one? Other option would be to pass both the generator and it's arguments, but that seems like it can get complicated quickly?

Only other sensible form of uri patterns I can think of are uri's that are basically the document's url with and attached #id. Something like https://thesaurus.onroerenderfgoed.be/conceptschemes/TREES/c/1#id. Can be done with a pattern or with a generator that uses the pyramid routing to generate a url. That last one would be the cleanest solution I think, but more work than I want to handle right now.

koenedaele commented 1 year ago

@JonasGe and @Wim-De-Clercq What do we do about the DELETE provider?

Option 1: Delete the provider, but retain the conceptscheme and concepts. But since there's no way to create a new provider for an existing conceptscheme, this seems rather pointless.
Option 2: Don't delete the provider, but make it non-functional? Something like giving at a subject or some other kind of flag that it shouldn't be loaded by Atramhasis?
Option 3: Have it delete the provider and cascade to the conceptscheme and concepts. Makes it a dangerous operation to perform.

All in all, I think option 3 is the best one. I would like to see different permissions for CUD providers than creating concepts. Or is this something that's completely up to the implementor? I would assume we have editors who can edit a conceptscheme and create, edit and delete concepts and collections and admins who can create, edit and delete providers. Reading providers is allowed to all I would say.

JonasGe commented 1 year ago

I also think option 3 is the best. The documentation (and probably a comment block in the code) should clearly reflect that this is an irreversible action which will cascade delete every associated conceptscheme and concepts.

koenedaele commented 1 year ago

Once this feature and #732 land, we'll need to review all the docs anyway. A lot has changed.

Wim-De-Clercq commented 1 year ago

yes 3 is the best, the other 2 seem rather pointless. Option 2 could be achieved by disabling the view with permissions or something. It's better to have the option.

I would like to see different permissions for CUD providers than creating concepts

it's partially up to us. We have to define unique permission names on the views for the provider permissions vs the concept permissions.

We do, however, not implement a security factory. So by default these permissions are ignored. But an implementor can add one, and over there define which role gets which permission.

cedrikv commented 1 year ago

@Wim-De-Clercq I get an error when trying to delete a provider (created with the initializedb script): 2023-05-09 15:53:41,484 WARNI [atramhasis.views.exception_views][waitress-2] (sqlite3.IntegrityError) NOT NULL constraint failed: concept.conceptscheme_id [SQL: UPDATE concept SET conceptscheme_id=? WHERE concept.id = ?] [parameters: ((None, 14), (None, 15), (None, 16), (None, 17), (None, 18), (None, 19), (None, 20), (None, 21) ... displaying 10 of 73 total bound parameter sets ... (None, 85), (None, 86))] (Background on this error at: https://sqlalche.me/e/14/gkpj) 2023-05-09 15:53:41,486 INFO [pyramid_debugtoolbar][waitress-2] Squashed sqlalchemy.exc.IntegrityError at http://0.0.0.0:6543/providers/3 traceback url: http://0.0.0.0:6543/_debug_toolbar/313339373538393639333739353230/exception

OnroerendErfgoed / atramhasis

Creating providers - backend #744