Global Resource Expiration System

USER STORY

As a developer I want to create temporary resources which expire after a certain amount of time. The following must apply:

resources have a duration or TTL (time to live)
a resource can have its TTL extended
once TTL expires the resource is removed

Current Implementation

Submodule resource_manager in the webserver service:

Uses: Redis
Concept:
- Resource counting:
  - for each user connected via websocket, an entry of type HASH-KEY (a Redis hash, comparable to a dict) is created, keyed by the user ID and the client session ID (one client session corresponds to one browser tab)
  - each resource of the client is added to that HASH-KEY (currently the socket.io ID and the project UUID)
- User connection (alive?):
  - An "Alive" key is created for the user at websocket connection
  - The JS client refreshes the "Alive" key at intervals of half the TTL (15/2 = 7.5 minutes)
  - At user disconnection the "Alive" key gets a TTL of 15 minutes
  - If the user reconnects, then the "Alive" key TTL is removed
- Garbage collection:
  - Runs as a background task
  - Checks Redis for entries
  - If there is no "Alive" key but there are still opened resources (projects), it closes the project and removes the Redis entries
    - If the user is a "GUEST", the user's projects and files are also removed.
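
The bookkeeping described above can be modeled without Redis to make the rules concrete. The following is a minimal in-memory sketch, not the webserver's actual code: the class `FakeRegistry`, its method names, and the injected `now` timestamps are hypothetical, and plain dicts stand in for the Redis HASH-KEY and "Alive" entries. The periodic refresh from the JS client and the extra cleanup for "GUEST" users are left out for brevity.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

ALIVE_TTL_SECONDS = 15 * 60  # TTL given to the "Alive" key at disconnection

# (user ID, client session ID); one entry per browser tab
SessionKey = Tuple[str, str]


@dataclass
class FakeRegistry:
    """In-memory stand-in for the Redis entries (hypothetical, for illustration)."""

    # Resource hash per session, e.g. {"socket_id": ..., "project_id": ...}
    resources: Dict[SessionKey, Dict[str, str]] = field(default_factory=dict)
    # Expiry time of each "Alive" key; math.inf means "no TTL set"
    alive: Dict[SessionKey, float] = field(default_factory=dict)

    def connect(self, key: SessionKey, socket_id: str) -> None:
        # On (re)connection the "Alive" key exists without a TTL
        self.alive[key] = math.inf
        self.resources.setdefault(key, {})["socket_id"] = socket_id

    def open_project(self, key: SessionKey, project_id: str) -> None:
        self.resources.setdefault(key, {})["project_id"] = project_id

    def disconnect(self, key: SessionKey, now: float) -> None:
        # On disconnection the "Alive" key starts to expire
        self.alive[key] = now + ALIVE_TTL_SECONDS

    def collect_garbage(self, now: float) -> List[str]:
        """Drop sessions whose "Alive" key is gone; return the closed project IDs."""
        closed: List[str] = []
        for key in list(self.resources):
            expires_at = self.alive.get(key)
            if expires_at is not None and now < expires_at:
                continue  # the user is still considered alive
            self.alive.pop(key, None)
            entry = self.resources.pop(key)
            if "project_id" in entry:
                closed.append(entry["project_id"])
        return closed
```

In this model a reconnecting tab simply calls `connect` again, which resets the expiry to "no TTL" and keeps the project open at the next garbage collection pass.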

Shortcomings:

It runs on a schedule only
If several webserver instances are up and running, their garbage collectors might conflict (to be tested)

Observations

Systems which might need resource expiration:

Postgres table entries
MinIO entries
Redis entries
Containers handled by the director

Useful features to have:

it should be possible to create an interdependency between resources, thus creating a "container of heterogeneous types" of resources (effectively grouping them into TTL sessions). [suggestion @sanderegg] => this is done in Redis using the HASH-KEY. redis-commander offers a GUI that shows all the resources currently in use
a TTL session manages the TTL of each individual resource; each resource inherits the session's TTL
a TTL session can have its duration extended
once the TTL session expires, all resources are removed. [suggestion @pcrespov] => or MARKED as expired/obsolete. Perhaps we could refine the concept by specifying whether a resource MUST be removed IMMEDIATELY or whether its removal can be PROCRASTINATED (delayed removal). The latter is very useful under heavy load or when a fast response is needed
each type of resource has its own implementation for handling the removal (effectively deleting itself from the system). [needs more attention] => both @pcrespov and @sanderegg have expressed doubts or have questions; further discussion is required
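
To make the "TTL session" idea easier to discuss, here is a minimal sketch of how such a session could group heterogeneous resources, be extended, and distinguish immediate from delayed removal. All names (`TTLSession`, `Resource`, `Removal`) are hypothetical and only illustrate the features listed above; nothing here reflects existing code.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Removal(Enum):
    IMMEDIATE = "immediate"  # must be removed as soon as the session expires
    DELAYED = "delayed"      # may be marked as expired and removed later


@dataclass
class Resource:
    kind: str        # e.g. "container", "postgres_row", "minio_object"
    identifier: str
    removal: Removal = Removal.IMMEDIATE
    expired: bool = False


@dataclass
class TTLSession:
    expires_at: float
    resources: List[Resource] = field(default_factory=list)

    def add(self, resource: Resource) -> None:
        # The resource carries no TTL of its own; it inherits the session's
        self.resources.append(resource)

    def extend(self, seconds: float) -> None:
        self.expires_at += seconds

    def expire(self, now: float) -> List[Resource]:
        """Return the resources to remove right away; mark the rest as expired."""
        if now < self.expires_at:
            return []
        for resource in self.resources:
            resource.expired = True
        to_remove = [r for r in self.resources if r.removal is Removal.IMMEDIATE]
        # Delayed resources stay in the session until something removes them later
        self.resources = [r for r in self.resources if r.removal is Removal.DELAYED]
        return to_remove
```

How each returned resource actually gets deleted is deliberately left open here, since that is the point still under discussion.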

Motivation

Because resources are generally tied to a user or to some user activity, it makes sense to bundle them in a "TTL session".

Example 1: Consider "GUEST users": their activity can be tracked via a "TTL session" and all their resources could be purged from the system once the user is no longer active.
Example 2: Another use case could be to guarantee that the director's containers do not remain active after the user suddenly disappears. The activity of each user will be bound to a "TTL session". Only the temporary resources (e.g. containers) will be set to have an expiration. The system will take care of cleaning up if something goes wrong, ensuring important resources always get recovered.
Example 3: Create a .csv data exporter (or, in general, a temporary data exporter) which automatically removes the file after a set amount of time. Just add a new type of resource and set the TTL when the exported data file is created. The system takes care of the rest.

Implementation proposal

Create a separate service which runs in its own container and is exposed to internal services via an API. A Python module which interacts directly with the API should also be created. This service is responsible for checking resource TTLs and removing resources when they expire. The service must have access to the systems which must be garbage collected (e.g. Postgres, MinIO, etc.). For each resource type a "destroy" callback must be implemented which actively removes the resource from the system. This is the only piece of code which has to be actively maintained in case of changes.
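
One possible shape for the per-type "destroy" callbacks is a small registry keyed by resource type. The sketch below is only an assumption about how this could look: the decorator `destroys`, the function `remove_expired`, the type names, and the `REMOVED` list (which stands in for real deletions in Postgres or MinIO) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

DestroyCallback = Callable[[str], None]

# Maps a resource type to the callback that removes it from its system
_DESTROY_CALLBACKS: Dict[str, DestroyCallback] = {}


def destroys(resource_type: str) -> Callable[[DestroyCallback], DestroyCallback]:
    """Register the decorated function as the "destroy" callback of a type."""

    def decorator(func: DestroyCallback) -> DestroyCallback:
        _DESTROY_CALLBACKS[resource_type] = func
        return func

    return decorator


REMOVED: List[Tuple[str, str]] = []  # stands in for real side effects


@destroys("postgres_row")
def destroy_postgres_row(resource_id: str) -> None:
    # A real callback would issue a DELETE against Postgres here
    REMOVED.append(("postgres_row", resource_id))


@destroys("minio_object")
def destroy_minio_object(resource_id: str) -> None:
    # A real callback would delete the object from MinIO here
    REMOVED.append(("minio_object", resource_id))


def remove_expired(tracked: Dict[Tuple[str, str], float], now: float) -> None:
    """Destroy every tracked (type, id) whose expiry time has passed."""
    for (resource_type, resource_id), expires_at in list(tracked.items()):
        if expires_at <= now:
            _DESTROY_CALLBACKS[resource_type](resource_id)
            del tracked[(resource_type, resource_id)]
```

With this layout, supporting a new kind of resource would mean adding one decorated function, which matches the claim that the callbacks are the only code needing active maintenance.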

[suggestion @sanderegg] I agree we should have one service for resource management in the mid-term. But this sounds like much more work in terms of refactoring, and we have an issue right now that could be fixed pretty fast.

My suggestion here would be:

first, fix the 2 issues using the current webserver as it is (see my fix suggestions above) - we need these problems resolved
then, create a separate service (which will for sure mean quite a bit of refactoring)

Possible side effects

the application might not be tailored to handle resources which "disappear", but this might help us improve stability and error handling
a new service needs to be maintained as a standalone project
if we are not careful with SQL constraints, some issues might arise, but we can test for those
some code for resource removal might get duplicated, unless it is moved into a common shared library to avoid this scenario

This issue was created from the following document

ITISFoundation / osparc-simcore