InformaticsMatters / react-sci-components

Reusable React components for scientific applications
Apache License 2.0

Data tier MVP requirements #20

Open tdudgeon opened 4 years ago

tdudgeon commented 4 years ago

This issue summarises the requirements of an MVP data tier. This is a server-side application that provides data and services to mini-apps such as the pose viewer.

Mini-apps licensing and security

The data tier is the glue in the mini-apps ecosystem.

Authentication

Use of mini-apps and the data tier should be restricted to people who have logged in. Authentication and Authorisation will be done using the existing Keycloak environment providing SSO across these and other IM apps. Users will be able to register themselves.
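As an illustration of the Keycloak-backed authentication flow, the sketch below decodes the claims segment of a bearer token to identify the logged-in user. This is only a sketch: a real deployment must verify the token's signature against Keycloak's JWKS endpoint (e.g. with the PyJWT or python-jose libraries); the claim name `preferred_username` is the one Keycloak commonly issues, but is an assumption here.

```python
import base64
import json

def decode_jwt_claims(token: str) -> dict:
    """Decode the claims (payload) segment of a JWT.

    WARNING: this does NOT verify the signature. A real deployment must
    validate the token against Keycloak's JWKS endpoint before trusting
    any claim in it.
    """
    payload_b64 = token.split(".")[1]
    # Restore the base64 padding that JWT encoding strips
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))
```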

Data visibility

In the public version data for all users is publicly visible. Any user will be able to get a list of registered users and see and use their datasets. A private instance will allow data access to be restricted so that data can be kept private.

Licensing

The mini-app components will be licensed with a permissive license allowing them to be used freely. The individual mini-app applications will be available under two licenses:

  1. an open license that allows the application to be deployed and used without charge but obligates that all data is public
  2. a commercial license that allows the application to be deployed and used in a public or private setting and allows data to be kept private.

Key functions

User management

  1. A user can register
  2. A user can see a list of other users

Dataset upload

  1. A user can upload a dataset from a local file
  2. A user can load a dataset from an HTTP(S) or FTP URL

Data types (e.g. a media type such as chemical/x-mdl-sdfile) would be determined where possible from the file extension or Content-Type header, but the user can override this. The user can also give the dataset a simple name, a detailed description and zero or more labels that can be used as filters. The creation date and a hash of the data should be recorded. Data is immutable: if it needs to be changed, a new version is created.
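The media-type resolution order described above (user override, then Content-Type header, then file extension) can be sketched as follows. The extension-to-type map is illustrative, not the data tier's actual registry:

```python
import mimetypes

# Hypothetical extension map for chemistry media types; the real
# registry would live in the data tier's configuration.
CHEMICAL_TYPES = {
    ".sdf": "chemical/x-mdl-sdfile",
    ".mol": "chemical/x-mdl-molfile",
    ".smi": "chemical/x-daylight-smiles",
}

def determine_media_type(filename, content_type_header=None, user_override=None):
    """Resolve a dataset's media type: a user override wins, then the
    Content-Type header, then the file extension."""
    if user_override:
        return user_override
    if content_type_header and content_type_header != "application/octet-stream":
        return content_type_header
    for ext, mtype in CHEMICAL_TYPES.items():
        if filename.lower().endswith(ext):
            return mtype
    guessed, _ = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"
```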

Initial data types to be supported:

Dataset fetch

Each dataset has a URL that can be used to fetch the dataset from any mini-app. This includes datasets from other users (assuming the dataset is visible to the user).

A dataset can be fetched in a number of formats that can be requested using the Accept header. For instance, a dataset that was uploaded from an SDF file could be fetched in SDF format (chemical/x-mdl-sdfile) or in Squonk JSON format (application/x-squonk-dataset-molecule), along with its corresponding metadata.
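A minimal sketch of that content negotiation, assuming the server keeps a per-dataset list of the formats it can serve (a `406 Not Acceptable` response when nothing matches follows normal HTTP practice):

```python
def negotiate_format(accept_header: str, available: list) -> str:
    """Pick the first media type from the Accept header that the dataset
    can be served in; '*/*' falls back to the dataset's native format
    (assumed here to be first in the 'available' list)."""
    for part in accept_header.split(","):
        # Ignore quality parameters such as ';q=0.5'
        mtype = part.split(";")[0].strip()
        if mtype == "*/*":
            return available[0]
        if mtype in available:
            return mtype
    raise ValueError("406 Not Acceptable: none of the requested types are available")
```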

Explore datasets

A user can list their datasets, applying filters for data type, date and labels, and can do the same for other users' datasets that are visible to them.

This API should support exploring datasets from any application e.g. another mini-app will want to be able to list and filter the datasets that are available.
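The listing/filtering behaviour can be sketched over plain dataset records; the field names (`media_type`, `labels`, `created`) are illustrative, not the final API schema:

```python
from datetime import date

def filter_datasets(datasets, media_type=None, labels=None, since=None):
    """Filter dataset records (dicts) by media type, required labels
    and creation date. All filters are optional and combine with AND."""
    results = []
    for ds in datasets:
        if media_type and ds["media_type"] != media_type:
            continue
        if labels and not set(labels).issubset(ds.get("labels", [])):
            continue
        if since and ds["created"] < since:
            continue
        results.append(ds)
    return results
```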

Dataset share

In the public version all datasets are visible to all users. In a private environment datasets are private by default, but can be made public or shared with one or more users.
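The visibility rule above reduces to a small predicate. The record fields (`owner`, `public`, `shared_with`) are assumptions about the eventual data model:

```python
def is_visible(dataset, requesting_user, public_deployment=True):
    """Visibility sketch: in the public deployment everything is visible;
    in a private deployment a dataset is visible to its owner, to users
    it has been shared with, or to everyone if explicitly made public."""
    if public_deployment:
        return True
    return (
        dataset["owner"] == requesting_user
        or dataset.get("public", False)
        or requesting_user in dataset.get("shared_with", [])
    )
```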

Services

The data tier should allow execution of services using datasets as input. The result of an execution is a new dataset, with a record of how it was generated.

Exact details need clarifying, but these sorts of services should be possible (though not initially):

  1. Squonk/Pipelines services
  2. Dataset manipulations (e.g. merge/filter datasets probably using Pandas data frames)
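A merge service of the kind mentioned in point 2 could be a thin wrapper over a pandas join. The join key (`smiles` here) is an assumption about the molecule data model:

```python
import pandas as pd

def merge_datasets(left_records, right_records, key="smiles"):
    """Sketch of a dataset-merge service: inner-join two molecule
    datasets (lists of dicts) on a shared key column and return the
    merged records. 'smiles' as the key is an illustrative assumption."""
    left = pd.DataFrame(left_records)
    right = pd.DataFrame(right_records)
    return left.merge(right, on=key, how="inner").to_dict(orient="records")
```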
alanbchristie commented 4 years ago

The requirements for the data tier can be documented here, but if the data tier implementation itself is not going to be publicly accessible, maybe its detailed design discussions should take place under a private mini-apps-data-tier repo in GitLab, where this issue can be cross-referenced?

alanbchristie commented 4 years ago

I've created a branch for this issue as I'd like the API definition to be formalised in an OpenAPI doc, and a very basic version will be submitted soon. The data tier API can (and should) be public - i.e. held in this repo - with the implementation (in another repo) using it.

alanbchristie commented 4 years ago

For the time being the API is emerging as the file data-tier-api/openapi-0.1.yaml (on the branch 20-data-tier-mvp-requirements). As it stands you can...

There's currently no authentication, file type support, quotas etc.

alanbchristie commented 4 years ago

If the user management (quotas, extra attributes we'll need to support in the data tier) cannot be handled by Keycloak user attributes, then management of users is going to be much, much more efficient in a framework like Django. As the model becomes more complex we will need to switch to an alternative framework where authentication and user/data modelling are built in. Django is especially mature, with full model-view-controller support, sophisticated authentication and a built-in management console. So, frameworks for consideration: -

And django's keycloak integration: -

alanbchristie commented 4 years ago

Some thoughts based on the REST framework developed this week...

Although it works we're rapidly entering areas of design and implementation that are much better handled by existing web-app frameworks.

It seems to be becoming clear that we're going to need an additional database backend (i.e. one alongside the Keycloak database). We'll also need to let users provide their own "online" handle (sharing data using the user's Keycloak username or email isn't going to be acceptable), annotate, describe and label their data, and create sharing "circles" that they can use to share data. Doing this via REST alone is possible but painful and (arguably, in a world of frameworks) wrong.

Certainly, writing a new application (from scratch) to provide these user services is less efficient and reliable than using pre-existing frameworks. Django, for example, allows us to build responsive web applications that are secure and provides authentication, database modelling and migrations for free (other similar frameworks exist).

A lightweight REST service is clearly needed (to fetch data) but I suspect the vast majority of management and user interaction might be better handled by building an app around a competent framework like Django.

Summary: -

I must stress this is an instinctive reaction to this week's effort. But we need to pause and understand how the user is going to interact with this mini-apps data service. A management console just seems "cleaner".


A CAUTIONARY NOTE: the Django solution needs to be investigated fully, as although the Keycloak authentication articles look competent, some are 2 to 3 years old. In software evolution terms this is more than a lifetime! The most expensive part of the REST API development so far has been attaching it to the authentication backend; providing a functional service has been relatively rapid compared to that. Everything "appears" to be simple and you can always find an (outdated) article that appears to say "just do this" ... then ... a day and a half later ... finally you can connect Keycloak to your REST API!?

tdudgeon commented 4 years ago

I agree that Django might be a good choice for building the data tier backend, but I don’t see it as a choice between a REST API or Django - Django can power the REST API, and if it gives additional features that we need such as database management then all the better.

For user data I hope that all of this can be stored in Keycloak. I see nothing at present that can't, though it does slightly restrict how we handle this (e.g. we probably have to use roles to handle things we don't want the user to be able to edit).

What is key is that we keep each mini-app lightweight and decoupled as far as possible from other apps. Most if not all mini-apps will need to interact with the data tier, but this should be restricted to:

  1. Getting information about users
  2. Handling files/resources
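That narrow surface could be captured in a small client wrapper that mini-apps share. The endpoint paths (`/users`, `/datasets/{id}`) are hypothetical; the transport function is injected so the HTTP details (bearer token, retries) stay out of the mini-apps themselves:

```python
class DataTierClient:
    """Sketch of the narrow surface a mini-app needs from the data tier:
    user listing and dataset fetch. The 'fetch' callable (url -> body)
    is injected so authentication and transport concerns live in one
    place, keeping each mini-app lightweight and decoupled."""

    def __init__(self, base_url, fetch):
        self.base_url = base_url.rstrip("/")
        self.fetch = fetch

    def list_users(self):
        # Hypothetical endpoint: GET /users
        return self.fetch(f"{self.base_url}/users")

    def get_dataset(self, dataset_id):
        # Hypothetical endpoint: GET /datasets/{id}
        return self.fetch(f"{self.base_url}/datasets/{dataset_id}")
```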