LAAC-LSCP / ChildProject

Python package for the management of day-long recordings of children.
https://childproject.readthedocs.io
MIT License

Possible future development: towards a centralized storage infrastructure ? #262

Open · lucasgautheron opened 2 years ago

lucasgautheron commented 2 years ago

(This is WIP)

Is your feature request related to a problem? Please describe.

The current design is the one described in Managing, storing, and sharing long-form recordings and their annotations.

It can be summed up this way:

There are a few issues that remain unsolved by this design:

  1. Neither of these tools provides the infrastructure to store the data and to process it, which can be challenging for data of this volume.
  2. Although DataLad is great in many respects (especially versioning and reproducibility), it may be too technical for some of the users interested in managing such corpora. And although ChildProject does not require it (ChildProject works as long as the files are structured properly), DataLad is the only solution our design proposes for retrieving and uploading data.
  3. Not all storage backends handle complex permissions well, and DataLad does not cope well with many groups either.

Although there are advantages to decentralization, these limitations call for (at least one) centralized database of daylong recordings.

I'll discuss two alternatives:

  1. A web-based database
  2. A DataLad based approach
lucasgautheron commented 2 years ago

Towards a web-oriented, git-less database

Here is a description of the long term goals.

ChildProject should be able to interact with corpora through different storage backends:

  1. Locally using the current standards (metadata as CSV dataframes), which we'll call the CSV interacting mode
  2. Locally using a database (e.g. sqlite, PgSQL, etc.), i.e. the database interacting mode
  3. Remotely through an API, i.e. the API interacting mode

For instance, the third option would apply to users of a centralized database. Most functionalities should work the same regardless of the storage backend (i.e. one client API for all backends).
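As a rough sketch of what "one client API for all backends" could look like, using the store classes sketched further below (the Project constructor signature and the APIStore class are assumptions for illustration, not existing code):

# Hypothetical usage: the same client code, three different backends.
from sqlalchemy import create_engine

project_csv = Project(CSVStore("/data/my-corpus"))  # CSV interacting mode
project_db = Project(SQLStore(create_engine("sqlite:///corpora.db"), "my-corpus"))  # database mode
project_api = Project(APIStore("https://daylong-db.example.org"))  # API mode (APIStore is hypothetical)

# downstream code would be identical regardless of the backend:
children = project_db.store.get_children()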

The centralized database (let's call it daylong-db) would include two packages:

It should be possible to run processing pipelines remotely too. The API would return a handler which could be used to check the status of the job at any time and to retrieve the results. On the server side, the jobs could be run with Slurm on a local infrastructure or on a cloud computing provider such as AWS.
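From the client side, this could look as follows; the run_pipeline method, the job handler interface, the pipeline name, and the URL are all hypothetical, a sketch of the intent rather than an actual API:

import time

client = APIStore("https://daylong-db.example.org")  # hypothetical client

# submit a job; the server may dispatch it to Slurm or a cloud provider
job = client.run_pipeline("basic-audio-processor", recordings=["rec001.wav"])

while job.status() == "running":  # the handler can be polled at any time
    time.sleep(60)

annotations = job.results()  # retrieve the output once the job is done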

Note that it should be possible to convert corpora from any storage format to any other (e.g. export/import between CSV and a database, etc.).
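Such a conversion could rely entirely on the abstract Store interface sketched below, for instance (a minimal sketch; error handling and the copying of the underlying media files are left out):

def convert(source: Store, destination: Store) -> None:
    # copy metadata row by row through the storage-agnostic interface
    for child in source.get_children().to_dict("records"):
        destination.add_child(child)
    for recording in source.get_recordings().to_dict("records"):
        destination.add_recording(recording)
    destination.add_annotations(source.get_annotations())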

Roadmap

  1. Develop ChildProject's database interacting mode, in which CSVs are replaced with database tables (a possible table layout is sketched after this list)
  2. Design specifications for an API interacting mode
  3. Implement it into ChildProject, ignoring processing pipelines at first
  4. Implement a server capable of serving these requests (i.e. daylong-db-server)
  5. Factor out the pipelines code
  6. Implement processing pipelines into daylong-db (using Slurm to begin with)
  7. Develop a web-based graphical interface for daylong-db
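For step 1, the table layout could mirror the current CSV metadata standards, e.g. with SQLAlchemy; the exact columns and types below are assumptions based on those standards, not a finalized schema:

from sqlalchemy import Column, Date, Integer, MetaData, String, Table

metadata = MetaData()

# one database may hold several corpora, hence the corpus column
children = Table(
    "children", metadata,
    Column("corpus", String, primary_key=True),
    Column("child_id", String, primary_key=True),
    Column("child_dob", Date),
)

recordings = Table(
    "recordings", metadata,
    Column("corpus", String, primary_key=True),
    Column("recording_filename", String, primary_key=True),
    Column("child_id", String),
    Column("date_iso", Date),
    Column("duration", Integer),
)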

Implementation

Stores

A store is an object that can fetch and update data in a given storage backend (local or remote, CSV vs. SQL, etc.).

We'd have one store class per interacting mode (a CSVStore, an SQLStore, and eventually an APIStore), which would all inherit from a Store abstract class, e.g.:

from abc import ABC, abstractmethod
from os.path import join
from typing import List, Optional

import pandas as pd
from sqlalchemy import text
from sqlalchemy.engine import Engine


class Store(ABC):
    """Abstract interface shared by all storage backends."""

    @abstractmethod
    def get_children(self) -> pd.DataFrame:
        pass

    @abstractmethod
    def get_recordings(self) -> pd.DataFrame:
        pass

    @abstractmethod
    def get_annotations(self, sets: Optional[List[str]] = None) -> pd.DataFrame:
        pass

    @abstractmethod
    def add_child(self, child: dict):
        pass

    @abstractmethod
    def update_child(self, child: dict):
        pass

    @abstractmethod
    def delete_child(self, child: str):
        pass

    @abstractmethod
    def add_recording(self, recording: dict):
        pass

    @abstractmethod
    def update_recording(self, recording: dict):
        pass

    @abstractmethod
    def delete_recording(self, recording: str):
        pass

    @abstractmethod
    def add_annotations(self, annotations: pd.DataFrame):
        pass

    @abstractmethod
    def update_annotations(self, annotations: pd.DataFrame):
        pass

    @abstractmethod
    def delete_annotations(self, annotations: pd.DataFrame):
        pass


class CSVStore(Store):
    """Store reading/writing the current CSV-based metadata standards."""

    def __init__(self, path: str):
        super().__init__()
        self.path = path

    def get_children(self) -> pd.DataFrame:
        return pd.read_csv(join(self.path, "metadata/children.csv"))

    # etc.


class SQLStore(Store):
    """Store backed by a SQL database (sqlite, PostgreSQL, etc.)."""

    def __init__(self, engine: Engine, corpus: str):
        super().__init__()
        self.engine = engine
        self.conn = engine.connect()
        self.corpus = corpus

    def get_children(self) -> pd.DataFrame:
        # parameterized query; one database may hold several corpora
        query = text("SELECT * FROM children WHERE corpus = :corpus")
        return pd.read_sql(query, self.conn, params={"corpus": self.corpus})

    # etc.

Project and AnnotationManager would always access or modify the data through their store instance, so that their code does not depend on the choice of store.
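Concretely, Project could receive its store by dependency injection; this is a sketch assuming a store constructor argument that does not exist in the current codebase:

class Project:
    def __init__(self, store: Store):
        self.store = store  # the only place where the backend is chosen

    def get_children(self) -> pd.DataFrame:
        # all reads and writes go through the store; swapping CSVStore for
        # SQLStore (or a future APIStore) requires no change here
        return self.store.get_children()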

Pros

Cons

lucasgautheron commented 2 years ago

https://github.com/G-Node/gogs