LAAC-LSCP / ChildProject

Python package for the management of day-long recordings of children.
https://childproject.readthedocs.io
MIT License

Possible future development: towards a centralized storage infrastructure ? #262

Open · lucasgautheron opened 2 years ago

lucasgautheron commented 2 years ago

(This is WIP)

Is your feature request related to a problem? Please describe.

The current design is the one described in Managing, storing, and sharing long-form recordings and their annotations.

It can be summed up this way:

There are a few issues that remain unsolved by this design:

  1. Neither of these tools provides the infrastructure to store the data and to process it, which can be challenging for data of this volume.
  2. Although DataLad is great in many respects (especially versioning and reproducibility), it may be too technical for some of the users interested in managing such corpora. And although ChildProject does not require it (ChildProject works as long as the files are structured properly), DataLad is the only solution our design proposes for retrieving and uploading data.
  3. Not all storage backends handle complex permissions well, and DataLad does not cope well with many groups either.

Although there are advantages to decentralization, these limitations call for (at least one) centralized database of daylong recordings.

I'll discuss two alternatives:

  1. A web-based database
  2. A DataLad based approach
lucasgautheron commented 2 years ago

Towards a web-oriented, git-less database

Here is a description of the long term goals.

ChildProject should be able to interact with corpora through different storage backends:

  1. Locally using the current standards (metadata as CSV dataframes), which we'll call the CSV interacting mode
  2. Locally using a database (e.g. sqlite, PgSQL, etc.), i.e. the database interacting mode
  3. Remotely through an API, i.e. the API interacting mode

For instance, the third option would apply to users of a centralized database. Most functionalities should work the same regardless of the storage backend (i.e. one client API for all backends).
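As a rough sketch of what "one client API for all backends" could look like, using the store classes sketched further below (the Project constructor signature and the APIStore class are assumptions for illustration, not existing code):

# Hypothetical usage: the same client code, three different backends.
from sqlalchemy import create_engine

project_csv = Project(CSVStore("/data/my-corpus"))  # CSV interacting mode
project_db = Project(SQLStore(create_engine("sqlite:///corpora.db"), "my-corpus"))  # database mode
project_api = Project(APIStore("https://daylong-db.example.org"))  # API mode (APIStore is hypothetical)

# downstream code would be identical regardless of the backend:
children = project_db.store.get_children()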

The centralized database (let's call it daylong-db) would include two packages:

It should be possible to run processing pipelines remotely too. The API would return a handler which could be used to check the status of the job at any time and to retrieve the results. On the server side, the jobs could be run with Slurm on a local infrastructure or on a cloud computing provider such as AWS.
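From the client side, this could look as follows; the run_pipeline method, the job handler interface, the pipeline name, and the URL are all hypothetical, a sketch of the intent rather than an actual API:

import time

client = APIStore("https://daylong-db.example.org")  # hypothetical client

# submit a job; the server may dispatch it to Slurm or a cloud provider
job = client.run_pipeline("basic-audio-processor", recordings=["rec001.wav"])

while job.status() == "running":  # the handler can be polled at any time
    time.sleep(60)

annotations = job.results()  # retrieve the output once the job is done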

Note that it should be possible to convert corpora from any storage format to any other (e.g. export/import between CSV and a database, etc.).
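Such a conversion could rely entirely on the abstract Store interface sketched below, for instance (a minimal sketch; error handling and the copying of the underlying media files are left out):

def convert(source: Store, destination: Store) -> None:
    # copy metadata row by row through the storage-agnostic interface
    for child in source.get_children().to_dict("records"):
        destination.add_child(child)
    for recording in source.get_recordings().to_dict("records"):
        destination.add_recording(recording)
    destination.add_annotations(source.get_annotations())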

Roadmap

  1. Develop ChildProject's database interacting mode, in which CSVs are replaced with database tables (a possible table layout is sketched after this list)
  2. Design specifications for an API interacting mode
  3. Implement it into ChildProject, ignoring processing pipelines at first
  4. Implement a server capable of serving these requests (i.e. daylong-db-server)
  5. Factor out the pipelines code
  6. Implement processing pipelines into daylong-db (using Slurm to begin with)
  7. Develop a web-based graphical interface for daylong-db
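For step 1, the table layout could mirror the current CSV metadata standards, e.g. with SQLAlchemy; the exact columns and types below are assumptions based on those standards, not a finalized schema:

from sqlalchemy import Column, Date, Integer, MetaData, String, Table

metadata = MetaData()

# one database may hold several corpora, hence the corpus column
children = Table(
    "children", metadata,
    Column("corpus", String, primary_key=True),
    Column("child_id", String, primary_key=True),
    Column("child_dob", Date),
)

recordings = Table(
    "recordings", metadata,
    Column("corpus", String, primary_key=True),
    Column("recording_filename", String, primary_key=True),
    Column("child_id", String),
    Column("date_iso", Date),
    Column("duration", Integer),
)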

Implementation

Stores

A store is an object that can fetch and update data in a given storage backend (local or remote, CSV vs. SQL, etc.).

We'd have one store class per interacting mode (a CSVStore, an SQLStore, and eventually an APIStore), which would all inherit from a Store abstract class, e.g.:

from abc import ABC, abstractmethod
from os.path import join
from typing import List, Optional

import pandas as pd
from sqlalchemy import text
from sqlalchemy.engine import Engine


class Store(ABC):
    """Abstract interface shared by all storage backends."""

    @abstractmethod
    def get_children(self) -> pd.DataFrame:
        pass

    @abstractmethod
    def get_recordings(self) -> pd.DataFrame:
        pass

    @abstractmethod
    def get_annotations(self, sets: Optional[List[str]] = None) -> pd.DataFrame:
        pass

    @abstractmethod
    def add_child(self, child: dict):
        pass

    @abstractmethod
    def update_child(self, child: dict):
        pass

    @abstractmethod
    def delete_child(self, child: str):
        pass

    @abstractmethod
    def add_recording(self, recording: dict):
        pass

    @abstractmethod
    def update_recording(self, recording: dict):
        pass

    @abstractmethod
    def delete_recording(self, recording: str):
        pass

    @abstractmethod
    def add_annotations(self, annotations: pd.DataFrame):
        pass

    @abstractmethod
    def update_annotations(self, annotations: pd.DataFrame):
        pass

    @abstractmethod
    def delete_annotations(self, annotations: pd.DataFrame):
        pass


class CSVStore(Store):
    """Store reading/writing the current CSV-based metadata standards."""

    def __init__(self, path: str):
        super().__init__()
        self.path = path

    def get_children(self) -> pd.DataFrame:
        return pd.read_csv(join(self.path, "metadata/children.csv"))

    # etc.


class SQLStore(Store):
    """Store backed by a SQL database (sqlite, PostgreSQL, etc.)."""

    def __init__(self, engine: Engine, corpus: str):
        super().__init__()
        self.engine = engine
        self.conn = engine.connect()
        self.corpus = corpus

    def get_children(self) -> pd.DataFrame:
        # parameterized query; one database may hold several corpora
        query = text("SELECT * FROM children WHERE corpus = :corpus")
        return pd.read_sql(query, self.conn, params={"corpus": self.corpus})

    # etc.

Project and AnnotationManager would always access or modify the data through their store instance, so that their code does not depend on the choice of store.
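Concretely, Project could receive its store by dependency injection; this is a sketch assuming a store constructor argument that does not exist in the current codebase:

class Project:
    def __init__(self, store: Store):
        self.store = store  # the only place where the backend is chosen

    def get_children(self) -> pd.DataFrame:
        # all reads and writes go through the store; swapping CSVStore for
        # SQLStore (or a future APIStore) requires no change here
        return self.store.get_children()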

Pros

Cons

lucasgautheron commented 2 years ago

https://github.com/G-Node/gogs