lucasgautheron opened this issue 2 years ago
Here is a description of the long term goals.
ChildProject should be able to interact with corpora through different storage backends:
For instance, the third option would apply to users of a centralized database. Most functionality should work the same regardless of the backend (i.e. one client API for all storages).
The centralized database (let's call it daylong-db) would include two packages:
It should also be possible to run processing pipelines remotely. The API would return a handler which could be used to check the status of the job at any time and to retrieve its results. On the server side, jobs could be run with slurm on a local infrastructure, or on a cloud computing provider such as AWS.
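To make the idea concrete, here is a minimal sketch of what such a handler could look like. The names (`submit_job`, `JobHandle`) are illustrative, not an existing ChildProject API, and the "remote" executor is simulated with a local thread pool:

```python
from concurrent.futures import ThreadPoolExecutor, Future

class JobHandle:
    """Illustrative handle for a remotely submitted processing job."""
    def __init__(self, future: Future):
        self._future = future

    def status(self) -> str:
        # a real client would poll the server (e.g. a slurm queue) here
        if self._future.running():
            return "running"
        return "done" if self._future.done() else "pending"

    def result(self, timeout=None):
        # blocks until the job finishes, then returns its output
        return self._future.result(timeout=timeout)

_executor = ThreadPoolExecutor(max_workers=2)

def submit_job(pipeline, *args) -> JobHandle:
    """Submit a pipeline run; a real server would enqueue a slurm/AWS job instead."""
    return JobHandle(_executor.submit(pipeline, *args))

# usage: submit a dummy "pipeline" and wait for its result
handle = submit_job(lambda x: x * 2, 21)
print(handle.result())  # 42
```

The point of the handle abstraction is that the client code is identical whether the job runs locally, on slurm, or on AWS.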
Note that it should be possible to convert corpora from any storage format to any other (e.g. export/import from CSV to DB etc.)
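Conversion then falls out of the common Store interface: read everything from the source store and write it into the destination. A minimal sketch, where the in-memory `DictStore` is a toy stand-in for any real Store implementation (CSVStore, SQLStore, etc.):

```python
import pandas as pd

class DictStore:
    """Toy in-memory store standing in for any Store implementation."""
    def __init__(self):
        self.children, self.recordings = [], []
        self.annotations = pd.DataFrame()

    def get_children(self): return list(self.children)
    def get_recordings(self): return list(self.recordings)
    def get_annotations(self): return self.annotations.copy()
    def add_child(self, child): self.children.append(child)
    def add_recording(self, rec): self.recordings.append(rec)
    def add_annotations(self, ann):
        self.annotations = pd.concat([self.annotations, ann], ignore_index=True)

def convert(src, dst):
    """Copy a corpus from one store to another through the shared interface."""
    for child in src.get_children():
        dst.add_child(child)
    for rec in src.get_recordings():
        dst.add_recording(rec)
    dst.add_annotations(src.get_annotations())

# usage: export from one backend into another
a, b = DictStore(), DictStore()
a.add_child({"child_id": "c1"})
a.add_annotations(pd.DataFrame({"set": ["vtc"], "onset": [0]}))
convert(a, b)
```

Because `convert` only uses the abstract interface, any pair of backends can be plugged in (CSV → DB, DB → CSV, ...).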
A store is an object that can fetch and update data from a given storage backend (local or remote, CSV vs. SQL, etc.).
We'd have several such stores (e.g. CSVStore, SQLStore), which would all inherit from a Store abstract class, e.g.:
```python
from abc import ABC, abstractmethod
from typing import List, Optional

import pandas as pd

class Store(ABC):
    def __init__(self):
        pass

    @abstractmethod
    def get_children(self):
        pass

    @abstractmethod
    def get_recordings(self):
        pass

    @abstractmethod
    def get_annotations(self, sets: Optional[List[str]] = None):
        pass

    @abstractmethod
    def add_child(self, child: dict):
        pass

    @abstractmethod
    def update_child(self, child: dict):
        pass

    @abstractmethod
    def delete_child(self, child: str):
        pass

    @abstractmethod
    def add_recording(self, recording: dict):
        pass

    @abstractmethod
    def update_recording(self, recording: dict):
        pass

    @abstractmethod
    def delete_recording(self, recording: str):
        pass

    @abstractmethod
    def add_annotations(self, annotations: pd.DataFrame):
        pass

    @abstractmethod
    def update_annotation(self, annotations: pd.DataFrame):
        pass

    @abstractmethod
    def delete_annotations(self, annotations: pd.DataFrame):
        pass
```
```python
from os.path import join

import pandas as pd

class CSVStore(Store):
    def __init__(self, path):
        super().__init__()
        self.path = path

    def get_children(self):
        children = pd.read_csv(join(self.path, 'metadata/children.csv'))
        return children

    # etc.
```
```python
import pandas as pd
from sqlalchemy import text
from sqlalchemy.engine import Engine

class SQLStore(Store):
    def __init__(self, engine: Engine, corpus: str):
        super().__init__()
        self.engine = engine
        self.conn = engine.connect()
        self.corpus = corpus

    def get_children(self):
        # query construction is illustrative; the schema is not settled yet
        query = text("SELECT * FROM children WHERE corpus = :corpus")
        children = pd.read_sql(query, self.conn, params={"corpus": self.corpus})
        return children

    # etc.
```
`Project` and `AnnotationManager` always use their store instance to access or modify the data, such that their code does not depend on the choice of the store.
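A sketch of what this delegation could look like; the constructor signature and `FakeStore` are assumptions for illustration, not the current ChildProject API:

```python
class Project:
    """Illustrative: all data access goes through the injected store."""
    def __init__(self, store):
        self.store = store  # any Store implementation: CSV, SQL, remote API...

    def children_count(self):
        # Project never touches files or SQL directly
        return len(self.store.get_children())

class FakeStore:
    """Minimal stand-in store for demonstration."""
    def get_children(self):
        return [{"child_id": "c1"}, {"child_id": "c2"}]

project = Project(FakeStore())
print(project.children_count())  # 2
```

Injecting the store through the constructor is what lets the same `Project` code run against a local CSV corpus or a remote database without modification.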
(This is WIP)
Is your feature request related to a problem? Please describe.
The current design is the one described in *Managing, storing, and sharing long-form recordings and their annotations*.
It can be summed up this way:
There are a few issues that remain unsolved by this design:
Although there are advantages to decentralization, these limitations call for (at least one) centralized database of daylong recordings.
I'll discuss two alternatives: