Proposition
[ ] Replace the external InfluxDB with a SQLite database created inside the run directory.
[ ] Directly write the tables generated by tasks to it, and make subsequent tasks query the database instead of using the tables (see the sketch after this list).
[ ] Remove hardcoded and broadcasted filepaths used for passing down tables and binary/mzML files. Write output filepaths directly into the database instead. Would better respect the DRY principle.
[ ] Implement visualization by directly querying the SQLite database. For example, a Streamlit app that uses SQLAlchemy connectors (?) to find/select the local database, then lets the user specify a time period and does the query + viz based on that.
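A minimal sketch of what the run-local database could look like. Everything here (the `run.sqlite` filename, the `measurements` table, its columns) is a placeholder, not a decided schema:

```python
import sqlite3
from pathlib import Path

# Hypothetical run directory; the database lives next to the task outputs.
run_dir = Path("results/run_2024_01_01")
run_dir.mkdir(parents=True, exist_ok=True)
con = sqlite3.connect(run_dir / "run.sqlite")

con.execute(
    """
    CREATE TABLE IF NOT EXISTS measurements (
        sample_id     TEXT NOT NULL,
        timestamp     TEXT NOT NULL,  -- ISO 8601; SQLite has no native datetime type
        metabolite    TEXT NOT NULL,
        concentration REAL,
        PRIMARY KEY (sample_id, timestamp, metabolite)
    )
    """
)

# A task writes its output table directly instead of dumping a .csv:
rows = [("S1", "2024-01-01T00:00:00", "glucose", 12.3)]
con.executemany("INSERT OR REPLACE INTO measurements VALUES (?, ?, ?, ?)", rows)
con.commit()

# A downstream task queries the database instead of re-reading the .csv:
cur = con.execute(
    "SELECT metabolite, concentration FROM measurements WHERE sample_id = ?",
    ("S1",),
)
print(cur.fetchall())
con.close()
```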
This approach has many advantages:
Profit from the gargantuan number of integrations/libraries for SQLite / SQL databases.
Data stored in only one place.
Apps exist for exploring SQLite databases in a GUI.
A lot more future-proof than InfluxDB, which is open source but company-managed.
More flexible visualization, even if the Streamlit app has to be designed from scratch (see the sketch below).
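A rough sketch of such a Streamlit app, assuming the hypothetical `measurements` table from the sketch above and a recent Streamlit version (for the `x`/`y`/`color` arguments of `st.line_chart`); a SQLAlchemy engine could replace the raw `sqlite3` connection:

```python
import sqlite3
from datetime import date
from pathlib import Path

import pandas as pd
import streamlit as st

# Let the user find/select the local run database (path is illustrative).
db_path = st.text_input("Run database", "results/run_2024_01_01/run.sqlite")

if Path(db_path).exists():
    # User-specified time period for the query + viz.
    period = st.date_input("Time period", value=(date(2024, 1, 1), date(2024, 1, 2)))
    if len(period) == 2:
        start, end = period
        con = sqlite3.connect(db_path)
        df = pd.read_sql_query(
            "SELECT timestamp, metabolite, concentration "
            "FROM measurements WHERE timestamp BETWEEN ? AND ?",
            con,
            params=(start.isoformat(), end.isoformat()),
        )
        con.close()
        st.line_chart(df, x="timestamp", y="concentration", color="metabolite")
else:
    st.warning("No database found at that path.")
```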
Context
InfluxDB was chosen as a database for (intermediary) results for a few reasons:
Designed for time-series data (which is our case),
Easy visualization of data in real-time,
Right now, the workflow uses .csv tables for intermediary results when no bioinformatics format exists, i.e. everything other than .mzML. These tables are then formatted for ingestion by the InfluxDB Python API. The data on InfluxDB is not queried back by the workflow, which uses the .csv tables to pass down data between tasks.
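For reference, that write-only pattern looks roughly like this (a sketch using the influxdb-client package; the URL, token, bucket, and column names are placeholders):

```python
import pandas as pd
from influxdb_client import InfluxDBClient
from influxdb_client.client.write_api import SYNCHRONOUS

# A task output, reloaded from the same .csv that is passed to the next task.
df = pd.read_csv("concentrations.csv", parse_dates=["timestamp"]).set_index("timestamp")

# Write-only: nothing downstream ever queries this data back.
with InfluxDBClient(url="http://localhost:8086", token="...", org="my-org") as client:
    write_api = client.write_api(write_options=SYNCHRONOUS)
    write_api.write(
        bucket="workflow-results",
        record=df,
        data_frame_measurement_name="concentrations",
        data_frame_tag_columns=["sample_id"],
    )
```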
plus: InfluxDB is an optional dependency, used only for visualization (and archiving results).
minus:
What's the point of using a db if you never query it?
There are 2 copies of the results: one in the workflow run, one in the external DB. Cleaning the workflow run doesn't directly remove the results on InfluxDB.
Issues
InfluxDB adds complexity to the project. As embedding an InfluxDB database/daemon inside workflow runs would be impractical, the workflow has to connect to a global InfluxDB instance, which can be locally running or cloud-based. The user has to link it themselves. It's not that bad, makes it microservices-like (lol). But still, complexity.
InfluxDB's visualization tools are not as powerful as initially thought. Doing complex plots would require adding Grafana to the whole stack (complexity++).
Querying a DB becomes the evident answer for implementing the fluxes modelling task, which depends on the concentrations of all previous timesteps / the ones in a specified memory window. With the current implementation, we'd have to merge the .csv of all / the n last timesteps, as InfluxDB can't be assumed to be there (see the sketch below).
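With the run-local SQLite database, that memory window reduces to a single query. A sketch against the hypothetical `measurements` table from above, where `n` is the number of latest timesteps to keep:

```python
import sqlite3


def concentrations_window(db_path: str, sample_id: str, n: int) -> list[tuple]:
    """Fetch the concentrations of the n most recent timesteps for one sample,
    instead of merging the .csv files of the n last timesteps by hand.
    Table/column names are placeholders from the earlier sketch."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        """
        SELECT timestamp, metabolite, concentration
        FROM measurements
        WHERE sample_id = :sid
          AND timestamp IN (
              SELECT DISTINCT timestamp FROM measurements
              WHERE sample_id = :sid
              ORDER BY timestamp DESC
              LIMIT :n
          )
        ORDER BY timestamp
        """,
        {"sid": sample_id, "n": n},
    ).fetchall()
    con.close()
    return rows
```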
Limitations
No concurrency on write operations; they are queued (but do we really need concurrency?)
SQLite is not designed for many writes or for datetime indexing/querying. Stress-testing should be a priority to determine if this solution is scalable (two standard mitigations are sketched after this list).
There would be extensive usage of composite primary keys, which might not be ideal (slower queries).
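Two standard SQLite knobs worth evaluating during that stress-testing; whether they are enough here is exactly what the testing should answer:

```python
import sqlite3

con = sqlite3.connect("results/run_2024_01_01/run.sqlite")

# WAL mode lets readers proceed while a writer holds the database,
# softening the "writes are queued" limitation for read-heavy tasks.
con.execute("PRAGMA journal_mode=WAL")

# A dedicated index on the datetime column avoids leaning on the wide
# composite primary key for time-window queries.
con.execute(
    "CREATE INDEX IF NOT EXISTS idx_measurements_ts ON measurements (timestamp)"
)
con.commit()
con.close()
```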