Maybe improve our database

FynnBe commented 2 months ago

Our database is now a series of files on S3. So far this is sufficient and the minio (python) client (and our Client wrapper around it) allow for convenient access and inspection of our database. We might want to look into more standard approaches, so this issue serves as a place to take notes and discuss this eventually.

idea1: We could create an index DB for our collection: https://aws.amazon.com/de/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/

FynnBe commented 2 months ago

SQLModel might be another fit for an index DB: https://github.com/tiangolo/sqlmodel

oeway commented 2 months ago

Databases are useful when we have complex queries, such as finding all the datasets linked to model A that also apply to model B. Right now, it is enough to just use the S3 files. The summary files are essentially there for search from the website, and we have a clear submission and publishing workflow in which each step is distinct. So we won't really need a dedicated database.

Separate files on S3 and a single database file are two different approaches. For now I would stick with S3, since it's much easier to change individual files without impacting all the records, while editing database files is much less straightforward and requires more attention to backing up the database, migrating it, etc.

If we have both S3 files and a database, we are creating two sources of truth. If we end up needing a database, e.g. to create a hypha service for advanced model search, I would build the database on the fly from the S3 files and use S3 as the source of truth.
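A minimal sketch of what "building the database on the fly" could look like, assuming each S3 file is a small JSON document. The field names `id` and `models`, the sample file names, and both helper functions are hypothetical and only serve the illustration; in practice the JSON texts would come from the S3 objects rather than from a dict.

```python
import json
import sqlite3

# Stand-in for the JSON files on S3 (hypothetical names and fields).
SAMPLE_FILES = {
    "dataset-1.json": '{"id": "dataset-1", "models": ["model-a", "model-b"]}',
    "dataset-2.json": '{"id": "dataset-2", "models": ["model-a"]}',
    "dataset-3.json": '{"id": "dataset-3", "models": ["model-b"]}',
}


def build_index(json_texts):
    """Build a throwaway in-memory SQLite index from JSON documents."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE links (dataset TEXT, model TEXT)")
    for text in json_texts:
        entry = json.loads(text)
        # Assumption: every entry has an "id" and an optional "models" list.
        for model in entry.get("models", []):
            con.execute("INSERT INTO links VALUES (?, ?)", (entry["id"], model))
    con.commit()
    return con


def datasets_linked_to_both(con, model_a, model_b):
    """Return the datasets linked to model_a that are also linked to model_b."""
    rows = con.execute(
        "SELECT dataset FROM links WHERE model = ? "
        "INTERSECT "
        "SELECT dataset FROM links WHERE model = ? "
        "ORDER BY dataset",
        (model_a, model_b),
    )
    return [row[0] for row in rows]


index = build_index(SAMPLE_FILES.values())
print(datasets_linked_to_both(index, "model-a", "model-b"))
```

Since the index lives in memory and is rebuilt from the files each time, the S3 files stay the only source of truth and there is nothing to back up or migrate.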

Plus, we don't really need a dedicated database, since S3 also supports SQL syntax for searching over JSON files. See S3 Select: https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html
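As a rough sketch of such a query, assuming a boto3-style S3 client is passed in (the thread itself uses the minio client, and whether a given S3 endpoint supports S3 Select would need checking). The helper name `select_records`, the bucket name, and the query in the usage note are hypothetical.

```python
import json


def select_records(client, bucket, key, expression):
    """Run an S3 Select SQL expression on a JSON object and return the matches."""
    response = client.select_object_content(
        Bucket=bucket,
        Key=key,
        Expression=expression,
        ExpressionType="SQL",
        InputSerialization={"JSON": {"Type": "DOCUMENT"}},
        OutputSerialization={"JSON": {}},
    )
    chunks = []
    for event in response["Payload"]:
        # Only "Records" events carry result bytes; other events are skipped.
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # A record may be split across events, so join the bytes before decoding.
    text = b"".join(chunks).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]
```

Usage might look like `select_records(client, "my-bucket", "collection.json", "SELECT s.id FROM S3Object[*].collection[*] s WHERE s.type = 'model'")`, where the object layout behind that expression is an assumption about collection.json.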

FynnBe commented 2 months ago

Creating an "index database" on the fly was more what I had in mind... but this should be left for future optimization in any case. I suppose in the long term we could replace the collection.json with such a light database, but we should be fine with the JSON for quite some time 👍

bioimage-io / collection

Maybe improve our database #52