COSIMA / cosima-cookbook

Framework for indexing and querying ocean-sea ice model output.
https://cosima-recipes.readthedocs.io/en/latest/
Apache License 2.0

One database for every experiment #309

Open aidanheerdegen opened 1 year ago

aidanheerdegen commented 1 year ago

What about moving to a model where every experiment has its own DB file? The central DB would query each individual DB for the information it needed for explorer functions.

Whaddya reckon @angus-g @rbeucher @micaeljtoliveira

(Yes I did think about putting this on the forum, and could still do so, but thought it was quite technical)

rbeucher commented 1 year ago

I was approaching this differently, but I haven't given it much thought to be honest. The main issue, in my opinion, is that the COSIMA Cookbook reindexes everything periodically when we only want to add new experiments.

I have not looked in detail, but FREVA (Germany) uses a combination of an SQL database (MariaDB) and an enterprise search service like Solr. Solr has some capabilities to do live indexing. Could be worth looking into that.

rbeucher commented 1 year ago

I think one database is still the way to go. Indexing is the issue. Breaking it into multiple DBs introduces as many new points of failure.

aidanheerdegen commented 1 year ago

What are you trying to solve by doing this?

This approach is not only embarrassingly parallel, but also potentially asynchronous: when syncing new data to /g/data the same script can then run a command to reindex the data.

It is a "separation of concerns" argument: each experiment is responsible for keeping its DB up to date. Hence it is modular and usable in different ways: access a single experiment DB directly, or discover it via a database of databases.

Potentially this approach also puts less stress on the Lustre metadata servers (but I know very little about Lustre). Finding all new files to index by doing this

```shell
find . -newer index.db
```

in a single experiment directory is asking a lot less of the filesystem than doing the same thing over many subdirectories.
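The incremental step above can be sketched end-to-end. This is a self-contained illustration only: the directory layout and the idea of passing the result to an indexer are assumptions, not the actual cookbook CLI.

```shell
# Sketch of per-experiment incremental indexing (hypothetical layout).
# Each experiment directory carries its own index.db; only files newer
# than the index need to be (re)indexed.
expt=$(mktemp -d)
touch "$expt/index.db"
sleep 1                        # ensure a later mtime for the new output
touch "$expt/output000.nc"     # freshly synced model output

# Files modified since the index was last written:
new_files=$(find "$expt" -newer "$expt/index.db" -type f)
echo "$new_files"
# Each file in $new_files would then be handed to the indexer,
# and index.db rewritten (so its mtime advances) afterwards.
```

Because `-newer` compares against a single reference file inside the experiment, the walk never leaves that one directory tree.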

This approach might also scale better. It would be possible to have multiple meta-DBs that contain different collections of data, but there can be overlap between them. e.g. Antarctic datasets, BGC datasets. This could be seen as a negative, with balkanisation of data, but as the number of experiments grows it might not be feasible to have them all in a single DB.

Breaking it into multiple DBs introduces as many new points of failure.

It is possible to see it the other way around: a single point of failure is bad, and multiple separate DBs are potentially a good thing. If one DB gets corrupted, it's no big deal to reindex it.

Some drawbacks to one DB per experiment:

Duplication

Meta-data is duplicated across databases, e.g. variable names, attributes, etc.

Version synchronisation

The database schema can be updated, which would require tools to either robustly support old versions (potentially a lot of work), or all the individual DBs would have to be updated. This is not nice.
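One way to at least detect the version-synchronisation problem is to stamp each experiment DB with a schema version that tools check before reading. A minimal sketch using SQLite's built-in `user_version` pragma; the version number and function names are hypothetical, not the cookbook's actual scheme:

```python
import os
import sqlite3
import tempfile

SUPPORTED_SCHEMA = 3  # hypothetical current schema version

def write_schema_version(path, version):
    """Stamp an experiment DB with the schema version it was built with."""
    conn = sqlite3.connect(path)
    conn.execute(f"PRAGMA user_version = {int(version)}")
    conn.close()

def schema_ok(path):
    """True if this experiment DB matches the schema the tools support."""
    conn = sqlite3.connect(path)
    (version,) = conn.execute("PRAGMA user_version").fetchone()
    conn.close()
    return version == SUPPORTED_SCHEMA

db = os.path.join(tempfile.mkdtemp(), "experiment.db")
write_schema_version(db, SUPPORTED_SCHEMA)
print(schema_ok(db))  # True
```

A mismatch could then trigger either a per-experiment reindex or an in-place migration, rather than silently misreading an old layout.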

micaeljtoliveira commented 1 year ago

Sounds like you are trying to solve the problem of sqlite not scaling well nor allowing concurrent access by using many sqlite databases. Sure, it could work, but it does look like an ugly hack if you consider that there are other SQL implementations out there that don't have these issues... Maybe it would be time to have a serious discussion with NCI about this?

angus-g commented 1 year ago

For sqlite concurrent access, has anybody enabled write-ahead logging? Particularly because it's only reader/writer concurrency we care about:

WAL provides more concurrency as readers do not block writers and a writer does not block readers. Reading and writing can proceed concurrently.
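Switching an SQLite file to WAL is a one-line pragma; a minimal sketch (the filename is illustrative). Note that SQLite's own documentation restricts WAL to local filesystems, which is the caveat raised later in this thread:

```python
import os
import sqlite3
import tempfile

# Sketch: turning on write-ahead logging for an SQLite index file.
# The mode is persistent: it sticks to the database file, so later
# connections inherit it without re-running the pragma.
db = os.path.join(tempfile.mkdtemp(), "index.db")
conn = sqlite3.connect(db)
(mode,) = conn.execute("PRAGMA journal_mode=WAL").fetchone()
print(mode)  # "wal" on a filesystem that supports it
conn.close()
```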

aidanheerdegen commented 1 year ago

Sounds like you are trying to solve the problem of sqlite not scaling well nor allowing concurrent access by using many sqlite databases. Sure, it could work, but it does look like an ugly hack if you consider that there are other SQL implementations out there that don't have these issues...

Sort of, but once I thought about it I realised there are other things going for it. As I said above. I like the idea of experiments updating their own DB. It has some graceful degradation: the main DB is not required to access a specific experiment if the location of the experiment is known.

There is also a pretty horrendous name-space issue already happening, where experiments have increasingly convoluted names for uniqueness. Clearly some of that is from payu, but even if that weren't the case, it would be necessary for indexing into a single monolithic DB. I'd brought up the idea of "collections" previously: another level of organisation on top of experiments. They are a common idea in data catalogues. The different index-DBs naturally map to the idea of a collection.

I honestly think this might be a better design, one that also happens to help with scaling and indexing.

From an ACCESS-NRI point of view it might be desirable to generate a meta-DB/index-DB that only contains "supported" or published datasets, whereas the COSIMA community will want to have the bleeding edge stuff as well.

Equally there may be subject-specific index-DBs, e.g. land surface, CMIP6. You can have datasets appearing in more than one index-DB; there is virtually no cost to this once each experiment has its own index.

I'm not insisting subject-specific index-DBs are necessarily the way to go, but you could not even think of doing this with the current approach. It would just be too cumbersome to index an experiment more than once, and it might get the filesystem folks at NCI quite upset.

Maybe it would be time to have a serious discussion with NCI about this?

Sure. They're already talking about some indexing work, and @rbeucher had a meeting with them about this. Probably best to discuss offline, I would say.

For sqlite concurrent access, has anybody enabled write-ahead logging?

Yeah I did look into that, but I think:

  1. All processes using a database must be on the same host computer; WAL does not work over a network filesystem.

is a show-stopper, but happy to be set straight if I've got the wrong end of the stick.

aidanheerdegen commented 1 year ago

I don't want to keep banging on, but with the approach I'm advocating in this issue an error such as https://github.com/COSIMA/master_index/pull/30 no longer stuffs up indexing for other experiments, as the error is contained to the problematic experiment.

Just sayin'

rbeucher commented 1 year ago

I don't like it... I second @micaeljtoliveira calling it an ugly hack... ;-) Let's explore a bit more...

aidanheerdegen commented 1 year ago

I second @micaeljtoliveira calling it an ugly hack... ;-)

I am feeling bullied :p

micaeljtoliveira commented 1 year ago

I don't want to keep banging on, but with the approach I'm advocating in this issue an error such as https://github.com/COSIMA/master_index/pull/30 no longer stuffs up indexing for other experiments, as the error is contained to the problematic experiment.

Glad you mentioned it. In this case I would actually argue that the underlying problem lies with the indexing script, not the database. In my opinion the indexing script should not return an error code in the first place when failing to index a file, but should instead return a warning. It should only return an error for something that could compromise the integrity of the database, like losing the connection to the database while updating it.

aidanheerdegen commented 1 year ago

In this case I would actually argue that the underlying problem lies with the indexing script, not the database. In my opinion the indexing script should not return an error code in the first place when failing to index a file, but should instead return a warning. It should only return an error for something that could compromise the integrity of the database, like losing the connection to the database while updating it.

Agreed. If we're throwing around the epithet "ugly hack" the original shell script for this qualifies in spades (to be clear, I am referring to the script I wrote).

Hasn't changed my mind about separately indexing experiments though :)

aidanheerdegen commented 1 year ago

In case it isn't obvious, you could also add a DB update step to the sync script. Each experiment DB is then only updated when the contents of the directory are updated.

The Cookbook DB script then just has to scan a number (50-200?) of experiment DBs and check which have been updated since the last time they were slurped in.

That is a significant reduction in disk access.
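That scan reduces to one `stat` per experiment DB. A hypothetical sketch of the meta-DB update pass, re-ingesting only experiment DBs whose files changed since the last slurp; the paths, function name, and timestamp bookkeeping are illustrative, not the actual cookbook code:

```python
import os
import tempfile

def stale_experiment_dbs(db_paths, last_ingest_time):
    """Experiment DBs modified since the meta-DB last ingested them."""
    return [p for p in db_paths if os.path.getmtime(p) > last_ingest_time]

# Tiny demo with two fake experiment DBs and forced timestamps:
tmp = tempfile.mkdtemp()
old_db = os.path.join(tmp, "expt_old.db")
new_db = os.path.join(tmp, "expt_new.db")
for path, mtime in ((old_db, 100), (new_db, 200)):
    open(path, "w").close()
    os.utime(path, (mtime, mtime))

print(stale_experiment_dbs([old_db, new_db], 150))  # only expt_new.db
```

Only the returned paths need to be opened and slurped; everything else is skipped on the basis of a single metadata lookup.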

aidanheerdegen commented 1 year ago

Also can't fathom why I hadn't invoked this before


rmholmes commented 1 year ago

Another advantage of the database-per-experiment approach might be that people could use it more easily for their personal simulations (that they don't necessarily plan to share with others).

aidanheerdegen commented 1 year ago

Another advantage of the database-per-experiment approach might be that people could use it more easily for their personal simulations (that they don't necessarily plan to share with others).

I am feeling some momentum building .. :)

adele-morrison commented 1 year ago

I haven't weighed in, because I don't feel qualified. But if you want a non-expert opinion, it sounds like a great solution to me, Aidan! This has been an ongoing issue for some time now, and this approach of multiple databases is very easy to implement and could fix our problems tomorrow. I often have my own databases anyway, because of new simulations that aren't in the master DB and issues with the existing database not updating or being available.


aekiss commented 1 year ago

Will querying a database of databases slow down the explorer?

access-hive-bot commented 1 year ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/access-nri-postgres-db-for-capturing-provenance-usage-metrics-and-cosima-cookbook-index/153/7

aidanheerdegen commented 1 year ago

Will querying a database of databases slow down the explorer?

My assumption was that it would speed it up, as I'm proposing the DBDB/index would contain significantly less information than the current DB does. It would only have enough info to help users decide which experiment to use. This is pretty much how the DB explorer currently works: it creates a mapping from variable to experiment, to enable users to decide which experiment to query.
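That variable-to-experiment mapping is cheap to rebuild from per-experiment DBs. An illustrative sketch only: the `variables` table and its `name` column are assumptions for this example, not the actual cookbook schema.

```python
import os
import sqlite3
import tempfile

def build_variable_map(experiment_dbs):
    """Map variable name -> experiments providing it, one small query per DB."""
    varmap = {}
    for expt, path in experiment_dbs.items():
        conn = sqlite3.connect(path)
        for (name,) in conn.execute("SELECT DISTINCT name FROM variables"):
            varmap.setdefault(name, []).append(expt)
        conn.close()
    return varmap

# Demo: two fake experiment DBs sharing one variable.
tmp = tempfile.mkdtemp()
dbs = {}
for expt, names in (("expt_a", ["temp", "salt"]), ("expt_b", ["temp"])):
    path = os.path.join(tmp, f"{expt}.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE variables (name TEXT)")
    conn.executemany("INSERT INTO variables VALUES (?)", [(n,) for n in names])
    conn.commit()
    conn.close()
    dbs[expt] = path

print(build_variable_map(dbs))
```

The meta-DB would persist this mapping rather than recompute it per query; the point is that the explorer-level summary never needs the full per-file detail held in each experiment DB.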

dougiesquire commented 1 year ago

I think this approach is somewhat like what is used for the Pangeo Intake catalogue. They have a master catalogue that references other catalogues of "climate" (CMIP), "oceanographic" etc datasets.

rbeucher commented 1 year ago

I have looked at using the COSIMA-Cookbook to index the ACCESS-ESM/CM experiments. That can work but it is very slow and inefficient.

I spent a bit of time looking at the Lustre architecture and found this article: https://people.cs.vt.edu/~butta/docs/CCGrid2020-BRINDEXER.pdf

This is about file metadata indexing, but it has some good ideas that are somewhat related to the idea of having multiple databases. I would probably not go for a one-database-per-experiment design as suggested by @aidanheerdegen, but database sharding using a simple RDBMS like SQLite could work. It does not look too complicated to set up.

There is some information about sharding in the SQLAlchemy documentation: https://docs.sqlalchemy.org/en/14/_modules/examples/sharding/separate_databases.html
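For the SQLite flavour of this, `ATTACH` is one simple way to query several shard files through a single connection. A sketch with hypothetical file and table names, not the SQLAlchemy machinery from the link above:

```python
import os
import sqlite3
import tempfile

# Sketch: combining rows from two shard databases via ATTACH.
tmp = tempfile.mkdtemp()
for shard, rows in (("shard_a.db", ["expt1"]), ("shard_b.db", ["expt2"])):
    conn = sqlite3.connect(os.path.join(tmp, shard))
    conn.execute("CREATE TABLE experiments (name TEXT)")
    conn.executemany("INSERT INTO experiments VALUES (?)", [(r,) for r in rows])
    conn.commit()
    conn.close()

# One connection sees both shards under separate schema names:
conn = sqlite3.connect(":memory:")
conn.execute(f"ATTACH DATABASE '{os.path.join(tmp, 'shard_a.db')}' AS a")
conn.execute(f"ATTACH DATABASE '{os.path.join(tmp, 'shard_b.db')}' AS b")
names = [n for (n,) in conn.execute(
    "SELECT name FROM a.experiments UNION ALL SELECT name FROM b.experiments")]
print(names)
conn.close()
```

SQLite caps the number of attached databases (default 10), so a real sharded design would route queries to shards rather than attach everything at once.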

Anyway, indexing is a priority for the MED Team, so we will definitely look into this.

aidanheerdegen commented 1 year ago

I would probably not go for a one-database-per-experiment design as suggested by @aidanheerdegen

:|

rbeucher commented 1 year ago

You know what I mean 🤪