nocollier opened this issue 10 months ago
I like the idea of maintaining a SQLite database! I think this offers the features we need and satisfies the requirements here.
I have restructured the queries that get file information, and they should not take nearly as long as before. Let's hold off on this until we see how much of a pain point it remains.
You could also consider just using a cache instead of a full-blown database, e.g. `requests_cache` is used by esgf-pyclient to cache searches. That also speeds up the searches that go over the internet, takes very little effort to maintain, and if it becomes corrupted for some reason it can just be deleted.
Thanks for this suggestion! Will take a look.
Currently we do not populate a local database when files are downloaded. Once `to_dataset_dict()` is called, we query the index node for file information and use the `directory_format_template_` and other information in the response to build up a local path. Essentially we are using the remote index as our database, which is only useful for querying if a specific file is present.

I chose this originally because no database means nothing to keep clean, nothing to maintain, no additional complexity. As downloads fail or are canceled by the user, the database could become corrupt, and we simply avoid this by not having one. However, I am now experiencing some drawbacks. Having a local database could, for example, make locally downloaded files discoverable regardless of the `ESGFCatalog` configuration.

It strikes me now that the additional complexity is worth the added benefit. It is really not acceptable that you cannot find a file that exists locally on your system. So then the question is: which format makes the most sense?
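To make the corruption concern concrete: the database only needs to track a handful of fields per file, and marking a row complete only after its checksum verifies sidesteps the interrupted-download problem. A hypothetical sketch (table and column names are illustrative, not an actual intake-esgf layout):

```python
# A hypothetical minimal schema (names are illustrative, not intake-esgf's
# actual layout) for tracking files as they are downloaded.
import sqlite3

con = sqlite3.connect("local_files.db")
con.execute(
    """CREATE TABLE IF NOT EXISTS files (
           checksum   TEXT PRIMARY KEY,  -- checksum reported by the index node
           dataset_id TEXT NOT NULL,     -- ESGF dataset identifier
           path       TEXT NOT NULL,     -- local path built from the template
           complete   INTEGER NOT NULL   -- 0 while downloading, 1 once verified
       )"""
)
# Insert with complete=0 before downloading, then flip to 1 only after the
# checksum verifies; interrupted or canceled downloads never look 'complete'.
con.execute(
    "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
    ("abc123", "CMIP6.example.dataset", "/data/CMIP6/tas.nc", 1),
)
con.commit()
```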
`esgpull` uses a SQLite database. I have some experience with this, but it would be a small learning curve. It seems that pandas even has a `read_sql()` function, which would make integration with `intake-esgf` easy.

Or the `feather` format? See this article.

Probably SQLite is a better choice given that we will be writing to the database in parallel. @mgrover1 thoughts?
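On the parallel-write point, SQLite's WAL journal mode is what makes this workable: readers proceed while a writer commits, and concurrent writers queue briefly rather than erroring (given a timeout). A sketch of that plus the `read_sql()` integration, with illustrative table/column names:

```python
# A sketch of the SQLite + pandas combination (table/column names are
# illustrative). WAL mode lets readers proceed while a writer commits,
# and a generous timeout makes parallel writers queue instead of erroring.
import sqlite3
import pandas as pd

con = sqlite3.connect("holdings.db", timeout=30)  # wait on concurrent writers
con.execute("PRAGMA journal_mode=WAL")            # readers no longer block writers
con.execute("CREATE TABLE IF NOT EXISTS files (dataset_id TEXT, path TEXT)")
con.execute("INSERT INTO files VALUES (?, ?)", ("CMIP6.example.dataset", "/data/tas.nc"))
con.commit()

# read_sql() pulls the table straight into a DataFrame for intake-esgf:
df = pd.read_sql("SELECT dataset_id, path FROM files", con)
```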