nocollier opened this issue 10 months ago
I like the idea of maintaining a SQLite database! I think this offers the features we need and satisfies the requirements here.
I have restructured the queries that get file information, and they should not take nearly as long as before. Let's hold off on this until we see how much of a pain point it remains.
You could also consider just using a cache instead of a full-blown database, e.g. `requests_cache` is used by esgf-pyclient to cache searches. That also speeds up the searches that go over the internet, takes very little effort to maintain, and if it becomes corrupted for some reason it can just be deleted.
Thanks for this suggestion! Will take a look.
Currently we do not populate a local database when files are downloaded. Once `to_dataset_dict()` is called, we query the index node for file information and use the `directory_format_template_` and other information in the response to build up a local path. Essentially we are using the remote index as our database, which is only useful for querying if a specific file is present.

I chose this originally because no database means nothing to keep clean, nothing to maintain, no additional complexity. As downloads fail or are canceled by the user, the database could become corrupt, and we simply avoid this by not having one. However, I am now experiencing some drawbacks. Having a local database could, for example, make locally downloaded files discoverable regardless of the `ESGFCatalog` configuration.

It strikes me now that the additional complexity is worth the added benefit. It is really not acceptable that you cannot find a file that exists locally on your system. So then the question is: which format makes the most sense?
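To make the corruption concern concrete: the database only needs to track a handful of fields per file, and marking a row complete only after its checksum verifies sidesteps the interrupted-download problem. A hypothetical sketch (table and column names are illustrative, not an actual intake-esgf layout):

```python
# A hypothetical minimal schema (names are illustrative, not intake-esgf's
# actual layout) for tracking files as they are downloaded.
import sqlite3

con = sqlite3.connect("local_files.db")
con.execute(
    """CREATE TABLE IF NOT EXISTS files (
           checksum   TEXT PRIMARY KEY,  -- checksum reported by the index node
           dataset_id TEXT NOT NULL,     -- ESGF dataset identifier
           path       TEXT NOT NULL,     -- local path built from the template
           complete   INTEGER NOT NULL   -- 0 while downloading, 1 once verified
       )"""
)
# Insert with complete=0 before downloading, then flip to 1 only after the
# checksum verifies; interrupted or canceled downloads never look 'complete'.
con.execute(
    "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
    ("abc123", "CMIP6.example.dataset", "/data/CMIP6/tas.nc", 1),
)
con.commit()
```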
`esgpull` uses a SQLite database. I have some experience with this, but it would be a small learning curve. It seems that pandas even has a `read_sql()` function, which would make integration with `intake-esgf` easy.

Or the `feather` format? See this article.

Probably SQLite is a better choice given that we will be writing to the database in parallel. @mgrover1 thoughts?
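On the parallel-write point, SQLite's WAL journal mode is what makes this workable: readers proceed while a writer commits, and concurrent writers queue briefly rather than erroring (given a timeout). A sketch of that plus the `read_sql()` integration, with illustrative table/column names:

```python
# A sketch of the SQLite + pandas combination (table/column names are
# illustrative). WAL mode lets readers proceed while a writer commits,
# and a generous timeout makes parallel writers queue instead of erroring.
import sqlite3
import pandas as pd

con = sqlite3.connect("holdings.db", timeout=30)  # wait on concurrent writers
con.execute("PRAGMA journal_mode=WAL")            # readers no longer block writers
con.execute("CREATE TABLE IF NOT EXISTS files (dataset_id TEXT, path TEXT)")
con.execute("INSERT INTO files VALUES (?, ?)", ("CMIP6.example.dataset", "/data/tas.nc"))
con.commit()

# read_sql() pulls the table straight into a DataFrame for intake-esgf:
df = pd.read_sql("SELECT dataset_id, path FROM files", con)
```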