Materials-Consortia / optimade-python-tools

Tools for implementing and consuming OPTIMADE APIs in Python
https://www.optimade.org/optimade-python-tools/
MIT License

MongoDB cache #881

Open JPBergsma opened 3 years ago

JPBergsma commented 3 years ago

In a discussion with @CasperWA and @giovannipizzi, we thought it would be good to have the option to store the cartesian_site_positions and species_at_sites fields outside MongoDB and only store a link to this data in MongoDB. The Mongo database would then be smaller and would more easily fit in the memory, or even the cache memory, of the server, thereby speeding up searches. It would be nice if functions that can automatically retrieve this data from files could be added to optimade-python-tools. This will be especially useful once the trajectory endpoint has been added to the OPTIMADE standard, as trajectories are much larger than structures.
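To make the idea concrete, here is a minimal sketch of what such a split could look like, where the Mongo document only keeps a reference to an external file and a hypothetical `load_sites` helper (not part of optimade-python-tools) reads the heavy arrays back on demand:

```python
import json
from pathlib import Path

# Hypothetical document layout: the queryable fields stay in MongoDB,
# while the heavy per-site arrays live in an external JSON file.
structure_doc = {
    "id": "struct-1234",
    "chemical_formula_reduced": "O2Si",
    "nsites": 3,
    "sites_file": "positions_store/struct-1234.json",  # reference instead of data
}


def load_sites(doc: dict) -> dict:
    """Fetch cartesian_site_positions and species_at_sites from the file
    referenced by the Mongo document (hypothetical helper)."""
    sites = json.loads(Path(doc["sites_file"]).read_text())
    return {
        "cartesian_site_positions": sites["cartesian_site_positions"],
        "species_at_sites": sites["species_at_sites"],
    }
```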

CasperWA commented 3 years ago

I think the main idea originally expressed by @giovannipizzi (he should obviously fill in a bit here) was to use a MongoDB database as a cache database. This would then run alongside whatever backend is otherwise implemented (be it AiiDA, Elasticsearch, SQLite, another MongoDB database, or anything, really).

The idea is that, especially with trajectory data, the database can explode in size. A dedicated MongoDB database would then represent a curated, minimized set of information used for querying on well-known entities, returning the minimum set of required and recommended data in a response. If other, more extensive information is required, the implementation then turns to the full specific backend, again, whatever it may be. In this way query speeds should be optimized, and we also have the possibility to optimize the data curation from a central point (this generally used package).
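A rough sketch of the two-tier lookup described above, with a hypothetical pymongo `cache` collection and a hypothetical `full_backend` object standing in for whatever backend is actually used:

```python
HEAVY_FIELDS = {"cartesian_site_positions", "species_at_sites"}


def get_structures(mongo_filter: dict, response_fields: set, cache, full_backend):
    """Answer the query from the small cache collection, only falling back to
    the full backend for the heavy fields (hypothetical interface)."""
    results = []
    for doc in cache.find(mongo_filter):
        requested_heavy = response_fields & HEAVY_FIELDS
        if requested_heavy:
            # fetch the large arrays from the real backend, keyed by the entry id
            doc.update(full_backend.get_entry(doc["id"], requested_heavy))
        results.append(doc)
    return results
```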

Of course, if disk size for indexing and similar is not an issue in an existing database implementation, and the speed gain is therefore minimal, then this can be turned off. But with the introduction of trajectory data, this might be useful for most implementations?

JPBergsma commented 3 years ago

Thank you for clarifying this. MongoDB is probably not a good choice of backend for handling the trajectories, as it has a document size limit of 16 MB, which would be too small to hold larger trajectories (although it is of course possible to break a large trajectory up into smaller pieces). Perhaps it would also be a good idea to look at how other fields, like geo-information science, handle these kinds of large datasets.

ml-evs commented 3 years ago

Just a technical comment that GridFS would be the canonical MongoDB way of handling binary blobs (see also https://docs.mongodb.com/manual/core/gridfs/).
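For reference, storing and retrieving a large binary blob with GridFS via pymongo looks roughly like this (database and file names are just placeholders); GridFS transparently splits the data into chunks, so the 16 MB document limit does not apply:

```python
import gridfs
from pymongo import MongoClient

db = MongoClient()["optimade_cache"]  # placeholder database name
fs = gridfs.GridFS(db)

# store a (possibly >16 MB) trajectory blob
with open("trajectory_0001.bin", "rb") as handle:
    file_id = fs.put(handle, filename="trajectory_0001.bin")

# stream it back later
blob = fs.get(file_id).read()
```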

As I missed the discussions I don't really understand the use case from our side. I guess we could add some abstractions that register some of the "queryable=False" OPTIMADE fields as binary data which could get pushed to a different backend, but presumably if such a field is requested via response_fields it still needs to be denormalized/joined so that it can be added to the JSON response? It is also not clear to me how this would work with the frame-selection view of trajectories as a collection of structures, as you still need to be able to index into this binary positions data.

Not expecting any answers but just my initial thoughts :sweat_smile:

giovannipizzi commented 3 years ago

Hi all, a few additional comments for context:

Therefore, after experimentation, we're converging on the idea of keeping all data in the native DB-provider format in the original DB, and then adding a layer on top, probably in MongoDB, that only contains the queryable OPTIMADE fields. This could be updated periodically by the DB owner in case data in the internal DB changes (if this is e.g. run daily with cron, the update can also take a few minutes without impacting the performance of live usage of the DB).

Of course, having this MongoDB layer should NOT be a requirement, but just a convenience for those DB implementers that find it useful (the code would be mostly shared, there would be no need to change the internal schema, it would be independent of the actual DB engine used in the backend, and optimisations could be developed once and for all in the same package for the MongoDB layer only). (Note: I agree with @ml-evs that one should use GridFS to store files if needed, but since my primary goal would be an efficient caching/querying layer in front of the actual DB, I would minimise the amount of data replicated into MongoDB and only keep the minimal queryable information, referring to the actual data in the internal DB when needed, e.g. for crystal structure coordinates, trajectories, ...; in this way one also avoids duplicating the size of each DB.)

I don't know if the best place is inside optimade-python-tools, or in a top-level package that depends on it. The way I envision this to work is that one could define an abstract class in which a number of methods and properties (the queryable ones) are defined and store/fetch data in MongoDB, while the others are abstract methods that would need to be implemented (e.g. to provide a DB-implementer-specific way to fetch atomic positions, trajectories, ...). There would also be some utility methods/functions (to be implemented by each DB provider) that define how to get the data to put in MongoDB (and the library could implement some standard conversions, e.g. get the structure in some common format and then take care automatically of computing and storing the cell volume, chemical formulas, number of sites, information on the species in a format that is easy to query, ...).

If the internal DB provides a reliable modification date, it would also be easy to make the daily update fast: each entry in this MongoDB layer would contain both the UUID of the structure in the source DB (this is needed anyway) and the mtime. When the update routine is called, the first thing it does is check whether there are new UUIDs, whether UUIDs have been deleted, and which known UUIDs have a more recent modification time in the source DB. Only those need to be re-processed, making an incremental update where nothing has changed very fast (probably under 5 seconds even for large DBs).
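To make this concrete, here is a very rough sketch of such an abstract class and the incremental update based on source UUID + mtime; every name here is invented for illustration, and nothing like this exists in optimade-python-tools yet:

```python
from abc import ABC, abstractmethod


class OptimadeMongoCache(ABC):
    """Keep only the queryable OPTIMADE fields in a MongoDB collection and
    delegate heavy data (positions, trajectories, ...) to the source DB."""

    def __init__(self, collection):
        self.collection = collection  # pymongo collection holding the cache

    @abstractmethod
    def iter_source_entries(self):
        """Yield (uuid, mtime) pairs for every entry in the source DB."""

    @abstractmethod
    def build_optimade_fields(self, uuid):
        """Compute the queryable OPTIMADE fields (formulas, nsites, cell
        volume, ...) for one source entry; provided by each DB implementer."""

    @abstractmethod
    def fetch_cartesian_site_positions(self, uuid):
        """Fetch the heavy per-site data directly from the source DB."""

    def incremental_update(self):
        """Re-process only new or modified source entries, drop deleted ones."""
        cached = {
            doc["source_uuid"]: doc["source_mtime"]
            for doc in self.collection.find({}, {"source_uuid": 1, "source_mtime": 1})
        }
        seen = set()
        for uuid, mtime in self.iter_source_entries():
            seen.add(uuid)
            if cached.get(uuid) == mtime:
                continue  # unchanged since the last sync
            fields = self.build_optimade_fields(uuid)
            fields.update({"source_uuid": uuid, "source_mtime": mtime})
            self.collection.replace_one({"source_uuid": uuid}, fields, upsert=True)
        stale = set(cached) - seen
        if stale:
            self.collection.delete_many({"source_uuid": {"$in": list(stale)}})
```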

I guess one would need to start implementing this to see if there are technical issues with the idea, but it would be very useful in my opinion. In addition, one could also think of having a specific implementation of the abstract class that stores the structure itself in some simple format (even just JSON) in MongoDB/GridFS. This would act as a reference implementation and could be used by people who just have a few structures and want to offer them via OPTIMADE (e.g. one could just import structures with ASE and have a routine that loads them into the DB in this simple implementation).
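A toy version of such a reference implementation, reading structures with ASE and inserting plain JSON documents into MongoDB (database, collection, and file names are placeholders):

```python
from ase.io import iread
from pymongo import MongoClient

collection = MongoClient()["optimade_reference"]["structures"]  # placeholder names

# read structures from any file format ASE understands and store them as
# simple documents together with a few derived queryable fields
for index, atoms in enumerate(iread("structures.extxyz")):
    collection.insert_one(
        {
            "id": f"structure-{index}",
            "nsites": len(atoms),
            "species_at_sites": atoms.get_chemical_symbols(),
            "cartesian_site_positions": atoms.get_positions().tolist(),
            "chemical_formula_descriptive": atoms.get_chemical_formula(),
        }
    )
```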

(sorry for the long message... I hope it's clear, otherwise feel free to ask for clarification)

JPBergsma commented 3 years ago

If it is not an option to store the OPTIMADE fields in the original database, and the OPTIMADE queries cannot easily be translated into queries for the original database, it would indeed be useful to have such a MongoDB cache. I do not know the specifics of the AiiDA databases you are talking about, so I do not know if this is the case for you.

I however do not understand this argument:

Therefore, to make OPTIMADE queries efficient, one has often to enrich the "native" data with additional OPTIMADE-like data in the DB, but this is not always easy or feasible (e.g. in the case of AiiDA, we 1) don't want every user to also have this data by default, and 2) if it were there by default, it might not be obvious how to create and update this data upon creation of a new crystal structure, in an efficient way also when managing thousands of structures).

I think you would have to generate those fields for the small MongoDB cache as well, so I do not think the MongoDB cache would solve this problem.

Some databases allow you to run scripts each time data is changed or added. In that case you can use such a script to automatically update the cache any time there is a change.
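With MongoDB itself as the source database, for example, a change stream could keep the cache in sync; a minimal sketch (collection names and the cached fields are placeholders, and change streams require a replica set):

```python
from pymongo import MongoClient

client = MongoClient()
source = client["provider_db"]["structures"]    # placeholder source collection
cache = client["optimade_cache"]["structures"]  # placeholder cache collection

# blocks and applies every change to the source collection as it happens
with source.watch(full_document="updateLookup") as stream:
    for change in stream:
        if change["operationType"] == "delete":
            cache.delete_one({"source_uuid": change["documentKey"]["_id"]})
        else:
            doc = change["fullDocument"]
            cache.replace_one(
                {"source_uuid": doc["_id"]},
                {"source_uuid": doc["_id"], "nsites": doc.get("nsites")},
                upsert=True,
            )
```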

Databases also apply several tricks, such as indexing, to speed up the query process. If it is not a problem to add the OPTIMADE fields to an existing database, implementing a separate database may not be faster, because the existing database may already be implementing a cache behind the scenes.
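For comparison, adding an index on a queryable OPTIMADE field in an existing MongoDB collection is a one-liner (collection names are placeholders):

```python
from pymongo import ASCENDING, MongoClient

collection = MongoClient()["provider_db"]["structures"]  # placeholder names

# let MongoDB answer filters on these fields from the index alone
collection.create_index([("chemical_formula_reduced", ASCENDING)])
collection.create_index([("nelements", ASCENDING), ("nsites", ASCENDING)])
```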

ml-evs commented 3 years ago

This is drifting a bit from the original caching intention of this issue, but I quite like the idea of mapping filters (or at least providing classes to map filters). We can continue the discussion on this at the older issue https://github.com/Materials-Consortia/optimade-python-tools/issues/837.

I think it could go a long way to solving some of these issues - we can also exploit the fact that filtering on many fields is optional, so this mechanism could also be used to disable certain filters (e.g. string matching chemical formulae that are in the wrong format somehow).