aiidateam / aiida-core

The official repository for the AiiDA code

"Store in the DB what you want to query for" #3714

Open ltalirz opened 4 years ago

ltalirz commented 4 years ago

We currently advise plugin developers to store in the DB what they want to query for, and the rest in the file repository. I think this is sound advice.

While analyzing the size of a production DB, I noticed some places in aiida-core where we may be violating this principle (?).

A) As it turns out, the largest rows in the db_dbnode table belong to process nodes, whose attribute field is about 2 kB in size. The reason is that we store the raw squeue output in the last_jobinfo->raw_data field. Is this something anyone wants to query for? Would it not be better to parse it and then write it to a log file?
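To see which attribute is responsible for the size of such a row, one can rank the keys by their serialized size. The following is a minimal sketch on a plain dictionary, not AiiDA's API: the helper `attribute_sizes` and the example values are hypothetical, and only the names `last_jobinfo` and `raw_data` come from this issue.

```python
import json


def attribute_sizes(attributes):
    """Return (key, size_in_bytes) pairs for a JSON-serializable dict, largest first."""
    sizes = {
        key: len(json.dumps(value).encode("utf-8"))
        for key, value in attributes.items()
    }
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


# Made-up attributes of a process node. Apart from `last_jobinfo` and
# `raw_data`, the keys and values are assumptions for illustration.
example_attributes = {
    "process_state": "finished",
    "exit_status": 0,
    "last_jobinfo": {
        "job_id": "12345",
        "raw_data": ["12345", "R", "node01"] * 100,  # stand-in for raw scheduler output
    },
}

ranked = attribute_sizes(example_attributes)
for key, size in ranked:
    print(f"{key}: {size} bytes")
```

In this toy example `last_jobinfo` dominates the row, which mirrors the observation above.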

B) The largest rows in the db_dblog table are also several kB in size. The log messages can contain a potentially very long Python traceback. Do we want to query for that? I also noticed that we seem to be storing this potentially long message twice: once in the top-level message column and once in the message field of the metadata jsonb column.
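The effect of the duplication can be illustrated on a plain dictionary that mimics one log row. This is only a sketch: `strip_duplicate_message` is a hypothetical helper, the `levelname` key is an assumption, and the long string is a stand-in for a real traceback. Only the `message` column and the `message` field inside `metadata` are taken from this issue.

```python
import json


def strip_duplicate_message(message, metadata):
    """Return a copy of metadata without its 'message' key if it repeats the top-level message."""
    if metadata.get("message") == message:
        return {key: value for key, value in metadata.items() if key != "message"}
    return dict(metadata)


# Stand-in for a long traceback (placeholder content, not a real log).
long_message = "Traceback (most recent call last): " + "x" * 5000

row = {
    "message": long_message,
    "metadata": {"levelname": "ERROR", "message": long_message},
}
row_deduped = {
    "message": row["message"],
    "metadata": strip_duplicate_message(row["message"], row["metadata"]),
}

before = len(json.dumps(row))
after = len(json.dumps(row_deduped))
print(f"serialized size before: {before} bytes, after: {after} bytes")
```

Dropping the repeated field roughly halves the size of a row whose message is long, while the remaining metadata stays untouched.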

The reason I'm asking is that for large screening studies (say, 1M materials), you are dealing with at least ~10M nodes. For example, we already designed the CifData class so that you don't need to store the atoms in the DB, but if AiiDA then stores 10 kB of data per process node in the DB (meaning 1M processes directly imply a database of >= 10 GB), these savings become irrelevant.
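The back-of-the-envelope estimate can be written out as follows. This is a sketch that assumes decimal units (1 kB = 1000 bytes, 1 GB = 10^9 bytes) and ignores indexes and any other database overhead; the function name `db_size_gb` is made up for illustration.

```python
def db_size_gb(n_rows, bytes_per_row):
    """Lower bound on table size in GB, counting only the row payload."""
    return n_rows * bytes_per_row / 1e9


n_processes = 1_000_000

# 10 kB per process node, the scenario described above.
size_10kb = db_size_gb(n_processes, 10_000)

# 2 kB per process node, the attribute size observed in point A.
size_2kb = db_size_gb(n_processes, 2_000)

print(f"10 kB per node: {size_10kb} GB")
print(f" 2 kB per node: {size_2kb} GB")
```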

Mentioning @sphuber and @giovannipizzi for comment

sphuber commented 4 years ago

Good point. Note that the large process nodes here are really just the CalcJobNodes. I think there is certainly a case to be made for moving the raw last job info to the repository. Maybe we can come up with a set of properties that are most likely to be queried and leave those in the attributes. However, until we fix the repository and make it scale, we will most likely just shift the problem. For my big databases, it is not really the database that is the problem but rather the repository, as it overwhelms the file system and is impossible to back up. The same goes for the exceptions. If we are duplicating information, that should definitely be fixed, and there too we should check whether we are better off moving things to the repository. I think many of these things are perfect candidates for discussion during the CINECA hackathon. I will add it to the tentative agenda.
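The idea of keeping only the properties most likely to be queried in the attributes could look roughly like this. It is a sketch on plain dictionaries, not a proposal for the actual implementation: `split_job_info`, the choice of `QUERYABLE_KEYS`, and all field names except `raw_data` are assumptions.

```python
import json

# Hypothetical choice of fields worth querying for; to be decided in discussion.
QUERYABLE_KEYS = {"job_id", "job_state"}


def split_job_info(job_info, queryable_keys):
    """Split job info into a small queryable part and a bulky remainder."""
    queryable = {k: v for k, v in job_info.items() if k in queryable_keys}
    bulky = {k: v for k, v in job_info.items() if k not in queryable_keys}
    return queryable, bulky


# Made-up job info; only `raw_data` is a name taken from this issue.
job_info = {
    "job_id": "12345",
    "job_state": "done",
    "raw_data": ["12345", "R", "node01"] * 100,
}

for_attributes, for_repository = split_job_info(job_info, QUERYABLE_KEYS)

# The bulky part would be serialized to a file in the repository
# instead of being stored in the database row.
repository_payload = json.dumps(for_repository)

print(f"kept in attributes: {sorted(for_attributes)}")
print(f"moved to repository: {len(repository_payload)} bytes")
```

As noted above, this only moves the data elsewhere, so it helps only once the repository itself scales.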