RMG Improvement Proposal [RIP]: Make `RMG-database` a SQL Database [COMMENTS WELCOME!]

JacksonBurns commented 1 month ago

[!IMPORTANT] Please read and comment on this issue if you are a developer or user of RMG - we need lots of input!

The purpose of this issue is to centralize discussion around a significant change that @JacksonBurns and @jonwzheng are proposing for RMG-database. Also see the first RIP here: https://github.com/ReactionMechanismGenerator/RMG-Py/issues/2684

This issue is styled after the Python Enhancement Proposal (see PEP 733 for an example), thus the name 'RMG Improvement Proposal', or RIP for short.

`RMG-database` Today

RMG-database is a collection of Python files organized in a layout reflecting their contents. This includes:

statmech, transport, thermo, and solvation- contains libraries of literature results for individual molecules and then groups for making estimates
kinetics - similar to the above, except that groups belong to families which also contain rules and training data for generating said groups and rules.
surface - limited library with some surface data.
reference_sets - actually not Python files, but yml files containing "vetted" values for RMG/Arkane to fit data to perform BACs
quantum_corrections - Python dictionaries containing frequency scale factors, BACs, and AECs

In order to interact with RMG-database, one must have a working installation of RMG-Py and use its associated data classes to access numbers stored here. RMG-Py itself runs Python exec function on these files to load them into global memory, once per process.

Challenges with Today's `RMG-database`

This format introduces many 'hard' and 'soft' challenges, which are detailed below:

Using exec in this context (and in general) is bad coding practice
- Security risk, because it involves executing arbitrary user input code
- It relies on using the global state, which is difficult to debug and itself considered bad coding practice
Accessing RMG-database is only possible via RMG-Py, whose dependency requirements are so restrictive that it functionally forbids running RMG-database with anything else installed
Contributing to RMG-database is difficult since it requires formatted Python files, does not interface with any common data munging software, and as mentioned before is difficult to debug
Loading the entire database is slow, which can be seen when loading a database page for the first time since restart in RMG-website
Much of what we validate in the unit tests for RMG-database can be had for free with a SQL database, examples:
- kinetics_check_siblings_for_parents checks that siblings nodes are not also parent nodes, which can easily be validated at creation time in a SQL database
- check_surface_thermo_libraries_have_surface_attributes checks that certain entries have expected rows, which we can enforce by simple construction in a SQL database

Our Proposal and a Working Demo

@jonwzheng and I (@JacksonBurns) propose we overhaul RMG-database from the ground up as a SQL database. To that end, a working demo has been built showing how statmech could be converted into a SQL database, see this repository: https://github.com/JacksonBurns/rmgdb

In short, we do the following for each of the sub-databases in RMG-database that follow the library + family/group setup:

Define a set of Tables which represent each of the RMG Classes that are called within RMG-database, such as LinearRotor, and a 'base' table to hold all the calls to entry in each of the sub-databases. The aforementioned validation can be implemented here using SQL constructs like Triggers, Constraints, Key Relationships, etc Example of libraries schema Example of groups schema
exec the files in RMG-database, but trick them into generating our new Tables rather than RMG classes.

Dump the database into a plaintext format like .yml which would replace the Python source files we currently have, like this:

short_description: B3LYP/GTBas3
long_description: ''
label: HF
adjacency_list: |2

1 F u0 p3 c0 {2,S}
2 H u0 p0 c0 {1,S}
statmech:
energy: -282.308
energy_unit: kJ/mol
modes:
ideal_gas_translation:
  mass: 20.0062
  mass_unit: amu
linear_rotor:
  linear_inertia: 0.809097
  linear_inertia_unit: amu*angstrom^2
  linear_symmetry: 1.0
harmonic_oscillator:
  harmonic_freq_unit: cm^-1
  harmonic_freq_1: 4113.43

Load the database from the .yml files, enabling users to contribute to RMG-database by just editing the .yml files (proper configuration and formatting can be enforced by GitHub actions).

Critically, this would allow the users to run one-liner commands with only pandas installed to then interact with the RMG-database, like this:

(rmgdb) (base) jackson@jackson-Precision-7540:~/rmgdb/data/rmgdatabase/statmech$ python
Python 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.read_sql("SELECT * FROM statmech_libraries_view", "sqlite:///statmech.db")
      id         name short_description long_description            label  ... harmonic_freq_8  harmonic_freq_9 harmonic_freq_10  harmonic_freq_11  harmonic_freq_12
0      0  halogens_G4      B3LYP/GTBas3                                HF  ...             NaN              NaN              NaN               NaN               NaN
1      1  halogens_G4      B3LYP/GTBas3                               HBr  ...             NaN              NaN              NaN               NaN               NaN
2      2  halogens_G4      B3LYP/GTBas3                               HCl  ...             NaN              NaN              NaN               NaN               NaN
3      3  halogens_G4      B3LYP/GTBas3                                F2  ...             NaN              NaN              NaN               NaN               NaN
4      4  halogens_G4      B3LYP/GTBas3                               FCl  ...             NaN              NaN              NaN               NaN               NaN
..   ...          ...               ...              ...              ...  ...             ...              ...              ...               ...               ...
189  189  halogens_G4      B3LYP/GTBas3                   BrC(Br)DC(Br)Br  ...         496.116          632.671          752.302           866.301           1587.25
190  190  halogens_G4      B3LYP/GTBas3                   ClC(Cl)DC(Cl)Cl  ...         538.764          775.262          891.705           968.279           1625.20
191  191  halogens_G4      B3LYP/GTBas3                   ClC(Br)DC(Br)Br  ...         507.498          664.464          795.117           899.928           1596.78
192  192  halogens_G4      B3LYP/GTBas3                   ClC(Cl)DC(Br)Br  ...         520.415          715.925          810.937           933.346           1606.58
193  193  halogens_G4      B3LYP/GTBas3                   ClC(Cl)DC(Cl)Br  ...         528.548          737.885          864.898           949.118           1616.44

[194 rows x 33 columns]

This would make it trivially easy to access RMG-database and its wealth of chemical data.

There are further benefits on the RMG-Py side of things. Navigating the decision tree structure is dramatically faster than the current setup because it uses the SQL adjacency list layout for storing hierarchical data, enabling tree navigation by simple matching of integers. Accessing the data in this way will also have a massive positive impact on RMG-Py's memory consumption - the current setup requires each parallel process to load the entire database into memory, whereas this would be shared among all processes and allow easy loading of only the required data.

Next Steps, Drawbacks, and Open Questions

The amount of value of the data (and how difficult it is to get to it) in RMG-database makes this step worth doing on its own. Part of the reason that the linked demo already exists is because @jonwzheng and I will likely see it to its end even if just for our own usage, since we would like to be able to access RMG-database in other projects.

The purpose for this issue, then, is to discuss what issues this could bring up with RMG-Py and how we can mitigate them during the design process.

Difficulty of Integration with `RMG-Py` and `RMG-website`

The database is arguably the most important piece of the RMG-Py-puzzle, and so it is used throughout the source code in many different functions. This is not a critical issue, since re-implementing any of the needed functionality will just be a matter of effort, but it is worth mentioning. Also promising is that most of this functionality never changes (i.e., we will always need to find all the ancestors for a given node, a function which would never change), so once we implement it in SQL and wrap it in Python it can just sit.

We would need to update all of our new user documentation. Workshop materials from previous years will also become out-of-date.

More serious is the integration with the various notebooks and scripts people have assembled over the years to create RMG-database. While we can do the best we can to provide examples of changing the main RMG code to work with the new database, things like the group fitting notebooks will need serious overhauls both on the main branch and for people locally, as well as the RMG-website source code.

On that note...

Backwards Compatibility

This would be totally backwards incompatible with previous versions. Outstanding PRs, as well, would need to be restarted completely in order to work.

...and...

What do we Keep?

This is perhaps the biggest open question. From our understanding, the libraries (used as lookup tables during RMG simulations) in each of the sub-databases contain data scraped from literature/simulated by us for various chemicals and chemicals reactions - we would definitely keep those. The training directories, as well as the rules and groups files are less obvious to handle. We believe that the training reactions can be generated automatically from the libraries, though that has not been done for all meaning that some of them are hand constructed. Similarly, some of the rules and groups appear to be hand-built whereas others are machine generated.

We ask this question for two reasons - it will inform the design of the database, and because it will determine the scope of the work. If it turns out that we want a way to automatically refit all the trees whenever we push new data, thus replacing the rules, groups, and training (?), we could incorporate that into this larger effort.

Please let us know your thoughts and any suggestions you have about how to best approach this - especially those related to the kinetics database, specifically the workflow of library -> training -> rules/groups. After this issue has been opened, we will schedule a board meeting to discuss the way forward. Thank you!

mjohnson541 commented 1 month ago

This is a big proposal! There are certainly some details I'm interested in, but maybe this isn't the best forum for a nuts and bolts discussion. More broadly I'm interested to hear:

1) User/Developer Vision: How do you guys envision this changing how users and developers interact with and query the database? How does this change how we look at, update, query, modify, version control etc. the RMG-database? Do students need to learn SQL to do these things?

2) Programmatic/Computational Vision: How do you guys envision this changing how RMG-Py programmatically and computationally interacts with the RMG-database? How does this affect how we write code in RMG-Py? What queries are being made and how often? Are there speed/computational concerns? If so how what benchmarks would you run in the future, if you haven't already, to resolve that?

3) Installation Vision: How do you guys envision this changing how RMG is installed? Databases present a number of more unique installation challenges beyond simply installing the software for the database...How does the database get setup in a typical installation? Installing and keeping databases alive on clusters particularly that aren't owned by the user can be tricky, forbidden or require special permission...how do we help users in that situation?

alongd commented 1 month ago

Excellent proposal, I fully support the aspects of converting the database from executing Python files to SQL. I'm also interested to hear thoughts about version controlling the new database. It's also a bug plus, as mentioned, to be able to use the RMG-database with minimal dependencies. 🥇

Some initial thoughts about the kinetics section: It could be OK to keep only the training reactions, and automatically generate the trees every time the database is updated, storing the updated tree with the generated rules. We also need to preserve the ability of the database to explain how it came up with estimations. Currently, a major thing we (and the community) are missing is high quality meta data and specifically uncertainties. If we can quantify, or even just reasonably estimate, the input data uncertainties and then propagate those into the actual database thermo-kinetic estimations -- that would be a huge thing. (we have this rank for kinetics, but it's not always used, plus it's not entirely unclear what this means in terms of uncertainty)

Another point to consider is that our kinetics data has Tmin and Tmax. They are not always given, which is sometimes understandable. But in my understanding they are not used by RMG when making estimations. Maybe we could implement this as well into the new database.

Another point for discussion if gas phase vs. liquid phase: Many of our kinetic libraries have mixed reactions in the sense that some of them could be used for liquid phase with an appropriate correction, while others are strictly gas phase reactions (e.g., with specific 3rd body colliders or with a PDep expression). We could consider adding an attribute for a library reaction of whether it is appropriate for gas/liquid/or both.

JacksonBurns commented 1 month ago

Thank you both for your speedy feedback! I will quote reply to individual points to make sure we cover everything.

Matt's Notes

User/Developer Vision: How do you guys envision this changing how users and developers interact with and query the database?

For users who just want to contribute a couple known reactions, thermochemical values, etc. they would be able to clone RMG-database, edit the YAML files to add/modify their results (made easier by the reduced formatting), and then open a PR. They would never be responsible for generating the database; GitHub Continuous Integration would handle that.

For users who want to add a new library for their own local installation, the procedure would be the same as it is now, with the exception that they would need to first add the data and then generate the database. We plan to make this generation code part of the rmgdb standard, to which RMG-database will conform, which would make this step easy.

For developers, the workflow would be largely the same as it is now, but just editing in YAML format. We would thoroughly document the process for building the actual database file so that they can ensure their changes are correct and unit tests are passed, though that will also be done through GitHub actions anyway.

How does this change how we look at, update, query, modify, version control etc. the RMG-database?

Reading from the database would take place via Python functions that @jonwzheng and I write as part of the rmgdb standard. Examples include the function we wrote to load the entire statmech database - it is pure Python, though it uses the underlying SQL engine behind the scenes.

Modifying the database would happen by modifying YAML files, and then rebuilding the database from the YAML files. It is these same YAML files containing the data which would be version controlled. This was chosen so that git diffs would still be meaningful, though YAML is not the final choice (open to suggestions), and so that users could make edits without learning any 'style' (like the current Python formatting).

Do students need to learn SQL to do these things?

Our goal is that users can interact with rmgdb with no SQL at all. Once we have the Tables, Constraints, etc. set up we do not expect them to change over time, since the data we want to track will be necessarily consistent over time. The algorithms we use on the database (e.g. "find all descendants of a node" or "check if nodes are siblings") can be implemented as part of the rmgdb standard and remain static, much like their current implementations in RMG-Py. If for any reason the SQL schema or views do need to be modified, making changes is simple and can be accomplished through the sqlalchemy API (examples: schema and views).

Programmatic/Computational Vision: How do you guys envision this changing how RMG-Py programmatically and computationally interacts with the RMG-database? How does this affect how we write code in RMG-Py? What queries are being made and how often? Are there speed/computational concerns? If so how what benchmarks would you run in the future, if you haven't already, to resolve that?

I covered this partially above, but will restate - our goal is that many of the functions that RMG-Py uses for navigating the database and accessing records can be implemented in rmgdb and remain there. To better address how this standard would impact coding in RMG-Py, see the timeline at the end of this reply.

As far as speed/scaling/benchmarks - we will keep these suggestions in mind and work something up!

Installation Vision: How do you guys envision this changing how RMG is installed? Databases present a number of more unique installation challenges beyond simply installing the software for the database...How does the database get setup in a typical installation? Installing and keeping databases alive on clusters particularly that aren't owned by the user can be tricky, forbidden or require special permission...how do we help users in that situation?

For the time being, we have chosen the sqlite database as the actual implementation of the SQL standard. This is very lightweight and is included as part of Python's standard library, so no new dependencies except sqlalchemy (our chosen SQL wrapper code, which is robust and well established) are required. We envision having an rmgdatabase package (like we have now) that includes the rendered .db file, which can then be accessed via the pre-written Python functions included in the rmgdb standard.

I believe that for remote computing systems, the only sqlite challenge would be file locking, though this is something that can be worked around. sqlite does not require running a server, making HTTPS requests, etc. - it is all done in memory, locally.

Alon's Notes

[...] If we can quantify, or even just reasonably estimate, the input data uncertainties and then propagate those into the actual database thermo-kinetic estimations -- that would be a huge thing. [...]

This is definitely something we can look to include. When we look to edit the tree fitting and training generation code, we can consider adding fields to the database and steps in the algorithm which achieve these goals.

Another point to consider is that our kinetics data has Tmin and Tmax. They are not always given, which is sometimes understandable. But in my understanding they are not used by RMG when making estimations. Maybe we could implement this as well into the new database.

When we get to the stage of integrating RMG-Py with rmgdb we can keep this mind! (See last part of this giant comment).

[...] We could consider adding an attribute for a library reaction of whether it is appropriate for gas/liquid/or both.

Great suggestion! This actually lends itself very well to some SQL programming constructs. What I imagine is that we have one table that stores our known Reactions. We then have a separate table, which is linked to that Reactions table by the unique identifier for each reaction, that stores Correction terms. It is then trivial to write a SQL command which loads reactions with (or without) corrections, etc. that can then be wrapped in Python and made easily available to RMG.

Our Proposed Path Forward

After some offline discussion, @jonwzheng and I have come up with this proposed plan for following through on this RIP.

Database Version 4.0.0 (one major version above the current RMG-database) - a complete rewrite of just RMG-database's data into rmgdatabase (a separate repository), the latter being an actual collection of data which conforms to the new rmgdb standard for storing chemical data. Should be at parity with the current RMG-database. Include helper functions and demo notebooks just showing how the data can be accessed, edited, etc.
Version 4.1.0 - we re-implement the 'asynchronous' parts of RMG-database, mainly training generation and tree fitting. This step will hopefully not change the actual database layout, but focus more on cleaning up the hand-written trees, as well as cleaning up the data in general. This would be an improved version of the data in the current RMG-database.
Version 4.2.0 - we re-implement the 'synchronous' parts of RMG-Py and RMG-website that rely on the data, in the process removing RMG-database and replacing it with rmgdatabase.

mjohnson541 commented 1 month ago

Cool! I definitely appreciate moving away from the current database format!

So my understanding is that: 1) We take our yaml database and use that to construct a running SQL database 2) When RMG starts it queries the SQL database for the libraries, trees and other models it needs and loads those into memory 3) After loading those RMG pretty much operates the same way it does now

If this is correct, particularly when running RMG, why not instead: 1) When RMG starts it loads the libraries, trees and other models from the yaml database 2) After loading those RMG pretty much operates the same way it does now

For a typical RMG user the second way seems to me to achieve the same result, but be much simpler and less error prone: less dependencies and avoids the trouble of maintaining an active database. This of course also doesn't preclude loading the yaml database into a SQL database for other purposes outside of running RMG.

Side note on tree generation: One could definitely build a workflow for retraining the rate coefficient SIDTs without too much work. The vast majority of the trees train virtually instantly or in a few minutes. However, a couple of the larger SIDTs can take hours to train, which may change some considerations.

ReactionMechanismGenerator / RMG-Py