RMG Improvement Proposal [RIP]: Make `RMG-database` a SQL Database [COMMENTS WELCOME!]

[!IMPORTANT] Please read and comment on this issue if you are a developer or user of RMG - we need lots of input!

The purpose of this issue is to centralize discussion around a significant change that @JacksonBurns and @jonwzheng are proposing for RMG-database. Also see the first RIP here: https://github.com/ReactionMechanismGenerator/RMG-Py/issues/2684

This issue is styled after the Python Enhancement Proposal (see PEP 733 for an example), thus the name 'RMG Improvement Proposal', or RIP for short.

`RMG-database` Today

RMG-database is a collection of Python files organized in a layout reflecting their contents. This includes:

statmech, transport, thermo, and solvation- contains libraries of literature results for individual molecules and then groups for making estimates
kinetics - similar to the above, except that groups belong to families which also contain rules and training data for generating said groups and rules.
surface - limited library with some surface data.
reference_sets - actually not Python files, but yml files containing "vetted" values for RMG/Arkane to fit data to perform BACs
quantum_corrections - Python dictionaries containing frequency scale factors, BACs, and AECs

In order to interact with RMG-database, one must have a working installation of RMG-Py and use its associated data classes to access numbers stored here. RMG-Py itself runs Python exec function on these files to load them into global memory, once per process.

Challenges with Today's `RMG-database`

This format introduces many 'hard' and 'soft' challenges, which are detailed below:

Using exec in this context (and in general) is bad coding practice
- Security risk, because it involves executing arbitrary user input code
- It relies on using the global state, which is difficult to debug and itself considered bad coding practice
Accessing RMG-database is only possible via RMG-Py, whose dependency requirements are so restrictive that it functionally forbids running RMG-database with anything else installed
Contributing to RMG-database is difficult since it requires formatted Python files, does not interface with any common data munging software, and as mentioned before is difficult to debug
Loading the entire database is slow, which can be seen when loading a database page for the first time since restart in RMG-website
Much of what we validate in the unit tests for RMG-database can be had for free with a SQL database, examples:
- kinetics_check_siblings_for_parents checks that siblings nodes are not also parent nodes, which can easily be validated at creation time in a SQL database
- check_surface_thermo_libraries_have_surface_attributes checks that certain entries have expected rows, which we can enforce by simple construction in a SQL database

Our Proposal and a Working Demo

@jonwzheng and I (@JacksonBurns) propose we overhaul RMG-database from the ground up as a SQL database. To that end, a working demo has been built showing how statmech could be converted into a SQL database, see this repository: https://github.com/JacksonBurns/rmgdb

In short, we do the following for each of the sub-databases in RMG-database that follow the library + family/group setup:

Define a set of Tables which represent each of the RMG Classes that are called within RMG-database, such as LinearRotor, and a 'base' table to hold all the calls to entry in each of the sub-databases. The aforementioned validation can be implemented here using SQL constructs like Triggers, Constraints, Key Relationships, etc Example of libraries schema Example of groups schema
exec the files in RMG-database, but trick them into generating our new Tables rather than RMG classes.

Dump the database into a plaintext format like .yml which would replace the Python source files we currently have, like this:

short_description: B3LYP/GTBas3
long_description: ''
label: HF
adjacency_list: |2

1 F u0 p3 c0 {2,S}
2 H u0 p0 c0 {1,S}
statmech:
energy: -282.308
energy_unit: kJ/mol
modes:
ideal_gas_translation:
  mass: 20.0062
  mass_unit: amu
linear_rotor:
  linear_inertia: 0.809097
  linear_inertia_unit: amu*angstrom^2
  linear_symmetry: 1.0
harmonic_oscillator:
  harmonic_freq_unit: cm^-1
  harmonic_freq_1: 4113.43

Load the database from the .yml files, enabling users to contribute to RMG-database by just editing the .yml files (proper configuration and formatting can be enforced by GitHub actions).

Critically, this would allow the users to run one-liner commands with only pandas installed to then interact with the RMG-database, like this:

(rmgdb) (base) jackson@jackson-Precision-7540:~/rmgdb/data/rmgdatabase/statmech$ python
Python 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.read_sql("SELECT * FROM statmech_libraries_view", "sqlite:///statmech.db")
      id         name short_description long_description            label  ... harmonic_freq_8  harmonic_freq_9 harmonic_freq_10  harmonic_freq_11  harmonic_freq_12
0      0  halogens_G4      B3LYP/GTBas3                                HF  ...             NaN              NaN              NaN               NaN               NaN
1      1  halogens_G4      B3LYP/GTBas3                               HBr  ...             NaN              NaN              NaN               NaN               NaN
2      2  halogens_G4      B3LYP/GTBas3                               HCl  ...             NaN              NaN              NaN               NaN               NaN
3      3  halogens_G4      B3LYP/GTBas3                                F2  ...             NaN              NaN              NaN               NaN               NaN
4      4  halogens_G4      B3LYP/GTBas3                               FCl  ...             NaN              NaN              NaN               NaN               NaN
..   ...          ...               ...              ...              ...  ...             ...              ...              ...               ...               ...
189  189  halogens_G4      B3LYP/GTBas3                   BrC(Br)DC(Br)Br  ...         496.116          632.671          752.302           866.301           1587.25
190  190  halogens_G4      B3LYP/GTBas3                   ClC(Cl)DC(Cl)Cl  ...         538.764          775.262          891.705           968.279           1625.20
191  191  halogens_G4      B3LYP/GTBas3                   ClC(Br)DC(Br)Br  ...         507.498          664.464          795.117           899.928           1596.78
192  192  halogens_G4      B3LYP/GTBas3                   ClC(Cl)DC(Br)Br  ...         520.415          715.925          810.937           933.346           1606.58
193  193  halogens_G4      B3LYP/GTBas3                   ClC(Cl)DC(Cl)Br  ...         528.548          737.885          864.898           949.118           1616.44

[194 rows x 33 columns]

This would make it trivially easy to access RMG-database and its wealth of chemical data.

There are further benefits on the RMG-Py side of things. Navigating the decision tree structure is dramatically faster than the current setup because it uses the SQL adjacency list layout for storing hierarchical data, enabling tree navigation by simple matching of integers. Accessing the data in this way will also have a massive positive impact on RMG-Py's memory consumption - the current setup requires each parallel process to load the entire database into memory, whereas this would be shared among all processes and allow easy loading of only the required data.

Next Steps, Drawbacks, and Open Questions

The amount of value of the data (and how difficult it is to get to it) in RMG-database makes this step worth doing on its own. Part of the reason that the linked demo already exists is because @jonwzheng and I will likely see it to its end even if just for our own usage, since we would like to be able to access RMG-database in other projects.

The purpose for this issue, then, is to discuss what issues this could bring up with RMG-Py and how we can mitigate them during the design process.

Difficulty of Integration with `RMG-Py` and `RMG-website`

The database is arguably the most important piece of the RMG-Py-puzzle, and so it is used throughout the source code in many different functions. This is not a critical issue, since re-implementing any of the needed functionality will just be a matter of effort, but it is worth mentioning. Also promising is that most of this functionality never changes (i.e., we will always need to find all the ancestors for a given node, a function which would never change), so once we implement it in SQL and wrap it in Python it can just sit.

We would need to update all of our new user documentation. Workshop materials from previous years will also become out-of-date.

More serious is the integration with the various notebooks and scripts people have assembled over the years to create RMG-database. While we can do the best we can to provide examples of changing the main RMG code to work with the new database, things like the group fitting notebooks will need serious overhauls both on the main branch and for people locally, as well as the RMG-website source code.

On that note...

Backwards Compatibility

This would be totally backwards incompatible with previous versions. Outstanding PRs, as well, would need to be restarted completely in order to work.

...and...

What do we Keep?

This is perhaps the biggest open question. From our understanding, the libraries (used as lookup tables during RMG simulations) in each of the sub-databases contain data scraped from literature/simulated by us for various chemicals and chemicals reactions - we would definitely keep those. The training directories, as well as the rules and groups files are less obvious to handle. We believe that the training reactions can be generated automatically from the libraries, though that has not been done for all meaning that some of them are hand constructed. Similarly, some of the rules and groups appear to be hand-built whereas others are machine generated.

We ask this question for two reasons - it will inform the design of the database, and because it will determine the scope of the work. If it turns out that we want a way to automatically refit all the trees whenever we push new data, thus replacing the rules, groups, and training (?), we could incorporate that into this larger effort.

Please let us know your thoughts and any suggestions you have about how to best approach this - especially those related to the kinetics database, specifically the workflow of library -> training -> rules/groups. After this issue has been opened, we will schedule a board meeting to discuss the way forward. Thank you!

Thank you both for your speedy feedback! I will quote reply to individual points to make sure we cover everything.

Matt's Notes

User/Developer Vision: How do you guys envision this changing how users and developers interact with and query the database?

For users who just want to contribute a couple known reactions, thermochemical values, etc. they would be able to clone RMG-database, edit the YAML files to add/modify their results (made easier by the reduced formatting), and then open a PR. They would never be responsible for generating the database; GitHub Continuous Integration would handle that.

For users who want to add a new library for their own local installation, the procedure would be the same as it is now, with the exception that they would need to first add the data and then generate the database. We plan to make this generation code part of the rmgdb standard, to which RMG-database will conform, which would make this step easy.

For developers, the workflow would be largely the same as it is now, but just editing in YAML format. We would thoroughly document the process for building the actual database file so that they can ensure their changes are correct and unit tests are passed, though that will also be done through GitHub actions anyway.

How does this change how we look at, update, query, modify, version control etc. the RMG-database?

Reading from the database would take place via Python functions that @jonwzheng and I write as part of the rmgdb standard. Examples include the function we wrote to load the entire statmech database - it is pure Python, though it uses the underlying SQL engine behind the scenes.

Modifying the database would happen by modifying YAML files, and then rebuilding the database from the YAML files. It is these same YAML files containing the data which would be version controlled. This was chosen so that git diffs would still be meaningful, though YAML is not the final choice (open to suggestions), and so that users could make edits without learning any 'style' (like the current Python formatting).

Do students need to learn SQL to do these things?

Our goal is that users can interact with rmgdb with no SQL at all. Once we have the Tables, Constraints, etc. set up we do not expect them to change over time, since the data we want to track will be necessarily consistent over time. The algorithms we use on the database (e.g. "find all descendants of a node" or "check if nodes are siblings") can be implemented as part of the rmgdb standard and remain static, much like their current implementations in RMG-Py. If for any reason the SQL schema or views do need to be modified, making changes is simple and can be accomplished through the sqlalchemy API (examples: schema and views).

Programmatic/Computational Vision: How do you guys envision this changing how RMG-Py programmatically and computationally interacts with the RMG-database? How does this affect how we write code in RMG-Py? What queries are being made and how often? Are there speed/computational concerns? If so how what benchmarks would you run in the future, if you haven't already, to resolve that?

I covered this partially above, but will restate - our goal is that many of the functions that RMG-Py uses for navigating the database and accessing records can be implemented in rmgdb and remain there. To better address how this standard would impact coding in RMG-Py, see the timeline at the end of this reply.

As far as speed/scaling/benchmarks - we will keep these suggestions in mind and work something up!

Installation Vision: How do you guys envision this changing how RMG is installed? Databases present a number of more unique installation challenges beyond simply installing the software for the database...How does the database get setup in a typical installation? Installing and keeping databases alive on clusters particularly that aren't owned by the user can be tricky, forbidden or require special permission...how do we help users in that situation?

For the time being, we have chosen the sqlite database as the actual implementation of the SQL standard. This is very lightweight and is included as part of Python's standard library, so no new dependencies except sqlalchemy (our chosen SQL wrapper code, which is robust and well established) are required. We envision having an rmgdatabase package (like we have now) that includes the rendered .db file, which can then be accessed via the pre-written Python functions included in the rmgdb standard.

I believe that for remote computing systems, the only sqlite challenge would be file locking, though this is something that can be worked around. sqlite does not require running a server, making HTTPS requests, etc. - it is all done in memory, locally.

Alon's Notes

[...] If we can quantify, or even just reasonably estimate, the input data uncertainties and then propagate those into the actual database thermo-kinetic estimations -- that would be a huge thing. [...]

This is definitely something we can look to include. When we look to edit the tree fitting and training generation code, we can consider adding fields to the database and steps in the algorithm which achieve these goals.

Another point to consider is that our kinetics data has Tmin and Tmax. They are not always given, which is sometimes understandable. But in my understanding they are not used by RMG when making estimations. Maybe we could implement this as well into the new database.

When we get to the stage of integrating RMG-Py with rmgdb we can keep this mind! (See last part of this giant comment).

[...] We could consider adding an attribute for a library reaction of whether it is appropriate for gas/liquid/or both.

Great suggestion! This actually lends itself very well to some SQL programming constructs. What I imagine is that we have one table that stores our known Reactions. We then have a separate table, which is linked to that Reactions table by the unique identifier for each reaction, that stores Correction terms. It is then trivial to write a SQL command which loads reactions with (or without) corrections, etc. that can then be wrapped in Python and made easily available to RMG.

Our Proposed Path Forward

After some offline discussion, @jonwzheng and I have come up with this proposed plan for following through on this RIP.

Database Version 4.0.0 (one major version above the current RMG-database) - a complete rewrite of just RMG-database's data into rmgdatabase (a separate repository), the latter being an actual collection of data which conforms to the new rmgdb standard for storing chemical data. Should be at parity with the current RMG-database. Include helper functions and demo notebooks just showing how the data can be accessed, edited, etc.
Version 4.1.0 - we re-implement the 'asynchronous' parts of RMG-database, mainly training generation and tree fitting. This step will hopefully not change the actual database layout, but focus more on cleaning up the hand-written trees, as well as cleaning up the data in general. This would be an improved version of the data in the current RMG-database.
Version 4.2.0 - we re-implement the 'synchronous' parts of RMG-Py and RMG-website that rely on the data, in the process removing RMG-database and replacing it with rmgdatabase.

JacksonBurns / RMG-actually-a-database