ReactionMechanismGenerator / RMG-Py

Python version of the amazing Reaction Mechanism Generator (RMG).
http://reactionmechanismgenerator.github.io/RMG-Py/
Other
376 stars 226 forks source link

Reaction Degeneracy Calculation #876

Closed mliu49 closed 7 years ago

mliu49 commented 7 years ago

Working on making benzene bonds reactive in RMG led me to look at how RMG calculates reaction degeneracy, which is currently somewhat questionable.

There have been many previous issues posted regarding degeneracy:

Current Method

Currently, the degeneracy of any reaction in RMG is equal to the number of subgraph isomorphic matches between top level groups in reaction family trees and the reactant species. As a result, there are many cases where the degeneracy is affected by the code with no basis in chemistry.

Proposed Changes

I've been working on a potential solution based on atom-tracking which solves some of these issues but not all of them. The underlying idea is there is only one transition state to obtain products with a given atom configuration, in the vast majority of cases. Reaction degeneracy would then be defined as the number of different but equivalent transition states that gives the same product. There are some deficiencies with this idea, which I want to get feedback on. Work-in-progress can be seen in the reactBenzeneTest branch.

The new process for determining degeneracy starts by numbering all atoms in the reactant species. This numbering is separate from atom labels, and numbering is consecutive across all the reactants. These atom numbers are retained after recipe application, and are used to compare products. The actual values are irrelevant, but simply used to identify atoms.

To help compare products, a new comparison method was created in the Molecule class called isIdentical() (which might need to be renamed, since there is a Group.isIdentical() method that is not analogous). Essentially, this differs from isIsomorphic() in that it requires that the atom numbering matches between the compared molecules.

So during the degeneracy calculation, identical products do not contribute to degeneracy, under the assumption that they represent a single transition state, while isomorphic but non-identical products do contribute.

Here are some simple examples:

So in the H_abstraction example, each of the H atoms on methane could be abstracted by a different, but equivalent transition state. In the R_Addition_MultipleBond example, there should only be one transition state for the H to add to the middle carbon, since the electron configuration should not change the transition state.

In the Intra_H_migration example, there are two different transition states, one for migration to carbon 2 and one for migration to carbon 4, which should have different rates. With both the current algorithm and the changes I've made so far, both transition states would be combined into a single reaction, and the rate used would be arbitrarily the first one that came in the list. This could potentially be fixed by checking if a different template is being matched and keeping both reactions as duplicates.

In sum, the atom tracking method I've been working on addresses the issues I cited above, which are a result of how reaction families are set up. However, it does not address the issue of reactions with multiple transition states with different kinetics.

Discussion topics:

mliu49 commented 7 years ago

Also, see the RMG-tests build, which shows many rate differences, but from a fairly small set of families.

goldmanm commented 7 years ago

I think the logic of the atom tracking method makes sense. If the training reactions account for degeneracy, this method should effect rates estimated from training reactions less than from rate rules.

For your second and third points, I do not think this method is flawless. We could be missing transition states and ignoring others. I think the R addition multiple bond example you showed might be a case of this error. Though I think this is an improvement from the previous methods.

I think symmetry could (and should) be added to rate estimation. This will likely throw-off many more kinetics, especially rate rule based. If we are ever able to move away from rate rules, this would be a easy winner to implement. As of now, I think it would be hard to argue that including symmetry will be only good.

Since we are adding a piece of data for every atom in the system, I think we should check to see if memory and time performance changes between the two models. @KEHANG, you mentioned potentially looking at phenyl dodecane for this?

mliu49 commented 7 years ago

It seems that we were mistaken in thinking that RMG calculates degeneracy for training reactions before adding rules for them. It directly takes the degeneracy value that is listed in the training reaction entry, and if nothing is listed, assumes a degeneracy of 1. This implies that training rates are not being accurately extended to similar species if the degeneracy specified in the training rate is different from what RMG would determine the degeneracy to be.

In the 1,2_Insertion_carbene example I gave previously, an example reaction would be CH2 inserting into CC to make CCC. The carbene could reasonably attack any of the C-H bonds or the C-C bond, so it should have a degeneracy of 7. However, it currently has a degeneracy of 36. This family only has data from training reactions, with manually input degeneracy values that correspond to the number of bonds that the carbene could attack. This means that that the incorrect degeneracy value does not compensate for itself, as it would if RMG calculated the degeneracy of training reactions.

rwest commented 7 years ago

I haven't thought this through, but shouldn't attacking a CC bond hit a different node than attacking a CH bond? Even if the product is the same, the "reaction" is different (the way RMG thinks)?

nyee commented 7 years ago

@rwest, This is related to #142 and #323 (a two year old pull request we really should think about getting in).

In the case that you have A+B --> C having two different templates, RMG will arbitrarily pick the kinetics from one template and just up the degeneracy by one. This is even more egregious when you consider something like an symmetrical intra-H migration: [CH2]CCCCC <=> CCCC[CH]C which can have either a 3-member TS or a 6-member TS that can make the same thing...

nyee commented 7 years ago

@mliu49 I previously worked on the 1,2_Insertion_carbene family and did notice that there is a lot of problems calculating the degeneracy. I think part of the problem stems from the fact that reaction occurs symmetrically over a bond as opposed to usually an atom. Therefore there is almost always a factor of two error because you can always label switch the labels 2 and 3, and still get exactly the same product (and even TS I believe). Edit this is mostly explained in #140, you listed above.

With regards to training reactions though, don't we have to label the reacting atoms, so that reacting over C-H and C-C would be stored under different entries?

mliu49 commented 7 years ago

@nyee, training reactions are labeled, so the rate will get put under the correct template. In the 1,2_Insertion_carbene family, I think all the training reactions are insertion into C-H bonds, so the C-C side of the tree is probably lacks any data. However, when we generate reactions, we'll aggregate both types and multiply one of the rates by the combined degeneracy.

I think ideally, we would have duplicate reactions with different kinetics: one for attacking C-C with degeneracy=1 and one for attacking C-H with degeneracy=6. I will look into combining #323 into this current branch.

mliu49 commented 7 years ago

I looked more into keeping reactions with different templates as duplicate reactions, including integrating the changes in #323.

It seems that keeping such reactions separate during reaction generation is trivial, but it affects other methods where we expected only one reaction to be generated.

goldmanm commented 7 years ago

About the issues discussed in #323:

I was thinking how we could get around the calculateDegeneracy issue. We could require that the user specify a template (or in the future, a transition state structure) in the Reaction object. If multiple TS are returned and neither is specified, we can use logging.warning to inform the user. After the warning, calculateDegeneracy can just return the maximum degeneracy.

Regarding point 2, we have a transition state molecular structure for reactions, we can determine the reverse reaction by isomorphism of the transition states.