Reaction Degeneracy Calculation

mliu49 commented 7 years ago

Working on making benzene bonds reactive in RMG led me to look at how RMG calculates reaction degeneracy, which is currently somewhat questionable.

There have been many previous issues posted regarding degeneracy:

140 regarding asymmetrical groups
142 regarding kinetics for multiple transition states
552 pointing out that this is a problem
833 regarding effect of symmetry on reaction rate
857 on degeneracy of reactions with identical reactants
and numerous reports of "unable to calculate degeneracy" which is typically due to incomplete resonance structure generation, leading to non-recognition of duplicate reactions

Current Method

Currently, the degeneracy of any reaction in RMG is equal to the number of subgraph isomorphic matches between top level groups in reaction family trees and the reactant species. As a result, there are many cases where the degeneracy is affected by the code with no basis in chemistry.

Top nodes that are not mutually exclusive
- Eg. bug in Disproportionation tree that was recently found/fixed
- If the top level nodes are not mutually exclusive, then the degeneracy will be multiplied by the number of nodes that a species matches.
Groups that include adjacent atoms not involved in reaction
- Eg. in 1,2_Insertion_carbene, one of the reactants is R_R, and one of the top level nodes requires 3 H atoms be attached to one end
- With three H atoms defined, there are six ways for that to match a methyl group through subgraph isomorphism, multiplying the degeneracy of that particular match by 6.
Labeled atoms for electron accounting
- Eg. in R_Addition_MultipleBond, *2 labels the atom that gains a radical following the addition
- In reactions where the products have resonance structures, there can be multiple (most commonly 2) ways to label where the electron goes, which are identical via resonance, so the degeneracy is doubled.
- This is the most relevant to aromatics, and is the reason I started looking into this

Proposed Changes

I've been working on a potential solution based on atom-tracking which solves some of these issues but not all of them. The underlying idea is there is only one transition state to obtain products with a given atom configuration, in the vast majority of cases. Reaction degeneracy would then be defined as the number of different but equivalent transition states that gives the same product. There are some deficiencies with this idea, which I want to get feedback on. Work-in-progress can be seen in the reactBenzeneTest branch.

The new process for determining degeneracy starts by numbering all atoms in the reactant species. This numbering is separate from atom labels, and numbering is consecutive across all the reactants. These atom numbers are retained after recipe application, and are used to compare products. The actual values are irrelevant, but simply used to identify atoms.

To help compare products, a new comparison method was created in the Molecule class called isIdentical() (which might need to be renamed, since there is a Group.isIdentical() method that is not analogous). Essentially, this differs from isIsomorphic() in that it requires that the atom numbering matches between the compared molecules.

So during the degeneracy calculation, identical products do not contribute to degeneracy, under the assumption that they represent a single transition state, while isomorphic but non-identical products do contribute.

Here are some simple examples:

H_abstraction from methane by an H radical
- The hydrogen atoms in methane get numbered 2, 3, 4, and 5
- After the reaction, there are 4 different cases, where each of [2, 3, 4, 5] get abstracted
- Because the atom numbers are different between the products, all 4 cases contribute to degeneracy
- This case remains unchanged from before
R_Addition_MultipleBond with H radical adding to the middle carbon in C=C=C
- The carbons are numbered 1, 2, 3 in order and the H radical is numbered 4
- After the reaction, there are two cases, one where the radical electron ends up on 1 and one where it ends up on 3
- Because the atom connectivity is the same in both cases, with 4 connected to 2 and none of the other atoms changing, the degeneracy of this reaction remains 1
- This is different from before, where the degeneracy would have been 2
Intra_H_migration in n-pentyl radical, from 1 position to either 2 or 4
- The carbons are numbered 1, 2, 3, 4, 5, and the H atoms on carbon 1 are numbered 6, 7, 8
- After the reaction, there are six cases, where each of [6, 7, 8] is attached to either [2, 4] (they could also be attached to 3, but those would be be a separate reaction)
- All 6 cases have different connectivity and contribute to degeneracy
- This is also unchanged from before

So in the H_abstraction example, each of the H atoms on methane could be abstracted by a different, but equivalent transition state. In the R_Addition_MultipleBond example, there should only be one transition state for the H to add to the middle carbon, since the electron configuration should not change the transition state.

In the Intra_H_migration example, there are two different transition states, one for migration to carbon 2 and one for migration to carbon 4, which should have different rates. With both the current algorithm and the changes I've made so far, both transition states would be combined into a single reaction, and the rate used would be arbitrarily the first one that came in the list. This could potentially be fixed by checking if a different template is being matched and keeping both reactions as duplicates.

In sum, the atom tracking method I've been working on addresses the issues I cited above, which are a result of how reaction families are set up. However, it does not address the issue of reactions with multiple transition states with different kinetics.

Discussion topics:

Does an atom tracking method make sense?
Would comparing templates be a safe way to identify different transition states?
Are there cases where atom numbering matches but there are actually multiple transition states?
Do we want to incorporate symmetry effects? Maybe in the future?
Any additional considerations?

mliu49 commented 7 years ago

Also, see the RMG-tests build, which shows many rate differences, but from a fairly small set of families.

goldmanm commented 7 years ago

I think the logic of the atom tracking method makes sense. If the training reactions account for degeneracy, this method should effect rates estimated from training reactions less than from rate rules.

For your second and third points, I do not think this method is flawless. We could be missing transition states and ignoring others. I think the R addition multiple bond example you showed might be a case of this error. Though I think this is an improvement from the previous methods.

I think symmetry could (and should) be added to rate estimation. This will likely throw-off many more kinetics, especially rate rule based. If we are ever able to move away from rate rules, this would be a easy winner to implement. As of now, I think it would be hard to argue that including symmetry will be only good.

Since we are adding a piece of data for every atom in the system, I think we should check to see if memory and time performance changes between the two models. @KEHANG, you mentioned potentially looking at phenyl dodecane for this?

mliu49 commented 7 years ago

It seems that we were mistaken in thinking that RMG calculates degeneracy for training reactions before adding rules for them. It directly takes the degeneracy value that is listed in the training reaction entry, and if nothing is listed, assumes a degeneracy of 1. This implies that training rates are not being accurately extended to similar species if the degeneracy specified in the training rate is different from what RMG would determine the degeneracy to be.

In the 1,2_Insertion_carbene example I gave previously, an example reaction would be CH2 inserting into CC to make CCC. The carbene could reasonably attack any of the C-H bonds or the C-C bond, so it should have a degeneracy of 7. However, it currently has a degeneracy of 36. This family only has data from training reactions, with manually input degeneracy values that correspond to the number of bonds that the carbene could attack. This means that that the incorrect degeneracy value does not compensate for itself, as it would if RMG calculated the degeneracy of training reactions.

rwest commented 7 years ago

I haven't thought this through, but shouldn't attacking a CC bond hit a different node than attacking a CH bond? Even if the product is the same, the "reaction" is different (the way RMG thinks)?

nyee commented 7 years ago

@rwest, This is related to #142 and #323 (a two year old pull request we really should think about getting in).

In the case that you have A+B --> C having two different templates, RMG will arbitrarily pick the kinetics from one template and just up the degeneracy by one. This is even more egregious when you consider something like an symmetrical intra-H migration: [CH2]CCCCC <=> CCCC[CH]C which can have either a 3-member TS or a 6-member TS that can make the same thing...

nyee commented 7 years ago

@mliu49 I previously worked on the 1,2_Insertion_carbene family and did notice that there is a lot of problems calculating the degeneracy. I think part of the problem stems from the fact that reaction occurs symmetrically over a bond as opposed to usually an atom. Therefore there is almost always a factor of two error because you can always label switch the labels 2 and 3, and still get exactly the same product (and even TS I believe). Edit this is mostly explained in #140, you listed above.

With regards to training reactions though, don't we have to label the reacting atoms, so that reacting over C-H and C-C would be stored under different entries?

mliu49 commented 7 years ago

@nyee, training reactions are labeled, so the rate will get put under the correct template. In the 1,2_Insertion_carbene family, I think all the training reactions are insertion into C-H bonds, so the C-C side of the tree is probably lacks any data. However, when we generate reactions, we'll aggregate both types and multiply one of the rates by the combined degeneracy.

I think ideally, we would have duplicate reactions with different kinetics: one for attacking C-C with degeneracy=1 and one for attacking C-H with degeneracy=6. I will look into combining #323 into this current branch.

mliu49 commented 7 years ago

I looked more into keeping reactions with different templates as duplicate reactions, including integrating the changes in #323.

It seems that keeping such reactions separate during reaction generation is trivial, but it affects other methods where we expected only one reaction to be generated.

In calculateDegeneracy, we specify a reaction, then run __generateReactions to get the degeneracy. We currently expect one reaction which is isomorphic to the input reaction. If the reaction has multiple TS's, then more than one reaction will be returned. To identify the one that corresponds to the input reaction, we can compare templates, which should work in general.
In generateReactions, if the family is its own reverse, then we try to generate the reverse reaction using __generateReactions. Again, we expect one reverse reaction, but if there are multiple TS's, then we will get more than one reaction. In this case, it is much more difficult to determine which is the corresponding reverse reaction. Template comparison doesn't work because the reverse template will be different. @connie was unable to come up with a good solution for this, and it may be part of the reason why that PR was never merged.

goldmanm commented 7 years ago

About the issues discussed in #323:

I was thinking how we could get around the calculateDegeneracy issue. We could require that the user specify a template (or in the future, a transition state structure) in the Reaction object. If multiple TS are returned and neither is specified, we can use logging.warning to inform the user. After the warning, calculateDegeneracy can just return the maximum degeneracy.

Regarding point 2, we have a transition state molecular structure for reactions, we can determine the reverse reaction by isomorphism of the transition states.

ReactionMechanismGenerator / RMG-Py

Reaction Degeneracy Calculation #876

140 regarding asymmetrical groups

142 regarding kinetics for multiple transition states

552 pointing out that this is a problem

833 regarding effect of symmetry on reaction rate

857 on degeneracy of reactions with identical reactants

Current Method

Proposed Changes