Thermo for training reactions in reverse

davidfarinajr commented 3 years ago

If training reactions are written in reverse, we use thermo to fit the kinetics in the forward direction. If the thermo is poor, the rate we train the kinetics tree with will be poor as well.

For example, training reaction 2 in Birad_R_Recombination, NO2_p <=> NO + O, is written in the reverse direction. To reverse it, we need the thermo for NO2, NO, and O. O is in the primaryThermoLibrary, but NO and NO2 are not. If NO and NO2 are not in any of the thermo libraries specified in the input file, RMG falls back on group additivity to estimate their thermo, and since these estimates are poor, the reversed kinetics will be poor too. Therefore, even if we are running RMG without nitrogen in our system, we need good thermo for NO and NO2; otherwise the Birad_R_Recombination tree might be trained with poor kinetics.
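To give a sense of scale, here is a minimal sketch of how an error in the reaction free energy carries over to a reversed rate coefficient. It assumes the reverse rate is obtained from the forward rate and the equilibrium constant, with K_eq = exp(-ΔG/RT), so an error δ in ΔG changes the reversed rate by a factor of exp(δ/RT). The function name is made up for illustration and is not part of RMG.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def reverse_rate_error_factor(dg_error_kcal, temperature=298.0):
    """Multiplicative error in a reversed rate coefficient caused by an
    error (in kcal/mol) in the reaction free energy.

    Assumes k_rev = k_fwd / K_eq with K_eq = exp(-dG / (R*T)), so the
    error factor is exp(|error| / (R*T)) regardless of its sign.
    """
    return math.exp(abs(dg_error_kcal) / (R_KCAL * temperature))


if __name__ == "__main__":
    for err in (1.0, 2.0, 5.0):
        factor = reverse_rate_error_factor(err)
        print(f"{err:.1f} kcal/mol error at 298 K -> factor of about {factor:.0f}")
```

The error factor shrinks at higher temperature, so the effect is largest for low-temperature rates.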

Possible solutions:

1. Only write training reactions in the forward direction. That way, our RMG model will not be dependent on the thermo for the species in reverse training reactions.
2. Put all the species in all the reverse training reactions into a "Training_reactions" thermo library and load this library by default every time RMG is run.

davidfarinajr commented 3 years ago

I am not sure how widespread this problem is (how many training reactions are in reverse with species not in "commonly used" thermo libraries), but I think it could become more widespread as we add more training reactions for different systems (halogens, catalysis, sulfur, etc.)

mjohnson541 commented 3 years ago

I don't think this is quite as big of a problem as it first appears. The bigger problem is usually that you're using rates from one chemistry to estimate rates of another chemistry.

However, the automated rate trees won't have this issue (because they "compile" their rules during training) so if this is a problem the easiest way to solve it is to simply convert it to an automated tree.

davidfarinajr commented 3 years ago

I agree, and yes, converting to automated trees would fix this. But perhaps we should at least raise a warning when group additivity is used to estimate thermo for a reverse training reaction while the kinetics rules are being built at the start of an RMG job. That would have saved me time figuring out why my Birad_R_Recombination rates were way too fast.
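A rough sketch of what such a warning could look like is below. This is not RMG code: the function name is hypothetical, and it assumes each species' thermo carries a comment string that mentions "group additivity" when the value came from GAV, which would need to be checked against what RMG actually writes.

```python
import logging


def warn_if_gav_thermo(reaction_label, family, thermo_comments):
    """Warn when a reversed training reaction relies on GAV thermo.

    thermo_comments maps a species label to its thermo comment string
    (an assumed representation, not an RMG data structure).
    Returns the list of species labels that were flagged.
    """
    flagged = [
        label
        for label, comment in thermo_comments.items()
        if "group additivity" in comment.lower()
    ]
    if flagged:
        logging.warning(
            "Training reaction %s in family %s was reversed using "
            "group additivity thermo for: %s",
            reaction_label,
            family,
            ", ".join(flagged),
        )
    return flagged


if __name__ == "__main__":
    # Comment strings below are illustrative placeholders.
    comments = {
        "NO2_p": "Thermo group additivity estimation",
        "NO": "Thermo group additivity estimation",
        "O": "Thermo library: primaryThermoLibrary",
    }
    warn_if_gav_thermo("NO2_p <=> NO + O", "Birad_R_Recombination", comments)
```

For the example above this would flag NO2_p and NO but not O.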

mjohnson541 commented 3 years ago

That might work well for Birad_R_Recombination, which has only a few reactions with mostly small species. But when you do this for everything, I'm pretty sure you'll see way more of these cases than any human wants to see, and users probably won't know what to do about them.

rwest commented 3 years ago

There seems to be one argument that it's not a big problem, and another argument that there will be way more cases than any human wants to see.

Maybe we could run a script through the current database to see how common it actually is?

davidfarinajr commented 3 years ago

Sure, I did this yesterday using my halogens database branch, but all of the training reactions in master should also be on that branch. This spreadsheet has ~300 species in reverse training reactions from the default families (I didn't look at surface families).

species_from_reverse_training_reactions.xlsx

davidfarinajr commented 3 years ago

['C3',
 'thermo_DFT_CCSDTF12_BAC',
 'SABIC_aromatics',
 'NISTThermoLibrary',
 'BurcatNS',
 'JetSurF1.0',
 'JetSurF2.0',
 'SulfurHaynes',
 'C10H11',
 'DFT_QCI_thermo',
 'Lai_Hexylbenzene',
 'naphthalene_H',
 'Fulvene_H',
 'CBS_QB3_1dHR',
 'CH',
 'vinylCPD_H',
 'Narayanaswamy',
 'primaryThermoLibrary']

This is a list of thermo libraries from that spreadsheet that contain thermo for at least one training species. I added this list to the end of the thermoLibraries list in my input file, and that seems to have improved things.

davidfarinajr commented 3 years ago

This list is dependent on the thermo library ordering, but it should still cover most of the training species.

This problem will go away once all of the families in that spreadsheet are autogenerated. However, in the meantime, I think it's best to add these thermo libraries to rmg input files.
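For anyone doing the same, here is a small sketch of appending these libraries to an existing thermoLibraries list without duplicating entries or reordering the libraries you already had. Since the result depends on library ordering, your own libraries stay first and keep their priority. The helper name is hypothetical and not part of RMG.

```python
def append_missing_libraries(user_libraries, training_libraries):
    """Return user_libraries followed by any training_libraries
    not already present, preserving the original order of both."""
    merged = list(user_libraries)
    seen = set(merged)
    for name in training_libraries:
        if name not in seen:
            merged.append(name)
            seen.add(name)
    return merged


if __name__ == "__main__":
    mine = ["primaryThermoLibrary", "BurcatNS"]
    training = ["C3", "BurcatNS", "CH", "primaryThermoLibrary"]
    print(append_missing_libraries(mine, training))
```

The merged list can then be used as the thermoLibraries entry in the input file.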

davidfarinajr commented 3 years ago

Oops, I forgot to check for isomorphic species in that list. The number of unique species is ~230.

mjohnson541 commented 3 years ago

XD What I mean is that GAV is used to reverse training reactions far more often than it significantly affects estimation during your run. Much of the time GAV works well enough, and poorly reversed reactions from one chemistry usually won't affect estimates in other chemistries. When they do, it's usually because we really don't have good training data, not because reversing the reaction properly would solve the problem.

I also don't think it's good to just assume that RMG library values are better than GAV. I'm not incredibly familiar with all of the libraries above, but I wouldn't inherently trust at least C3 or CH over GAV estimates.

davidfarinajr commented 3 years ago

Yes, free energy estimates at 298 K with GAVs are within 2 kcal/mol of library values for most of them. So most of the time, the GAV estimate is good enough. However, there are a few cases where the GAV estimates are not good, particularly for small species like NO and some rings. So I don't think it's a widespread issue, but we should probably include NO in primaryThermoLibrary so we don't use GAV for it.

ReactionMechanismGenerator / RMG-database

Thermo for training reactions in reverse #461