ReactionMechanismGenerator / RMG-Py

Python version of the amazing Reaction Mechanism Generator (RMG).
http://reactionmechanismgenerator.github.io/RMG-Py/

machine learning in thermo module #1971

Closed jgiaccai closed 1 year ago

jgiaccai commented 4 years ago

Topic

General area which your question is related to.

Context

I have been successfully using the Thermo module to estimate thermo properties of large PAH molecules using group additivity. We've been getting some unexpected results, and I read in the online documentation that I may be better off using machine learning instead. When I submit a molecule to the thermo estimator with machine learning enabled, I get the error below. I think it is likely caused by a molecule that can't be kekulized.

Question

I know how the group additivity works based on the articles published by Yu et al. Is there any documentation (articles or other) on the machine learning methodology?

Can anyone confirm that the inability to kekulize the molecule is what is leading to the machine learning Thermo module not working?

Bug Description

When running the ML in the Thermo module I get the following error and no output is created.

Traceback (most recent call last):
  File "../scripts/thermoEstimator.py", line 103, in <module>
    run_thermo_estimator(input_file, args.library)
  File "../scripts/thermoEstimator.py", line 70, in run_thermo_estimator
    submit(species)
  File "/Users/jennifergiaccai/Documents/gradschool/PAHholdingcell/RMG/RMG-Py/rmgpy/thermo/thermoengine.py", line 174, in submit
    spc.thermo = evaluator(spc, solvent_name=solvent_name)
  File "/Users/jennifergiaccai/Documents/gradschool/PAHholdingcell/RMG/RMG-Py/rmgpy/thermo/thermoengine.py", line 159, in evaluator
    thermo = generate_thermo_data(spc, solvent_name=solvent_name)
  File "/Users/jennifergiaccai/Documents/gradschool/PAHholdingcell/RMG/RMG-Py/rmgpy/thermo/thermoengine.py", line 124, in generate_thermo_data
    thermo0 = thermodb.get_thermo_data(spc)
  File "/Users/jennifergiaccai/Documents/gradschool/PAHholdingcell/RMG/RMG-Py/rmgpy/data/thermo.py", line 1319, in get_thermo_data
    ml_settings)
  File "/Users/jennifergiaccai/Documents/gradschool/PAHholdingcell/RMG/RMG-Py/rmgpy/data/thermo.py", line 1760, in get_thermo_data_from_ml
    thermo0 = ml_estimator.get_thermo_data_for_species(species)
  File "/Users/jennifergiaccai/Documents/gradschool/PAHholdingcell/RMG/RMG-Py/rmgpy/ml/estimator.py", line 111, in get_thermo_data_for_species
    return self.get_thermo_data(species.molecule[0])
  File "/Users/jennifergiaccai/Documents/gradschool/PAHholdingcell/RMG/RMG-Py/rmgpy/ml/estimator.py", line 79, in get_thermo_data
    hf298 = self.hf298_estimator(molecule.smiles)[0][0]
  File "/Users/jennifergiaccai/Documents/gradschool/PAHholdingcell/RMG/RMG-Py/rmgpy/ml/estimator.py", line 148, in estimator
    [chemprop.data.MoleculeDatapoint(line=[smi], args=args)]
  File "/anaconda3/envs/rmg_env/lib/python3.7/site-packages/chemprop-0.0.1-py3.7.egg/chemprop/data/data.py", line 48, in __init__
  File "/anaconda3/envs/rmg_env/lib/python3.7/site-packages/chemprop-0.0.1-py3.7.egg/chemprop/mol_utils.py", line 19, in str_to_mol
rdkit.Chem.rdchem.KekulizeException: Can't kekulize mol.
Unkekulized atoms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

How To Reproduce

I have been able to use ML with some sets of submitted molecules, so my guess is that the failure is related to a molecule that cannot be kekulized. I'm working with a set of potential large PAH molecules that were generated with another program. That program doesn't specify single and double bond locations, which may lead to PAHs that are chemically unstable.
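
One way to narrow this down might be to pre-screen the input SMILES before handing them to the thermo estimator. The following is a minimal sketch, not part of RMG-Py: the helper names `rdkit_parse` and `screen_smiles` are hypothetical, and it relies only on the fact that RDKit's `Chem.MolFromSmiles` returns `None` when a string cannot be sanitized (which includes kekulization failures).

```python
from typing import Callable, Iterable, List, Tuple


def rdkit_parse(smiles: str):
    """Parse a SMILES string with RDKit.

    Chem.MolFromSmiles returns None when sanitization (which includes
    kekulization) fails. RDKit is imported lazily so that the screening
    logic below can also be exercised without it.
    """
    from rdkit import Chem
    return Chem.MolFromSmiles(smiles)


def screen_smiles(
    smiles_list: Iterable[str],
    parse: Callable[[str], object] = rdkit_parse,
) -> Tuple[List[str], List[str]]:
    """Split SMILES into (parsable, failed) lists, preserving input order.

    `parse` is any callable that returns None or raises on failure.
    """
    ok: List[str] = []
    failed: List[str] = []
    for smi in smiles_list:
        try:
            mol = parse(smi)
        except Exception:
            # Treat any parser error the same way as a None result.
            mol = None
        if mol is None:
            failed.append(smi)
        else:
            ok.append(smi)
    return ok, failed
```

Calling `screen_smiles(my_smiles)` with RDKit installed would return the strings RDKit accepts and the ones it rejects, which should identify the molecules that need their bond orders fixed before estimation.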

Expected Behavior

If ML cannot estimate thermo properties for a molecule, I would have expected it to skip that molecule and still produce a library with the molecules that succeed, or at least to give an error message stating that ML cannot handle a molecule that cannot be kekulized. It would also be helpful to know which molecule could not be kekulized.
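
The skip-and-report behavior described here could look roughly like the sketch below. It is illustrative only and does not use RMG-Py's API: `estimate_all` is a hypothetical helper, and `estimate` stands in for whatever function returns thermo data for one SMILES string.

```python
import logging
from typing import Callable, Dict, Iterable, List, Tuple


def estimate_all(
    species: Iterable[Tuple[str, str]],
    estimate: Callable[[str], object],
) -> Tuple[Dict[str, object], List[Tuple[str, str, str]]]:
    """Run `estimate` on every (label, smiles) pair without aborting on errors.

    Returns the successful results keyed by label, plus a list of
    (label, smiles, error message) entries for the species that were skipped.
    """
    results: Dict[str, object] = {}
    skipped: List[Tuple[str, str, str]] = []
    for label, smiles in species:
        try:
            results[label] = estimate(smiles)
        except Exception as exc:
            # Name the offending species so it can be fixed or removed later.
            logging.warning("Skipping %s (%s): %s", label, smiles, exc)
            skipped.append((label, smiles, str(exc)))
    return results, skipped
```

With this pattern a single failing molecule is logged by label and SMILES, and the remaining results are still available for writing to a library.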

Installation Information

Describe your installation method and system information.

kspieks commented 4 years ago

Hi Jennifer,

Sorry for the delay. I haven't used the ML estimator yet, but I'll try to help. Could you try replacing lines 1753-1760 in RMG-Py/rmgpy/data/thermo.py with the following? I believe it should get closer to the expected behavior: it prints the species that caused the error and does not crash the script.

try:
    if molecule.is_radical():
        # Radicals: apply HBI corrections on top of the ML estimate for each resonance structure.
        thermo = [self.estimate_radical_thermo_via_hbi(mol, ml_estimator.get_thermo_data) for mol in species.molecule]
        # Reorder the resonance structures by H298 so the most stable one comes first.
        H298 = np.array([tdata.H298.value_si for tdata in thermo])
        indices = H298.argsort()
        species.molecule = [species.molecule[ind] for ind in indices]
        thermo0 = thermo[indices[0]]
    else:
        thermo0 = ml_estimator.get_thermo_data_for_species(species)
except Exception as e:
    # If RDKit raises a KekulizeException (or anything else fails), print the error and the species label, then return None.
    print(f'\n\nError: Could not obtain thermo for {species.label} due to the following error: \n{e}\n\n')
    return None

github-actions[bot] commented 1 year ago

This issue is being automatically marked as stale because it has not received any interaction in the last 90 days. Please leave a comment if this is still a relevant issue, otherwise it will automatically be closed in 30 days.