connorcoley / rexgen_direct

Template-free prediction of organic reaction outcomes
GNU General Public License v3.0

How do you balance reaction stoichiometry? #6

Open bionicles opened 5 years ago

bionicles commented 5 years ago

Just curious... how could we go through the data file and balance these reactions to make sure there are equal numbers of atoms on both sides?

I want to make simulations of them, but it's hard if they're imbalanced

Many thanks, bionicles

connorcoley commented 5 years ago

The code expects that it will be provided atom-mapped reaction SMILES where all atoms in the product molecule(s) have unique atom map numbers that are present in the reactants. The code will break if there are duplicated atom map numbers (e.g., indicating reactant stoichiometry of two by using those map numbers in the product multiple times) as opposed to duplicated reactants (e.g., indicating reactant stoichiometry by providing two explicit copies of the reactants using different numbering).
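The mapping requirement described above can be checked before feeding data to the model. The following is only a minimal sketch, not code from this repository: the function names `atom_maps` and `check_mapping` are made up for illustration, and it assumes each reaction is a `reactants>agents>products` SMILES string with map numbers written as `:N]` inside bracket atoms.

```python
import re
from collections import Counter

# Atom map numbers appear as ":N]" at the end of a bracket atom, e.g. [CH3:1].
MAP_RE = re.compile(r":(\d+)\]")


def atom_maps(smiles):
    """Return every atom map number found in a SMILES string (hypothetical helper)."""
    return [int(n) for n in MAP_RE.findall(smiles)]


def check_mapping(rxn):
    """Return (duplicated, unmatched) product map numbers for one reaction SMILES.

    duplicated: map numbers used more than once in the products
    unmatched:  product map numbers that never appear in the reactants
    """
    reactants, _agents, products = rxn.split(">")
    product_maps = Counter(atom_maps(products))
    reactant_maps = set(atom_maps(reactants))
    duplicated = sorted(n for n, count in product_maps.items() if count > 1)
    unmatched = sorted(n for n in product_maps if n not in reactant_maps)
    return duplicated, unmatched


# Esterification with unique map numbers: both lists come back empty.
ok = ("[CH3:1][C:2](=[O:3])[OH:4].[OH:5][CH2:6][CH3:7]"
      ">>[CH3:1][C:2](=[O:3])[O:5][CH2:6][CH3:7]")
print(check_mapping(ok))
```

This sketch does not detect product atoms that carry no map number at all; a chemistry toolkit would be needed for that check.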

This does not mean that the reactions are fully balanced. The datasets we work with rarely contain information about byproducts (e.g., salts, water) and so the code does not expect them. The parent data source can be found here. There is no open-source reaction dataset with truly balanced reactions. This paper uses a proprietary set that required hand curation.
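To see which atoms go missing when byproducts are not recorded, one rough option is to compare heavy-atom counts on each side of the reaction. The sketch below is an illustration under stated assumptions, not part of this repository: `heavy_atom_counts` and `missing_heavy_atoms` are hypothetical names, the regex tokenizer only covers common SMILES atoms, and hydrogens are ignored entirely, so it cannot prove that a reaction is fully balanced.

```python
import re
from collections import Counter

# A bracket atom, a two-letter halogen, a one-letter aliphatic atom,
# or a one-letter aromatic atom (organic subset only).
ATOM_RE = re.compile(r"\[([^\]]+)\]|(Br|Cl)|([BCNOPSFI])|([bcnops])")
# Inside brackets: optional isotope digits, then the element symbol.
BRACKET_RE = re.compile(r"\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms per element in a SMILES string (rough sketch)."""
    counts = Counter()
    for match in ATOM_RE.finditer(smiles):
        bracket, halogen, aliphatic, aromatic = match.groups()
        if bracket is not None:
            symbol = BRACKET_RE.match(bracket)
            if symbol is None:  # e.g. a wildcard atom such as [*:1]
                continue
            element = symbol.group(1).capitalize()
        else:
            element = (halogen or aliphatic or aromatic).capitalize()
        if element != "H":
            counts[element] += 1
    return counts


def missing_heavy_atoms(rxn):
    """Heavy atoms present in the reactants but absent from the products."""
    reactants, _agents, products = rxn.split(">")
    return dict(heavy_atom_counts(reactants) - heavy_atom_counts(products))


# Esterification recorded without its water byproduct: one oxygen is unaccounted for.
rxn = ("[CH3:1][C:2](=[O:3])[OH:4].[OH:5][CH2:6][CH3:7]"
       ">>[CH3:1][C:2](=[O:3])[O:5][CH2:6][CH3:7]")
print(missing_heavy_atoms(rxn))  # {'O': 1}
```

The output only says which elements are unaccounted for; deciding what the actual byproduct was (water here) still takes chemical judgment or curation.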

amrhamedp commented 4 years ago

Which code was used for atom mapping to prepare the dataset?

connorcoley commented 4 years ago

The dataset was previously mapped using Indigo.


bionicles commented 4 years ago

If reactions are encoded as balanced, atom-matched PDB files, then simulating them with NNs is trivial, because you can plug them into exactly the same pipeline used for proteins. Otherwise you need either to balance the reactions (less ideal, honestly) or to figure out a loss function that handles unbalanced or unmatched inputs and outputs (super valuable if done abstractly).

We got pretty bogged down using DNNs for atomistic simulations of chemistry because of permutation invariance issues, and decided to move on to more "boring" stuff (data security). For future reference to anyone interested: it could possibly work with a Fused Gromov-Wasserstein (FGW) loss function on the neural network. The idea is to compare unmatched molecule graphs with optimal transport theory, adding the "feature distance" between atom features as another transport cost to minimize.
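For reference, the Fused Gromov-Wasserstein objective is usually written roughly as below. The symbols are chosen for this sketch and are not taken from the thread: $a_i$ and $b_j$ are atom feature vectors of the two molecule graphs, $C_1$ and $C_2$ are their intra-graph structure matrices (for example shortest-path or interatomic distances), $h$ and $g$ are the node weights, and $\pi$ is the transport plan between the two atom sets.

```latex
\mathrm{FGW}_{q,\alpha}(\mu,\nu)
  = \min_{\pi \in \Pi(h,g)} \sum_{i,j,k,l}
    \Big[ (1-\alpha)\, d(a_i, b_j)^q
        + \alpha\, \big| C_1(i,k) - C_2(j,l) \big|^q \Big]\,
    \pi_{ij}\, \pi_{kl}
```

The trade-off parameter $\alpha \in [0,1]$ weights structure against features: $\alpha = 0$ reduces to a Wasserstein distance on the atom features alone, and $\alpha = 1$ reduces to the Gromov-Wasserstein distance on the graph structures alone.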

There is a critical need for alignment-free, permutation-invariant loss functions for neural networks that move atoms (ATOM MOVER DISTANCE). Some past work uses local kernels for this, which is neat; it would be cool to try multiscale kernels or FGW. If such a loss function existed, you could train one neural network to do quantum physics, organic chemistry, and biomolecular engineering. Not sure if this is worth it in the era of systems and synthetic biology (why make a molecule when you can make a circuit?), but it would be cool to look at. I'd like to circle back to this in a few months and use PyTorch + Python Optimal Transport if anyone's keen. I already made the OpenAI Gym env for PyMOL, but TensorFlow 2.0 wasn't as fun as I hoped.
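To make "permutation invariant" concrete, here is a toy sketch of a loss that does not care about atom ordering. It is an assumption-laden illustration, not FGW and not an existing library function: `matching_loss` is a hypothetical name, it brute-forces every pairing (so it is only usable for a handful of atoms), it requires equally sized atom sets, and it is not invariant to rotation or translation.

```python
import math
from itertools import permutations


def matching_loss(pred, target):
    """Smallest mean Euclidean distance over all pairings of pred and target atoms.

    Brute force over permutations: O(n!) and only meant as a toy illustration.
    """
    if len(pred) != len(target):
        raise ValueError("this sketch only handles equally sized atom sets")
    if not pred:
        return 0.0
    best = min(
        sum(math.dist(p, t) for p, t in zip(pred, perm))
        for perm in permutations(target)
    )
    return best / len(pred)


target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
# Same atoms listed in a different order: the loss is still zero.
reordered = [target[2], target[0], target[1]]
print(matching_loss(reordered, target))  # 0.0
```

A practical version would replace the brute-force search with an optimal transport or assignment solver, which is where a library such as Python Optimal Transport would come in.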

https://tvayer.github.io/materials/Titouan_Marseille_2019.pdf

Thanks for your work. Feel free to keep this open or close it.