Open bionicles opened 5 years ago
The code expects that it will be provided atom-mapped reaction SMILES where all atoms in the product molecule(s) have unique atom map numbers that are present in the reactants. The code will break if there are duplicated atom map numbers (e.g., indicating reactant stoichiometry of two by using those map numbers in the product multiple times) as opposed to duplicated reactants (e.g., indicating reactant stoichiometry by providing two explicit copies of the reactants using different numbering).
This does not mean that the reactions are fully balanced. The datasets we work with rarely contain information about byproducts (e.g., salts, water) and so the code does not expect them. The parent data source can be found here. There is no open-source reaction dataset with truly balanced reactions. This paper uses a proprietary set that required hand curation.
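For anyone who wants to screen their own data against the expectation described above, here is a minimal sketch in plain Python. It is not part of this repository: `atom_maps` and `check_mapping` are hypothetical helper names, and it assumes each record is a bare `reactants>agents>products` reaction SMILES string. It only inspects atom map numbers with a regular expression, so it will not notice product atoms that carry no map number at all.

```python
import re

# Matches the atom map number at the end of a bracket atom, e.g. ":12]" in "[CH3:12]".
MAP_RE = re.compile(r":(\d+)\]")


def atom_maps(smiles):
    """Return every atom map number found in a SMILES string, in order."""
    return [int(m) for m in MAP_RE.findall(smiles)]


def check_mapping(rxn_smiles):
    """Report product map numbers that are duplicated or absent from the reactants.

    Assumes the input is "reactants>agents>products" with nothing appended.
    """
    reactants, _agents, products = rxn_smiles.split(">")
    reactant_maps = set(atom_maps(reactants))
    product_maps = atom_maps(products)
    duplicated = sorted({m for m in product_maps if product_maps.count(m) > 1})
    missing = sorted(set(product_maps) - reactant_maps)
    return {"duplicated": duplicated, "missing": missing}


if __name__ == "__main__":
    rxn = (
        "[CH3:1][C:2](=[O:3])[OH:4].[OH:5][CH2:6][CH3:7]"
        ">>[CH3:1][C:2](=[O:3])[O:5][CH2:6][CH3:7]"
    )
    print(check_mapping(rxn))
```

A reaction that passes has both lists empty. A non-empty `duplicated` list corresponds to the stoichiometry-by-reused-map-number case that the code cannot handle.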
Which code was used for atom mapping to prepare the dataset?
The dataset was previously mapped using Indigo.
If the reactions were encoded as balanced, atom-matched PDB files, simulating them with NNs would be trivial, because you could plug them into the exact same pipeline used for proteins. Otherwise you need to either balance the reactions (less ideal, honestly) or figure out a loss function that handles unbalanced / unmatched inputs and outputs (super valuable if done abstractly).
We got pretty bogged down using DNNs for atomistic simulations of chemistry because of permutation-invariance issues, and decided to move on to more "boring" stuff (data security). For future reference, for anyone interested: it could possibly work with a Fused Gromov-Wasserstein (FGW) loss function on the neural network. The idea is to compare unmatched molecule graphs using transport theory, adding the "feature distance" between the atoms' features as another transport cost to minimize.
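For reference, a sketch of the Fused Gromov-Wasserstein objective as I understand it from the Vayer et al. formulation. The notation here is my own assumption and not something from this repository: $a_i$ and $b_j$ are atom feature vectors of the two molecules, $C_1$ and $C_2$ are their intra-molecule structure matrices (e.g. graph or spatial distances), $h$ and $g$ are the atom weights, $d$ is a feature distance, and $\alpha \in [0, 1]$ trades off the feature cost against the structure cost.

$$
\mathrm{FGW}_{q,\alpha} = \min_{\pi \in \Pi(h, g)} \sum_{i,j,k,l} \Big[ (1-\alpha)\, d(a_i, b_j)^q + \alpha\, \big| C_1(i,k) - C_2(j,l) \big|^q \Big]\, \pi_{ij}\, \pi_{kl}
$$

Here $\Pi(h, g)$ is the set of couplings whose marginals are $h$ and $g$. Because the minimum is taken over soft atom-to-atom assignments $\pi$, no fixed atom ordering or matching between the two molecules is needed.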
There is a critical need for alignment-free, permutation-invariant loss functions for neural networks that move atoms (ATOM MOVER DISTANCE). Some past work uses local kernels for this, which is neat; it would be cool to try multiscale kernels or FGW. If such a loss function existed, you could train one neural network to do quantum physics, organic chemistry, and biomolecular engineering. Not sure if this is worth it in the era of systems and synthetic biology (why make a molecule when you can make a circuit?), but it would be cool to look at. I'd like to circle back to this in a few months and use PyTorch + Python Optimal Transport if anyone's keen. I already made the OpenAI Gym env for PyMOL, but TensorFlow 2.0 wasn't as fun as I hoped.
https://tvayer.github.io/materials/Titouan_Marseille_2019.pdf
thanks for your work, feel free to keep this open or close it
Just curious... how could we go through the data file and balance these reactions to make sure there are equal numbers of atoms on both sides?
I want to make simulations of them, but it's hard if they're imbalanced
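One rough way to at least find the imbalanced records is to count atoms per element on each side. The sketch below is only that: `heavy_atom_counts` and `imbalance` are hypothetical names, it assumes plain `reactants>agents>products` reaction SMILES, it tokenizes with regular expressions rather than a real SMILES parser, and it ignores hydrogens and the agents field entirely.

```python
import re
from collections import Counter

# A bracket atom such as "[CH3:1]", or a bare organic-subset atom (aromatic ones included).
ATOM_RE = re.compile(r"\[([^\]]+)\]|(Br|Cl|[BCNOPSFI]|[bcnops])")
# Inside a bracket: optional isotope digits, then the element symbol.
SYMBOL_RE = re.compile(r"\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms per element in a SMILES string (regex-based)."""
    counts = Counter()
    for bracket, bare in ATOM_RE.findall(smiles):
        if bracket:
            match = SYMBOL_RE.match(bracket)
            if match is None:  # e.g. a wildcard atom like "[*:1]"
                continue
            symbol = match.group(1)
        else:
            symbol = bare
        symbol = symbol.capitalize()  # fold aromatic "c" into "C", etc.
        if symbol != "H":
            counts[symbol] += 1
    return counts


def imbalance(rxn_smiles):
    """Return {element: reactant_count - product_count} for elements that differ."""
    reactants, _agents, products = rxn_smiles.split(">")
    left = heavy_atom_counts(reactants)
    right = heavy_atom_counts(products)
    return {
        el: left[el] - right[el]
        for el in set(left) | set(right)
        if left[el] != right[el]
    }


if __name__ == "__main__":
    # Esterification written without the water byproduct.
    print(imbalance("CC(=O)O.OCC>>CC(=O)OCC"))
```

A positive value means atoms on the reactant side that never show up in the product, such as the oxygen lost as water in the example. An empty dict means the heavy atoms balance. This only flags imbalance; it does not tell you which byproduct to add.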
Many thanks, bionicles