NREL / alfabet

Machine learning predictions of bond dissociation energy
https://ml.nrel.gov/bde/

[NEED IMPROVEMENT] alfabet.model.predict is too slow #2

Closed IchiruTake closed 2 years ago

IchiruTake commented 3 years ago

I have used your source code to build some BDE predictions, but it takes too long, especially on a large dataset. The main reason, I believe, is that the pipeline rebuilds everything from scratch instead of caching and reusing what it has already loaded. A few suggestions:

  • Build the molecules once up front and extract what you need from them as data, instead of rebuilding them multiple times.
  • Initialize the numpy arrays at full size and then fill them with data, instead of calling np.repeat, np.stack, np.concatenate, pd.stack, etc. repeatedly (see the sketch below).
  • If you need to remove duplicates, retrieve the molecule index and use it to restrict the check to radicals within the same molecule, which is very effective: the time complexity is then at most O(N*k), where N is the number of rows and k is the number of BDEs that can be predicted.
  • You could build a method such as predictByDefinedFile, where the molecules, radicals, and bond_index are already set up, to minimize recomputation.
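
A minimal sketch of the preallocation idea (hypothetical shapes and a stand-in `blocks` list; this is not alfabet's actual code):

    import numpy as np

    # Stand-in for per-molecule feature arrays of varying length.
    rng = np.random.default_rng(0)
    blocks = [rng.random((int(rng.integers(5, 20)), 8)) for _ in range(1000)]
    n_rows = sum(len(b) for b in blocks)

    # Slow pattern: repeated np.concatenate copies the accumulated
    # array on every iteration (quadratic total work).
    out_slow = np.empty((0, 8))
    for b in blocks:
        out_slow = np.concatenate([out_slow, b])

    # Faster pattern: allocate the full array once, then fill slices in place.
    out_fast = np.empty((n_rows, 8))
    row = 0
    for b in blocks:
        out_fast[row:row + len(b)] = b
        row += len(b)

    assert np.array_equal(out_slow, out_fast)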

IchiruTake commented 3 years ago

I have tested it on 290k unique BDEs on a local machine (drop_duplicates=True), and it took 90 minutes to return results with 12 cores on an i7-8850H.

pstjohn commented 2 years ago

Sorry for not responding here -- hopefully the edits as of 0.3.1 should make prediction much faster for a large set of inputs. It now allows you to run batches and, if TensorFlow recognizes a GPU, will run them on the GPU.

The slow step is likely the molecule preprocessing, though, which is unfortunately difficult to speed up for 290k molecules.
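
A sketch of what that batched usage might look like (the batch_size keyword is an assumption based on the batching support described above; check the released API for the exact signature):

    from alfabet import model

    smiles_list = ["CCO", "CC(=O)O", "c1ccccc1"]

    # batch_size here is assumed, not confirmed against the 0.3.1 API.
    preds = model.predict(smiles_list, batch_size=128)
    print(preds.head())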

IchiruTake commented 2 years ago

I don't think the prediction itself is slow. The version I used here is 0.0.5 (TF 1.14, Keras 2.3.1, NFP 0.0.9, h5py 2.10.0), the same as in your original paper. Unfortunately, generating the two fragments and the drop_duplicates step in model.py are too costly, and across the several versions I have tested this part cannot be sped up. For reference, here is the relevant code from the main branch versus version 0.0.5:

    # Main branch
    pred_df = pd.concat((get_fragments(smiles) for smiles in smiles_list))
    if drop_duplicates:
        pred_df = pred_df.drop_duplicates(["fragment1", "fragment2"]).reset_index(
            drop=True
        )

    # Version 0.0.5
    # Separately fragment the molecules to find their valid bonds
    frag_results = Parallel(n_jobs=n_jobs, verbose=10 if verbose else 0)(
        delayed(get_fragments)(smiles) for smiles in smiles_list)
    frag_df = pd.concat(frag_results)

    pred_df = frag_df.merge(bde_df, on=['molecule', 'bond_index'],
                            how='left')

    if drop_duplicates:
        pred_df = pred_df.drop_duplicates([
            'fragment1', 'fragment2']).reset_index()

    return pred_df.drop('index', 1)
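
A sketch of one way to push the deduplication earlier, in line with the discussion later in this thread (hypothetical; `get_fragments` as used in the snippets above):

    import pandas as pd

    def predict_fragments(smiles_list, drop_duplicates=True):
        # Drop duplicate fragment pairs per molecule before concatenating,
        # so the combined frame never carries the within-molecule duplicates.
        frames = []
        for smiles in smiles_list:
            df = get_fragments(smiles)  # as in the snippets above
            if drop_duplicates:
                df = df.drop_duplicates(["fragment1", "fragment2"])
            frames.append(df)
        pred_df = pd.concat(frames, ignore_index=True)
        if drop_duplicates:
            # A global pass is still needed for duplicates across molecules.
            pred_df = pred_df.drop_duplicates(["fragment1", "fragment2"])
        return pred_df.reset_index(drop=True)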

> Sorry for not responding here -- hopefully the edits as of 0.3.1 should make prediction much faster for a large set of inputs. It now allows you to run batches and, if TensorFlow recognizes a GPU, will run them on the GPU.
>
> The slow step is likely the molecule preprocessing, though, which is unfortunately difficult to speed up for 290k molecules.

IchiruTake commented 2 years ago

If you want, I am free and willing to rewrite (refactor, to be exact) the whole module to speed up the process, but that is a bigger undertaking, so I'll leave the decision to you.

P.S.: One more thing I want to mention: the paper is great, and it seems you have since improved the model's predictions, but because the training cost is so large, I cannot train a local version myself. I hope you can improve the overall architecture of ALFABET.

P.S.: A second note concerns outlier interpretation for ring-attached bonds (bonds whose connected atom is part of a ring/cycle, whether aromatic or not). The C-H bonds of 5-methoxyfuran-2-carboxylic acid (indices 13 and 14) share a similar representation in ALFABET, yet their predictions differ by over 20 kcal/mol. I am not sure whether this is due to an error in the DFT functional or something else, and I am not completely sure whether this molecule belongs to the training set or the validation/test set.

Related to ring-attached bonds, we have seen a large performance gap between ring-attached bonds and their non-ring counterparts. This should be acknowledged when applying ALFABET, although I generally believe the presence of the ring simply makes the prediction harder.
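
For concreteness, one way to flag ring-attached bonds as defined above, using RDKit (a standalone sketch, independent of ALFABET's internals):

    from rdkit import Chem

    def ring_attached_bond_indices(smiles):
        # Bonds where at least one end atom belongs to a ring
        # (aromatic or not), per the definition above.
        mol = Chem.MolFromSmiles(smiles)
        return [
            bond.GetIdx()
            for bond in mol.GetBonds()
            if bond.GetBeginAtom().IsInRing() or bond.GetEndAtom().IsInRing()
        ]

    # 5-methoxyfuran-2-carboxylic acid
    print(ring_attached_bond_indices("COc1ccc(C(=O)O)o1"))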

Note: DFT refers to the 290k DFT dataset; Exp refers to all the experimental BDEs I could find in your Supplementary Materials (ignoring the ~360 iBond BDEs and two erroneous bonds in the ZINC dataset (543)). The accuracy marked * is chemical accuracy (< 1 kcal/mol) for the DFT dataset and 2.5 kcal/mol for the Exp dataset.

[image: accuracy table for the DFT and Exp datasets]

I can give you the Exp-merged dataset for further validation.

pstjohn commented 2 years ago

You're certainly welcome to send a pull request! A couple of comments, though.

IchiruTake commented 2 years ago

Thank you. Although I am not very sure about NFP, I will try to improve the speed of the two radical generations. This issue will be moved to a PR; feel free to review it afterward.

IchiruTake commented 2 years ago

> You're certainly welcome to send a pull request! A couple of comments, though.

> • For the speed of fragment.py, it's certainly likely that it could be made faster. Our priority there was accuracy; I've had odd cases where, if I don't convert to/from SMILES strings, atoms get lost or I can't get a canonical representation of the fragment.

Thank you very much. Please check PR #9. It is not done yet, but it should lay the foundation for replacing fragment.py. That said, if you need to keep relying on the legacy fragment.py, it will be hard for me to refactor it.

> • We've primarily moved to a TensorFlow 2.x version of the model, which changed a lot of the underlying GNN layers. Just a heads up in case you're still using a TensorFlow 1.x version.

Thank you for your response. Regarding the change in the GNN layers: does that mean the code changed while the underlying mechanism stayed the same, or did the mechanism underneath change as well?

> • It does make sense to drop duplicates before stacking all the dataframes together. I haven't really used this library for that many bonds before, so that step was always fast compared to the preprocessing or the TensorFlow forward passes.

Yes, you are right that for small molecule batches the execution time is already acceptable, but for reducing memory overhead and speeding things up at scale, this is a bottleneck. Take the use of pd.Series when computing the stereo_centers: switching to a plain Python dict instead of casting to pd.Series made that step 1100-1250 times faster (with a ~100-byte memory reduction over 100k BDEs), or about one minute faster over the whole dataset.
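
A toy benchmark illustrating the kind of change described (hypothetical data; the exact speedup will vary with the real workload):

    import time

    import pandas as pd

    # Stand-in for per-atom stereo-center counts keyed by atom index.
    values = {i: i % 3 for i in range(100_000)}

    # Casting to pd.Series just to do lookups pays the construction
    # cost of the Series on every call.
    t0 = time.perf_counter()
    for _ in range(100):
        s = pd.Series(values)
        _ = s[0]
    series_time = time.perf_counter() - t0

    # A plain dict lookup avoids that overhead entirely.
    t0 = time.perf_counter()
    for _ in range(100):
        _ = values[0]
    dict_time = time.perf_counter() - t0

    print(f"pd.Series: {series_time:.3f}s   dict: {dict_time:.6f}s")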


> • We're working on estimating bond strengths for ring-opening reactions, but the DFT treatment is more difficult. It doesn't surprise me that the error is slightly higher for ring-adjacent bonds.

In fact, I am attempting similar research on BDE prediction, using your dataset in my thesis (with proper citation). The goal is a model that does not rely on a GNN architecture, trains in only 30 minutes on an NVIDIA Quadro P1000 (over 160 times more resource-efficient than ALFABET), and reaches at least chemical accuracy, with three different approaches to encoding the input. (I am not proficient in chemistry; I majored in computer science at an international university in HCMC, Vietnam.) In my thesis, I am not trying to compete with SOTA accuracy, but to achieve acceptable accuracy from scratch with better reliability (fewer outliers), since the target reference is not experimental but the computational DFT approximation. My aim is to balance all of these criteria so that newcomers, chemistry scholars, and ML practitioners alike can train and customize their own network.

I have three questions:

  1. Is it OK for me to use your DFT-based homolysis dataset (290k) in my thesis, with proper citation?
  2. Do we care about model training efficiency (time, memory, etc.) in this field? To my understanding, the chemical space is expanding quickly while prime (high-quality) data remains relatively scarce. I also took a look at BonDNet, and it seems their PubChem version (built on your data) is even costlier, based on my intuition.
  3. Does my attempt have enough basis and motivation to at least be applied in practice?

Thank you for your response.

Ichiru

pstjohn commented 2 years ago

Sorry for the slow reply, I'll look over the PR now. To your questions:

  1. Definitely! Let me know if there's any data you can't find publicly available, or if you have any questions on it
  2. I suppose, but the vast majority of the computational cost for these projects is typically in generating the training data itself with DFT
  3. Sure, I think outliers are hard to avoid -- some can come from DFT convergence, others from odd geometries that the ML model fails to recognize. But you're certainly welcome to try alternative prediction methods

IchiruTake commented 2 years ago

> Sorry for the slow reply, I'll look over the PR now. To your questions:
>
>   1. Definitely! Let me know if there's any data you can't find publicly available, or if you have any questions on it
>   2. I suppose, but the vast majority of the computational cost for these projects is typically in generating the training data itself with DFT
>   3. Sure, I think outliers are hard to avoid -- some can come from DFT convergence, others from odd geometries that the ML model fails to recognize. But you're certainly welcome to try alternative prediction methods

Thank you for your response. In reply:

  1. The publicly available data I can find is already enough for a complete comparison between the three models: PubChem-BonDNet, legacy ALFABET, and mine. Obviously more data could lead to a better model, but it could also make the comparison unfair.
  2. I probably agree, though to my knowledge Gaussian 16 introduced GPU support with reported speedups of 40% to 200-250%, which is great. I hope later versions improve this further and support more GPU architectures, although I don't have the application available to test.
  3. Currently I cannot find any obvious way to detect outliers without a reference. The usual check I do is to compare against any available BDEs for the molecule in its simplest form, and to run ML-feature and QSAR analyses to see whether the outliers can be mitigated. The model I am building is just an alternative way to mitigate them, or at least to make chemists aware that these bonds have problems under the hood.