Parameters relevant for speed in mutate_mol

andrewdchen commented 3 years ago

Hi! Thanks for maintaining this repo! I was wondering which parameters were relevant for speeding up mutate_mol? I've found better speeds by increasing n_cores but if I wanted to speed it up even more would decreasing radius help too?

DrrDom commented 3 years ago

Hi! It depends what is the context of "speed up".

I tried to make it fast but there is definitely space for improvement. However, each next improvement of speed require more efforts :)
The only way how to truly increase the overall speed without code modifications is to increase the number of cores.
But the speed can be increased by decreasing the number of replacements. If this is acceptable you may: a) set max_replacements to 100 or 1000 (it will choose random replacements from all available) b) set min_freq to 5, 10 or more (to use only more frequently occurred replacements) c) increase the radius (but this will make more conservative replacements) d) play with parameters controlling the size of replaced and replacing fragments (decreasing fragment size you will make smaller steps in chemical space) e) use a smaller fragment database I think that (a) is the most reasonable from this list followed by (b). (c)-(e) choices depend on a project.

andrewdchen commented 3 years ago

Thanks! Feel free to close this issue. But on a related note what work do you think needs to be done to speed it up more? Would it be possible to leverage GPUs? Feel free to email me, the team I am working with would love to help!

DrrDom commented 3 years ago

I have to profile code to remember all details and pitfalls. I'll post those results here to enable a more objective discussion.

What I remember is that I implemented a brute force replacement procedure which makes multiple replacements leading to the same compound. Therefore, duplicates are generated and filtered internally. This is definitely waste of time. But I do not remember the proportion of duplicates and I do not know how to easily avoid this. For example, we have a compound X-CH2-CH3 and make replacement CH3>>NH2 and CH2-CH3>>CH2-NH2. They both will give X-CH2-NH2.

I'm not sure that GPU will help much, because the code extensively uses RDKit and there is no GPU support in RDKit (https://github.com/rdkit/rdkit/issues/3537) and separating GPU-friendly code seems too difficult.

From time to time I thought about substantial refactoring and change of the architecture and algorithms. But this will require a lot of efforts and may result in a new implementation :)

DrrDom commented 3 years ago

I put here results of tests and some conclusions. I currently do not consider speed up as a priority issue because we use relatively slow scorings and structure generation is not a limitation step. But for fast scorings this can be already an issue and if this can be improved by low cost we can do that.

The amount of duplicated generated structures is up to 65% per molecule that is too much that I expected.

The major function consuming 90% of runtime is __frag_replace. There are four most time consuming blocks (sorry, I just noticed that I run code on local crem version, so line numbers should be reduced on 13):

set_protected_atoms (line 212) - 25% of time
RunReactants (line 224) - apply reaction SMARTS - 10%
SanitizeMol (line 227) - it never returned errors but I think it is required to recalculate internal states of Mol object to avoid errors in future uses (do not remember exactly) - 10%
Chem.MolToSmiles(Chem.RemoveHs(p), isomericSmiles=True) (line 232) - convert to smiles to identify duplicates - 40%

Conversion to smiles takes most of the time. I expect that RemoveHs is also expensive because create a new Mol object. This step can be replaced with inchi to identify duplicates (could be faster and no need to address the issue of explicit hydrogens). But as a main output of all functions I use smiles and their generation is inevitable or we need to change the output of all functions to return not smiles but Mol objects and transfer responsibility to manage explicit hydrogens on a user. Changing the interface seems not a good idea but if this will speed up things greatly we can do that. Use SMILES as a major output probably was not a great idea.

set_protected_atoms looks quite complex and may be simplified to speed up.

remove sanitization step is questionable, because again this transfers responsibility to a user to manage generated Mol objects and solve issues with them.

Concluding remarks:

avoid generation of duplicates can substantially reduce runtime. I have no solution currently.
I would like to avoid changes in API to not break compatibility but if this will greatly speed up things I'm ready to do that.
I would not like to shift responsibility to a user to manage internal states of generated molecules and catch other errors. Just dislike to write repeatedly the same code and remember of these potential issues.

Alternative "solution". I also considered to completely refuse from generation of compounds through RDKit reactions and make replacements manually because ids of removing atoms are known and we need to take the remaining part and link new atoms. This will completely remove the first two points from the list (this may speed up things but not necessary), but sanitization and generation of smiles will probably remain. This will also require to change other parts of code (at least __fragment_mol to get ids of attachment points of the context and a whole remaining part).

Any suggestions/ideas are welcome.

DrrDom commented 3 years ago

Investigating issue with duplicates I found and fixed the bug of generation of duplicated fragments before their mutation. All fragments with 1 or 2 attachment points were duplicated, thus twice number of replacements were performed and they resulted in the same molecules. This fix gave x2 speed up. Now the number of duplicates in generated molecules is from 0 to 32% on test samples.

andrewdchen commented 3 years ago

Thanks for working on this! If I was only interested in the rdkit mol objects, would I able to get even more speed by not doing the Chem.MolToSmiles(...) conversion on line 221? Currently, I'm only using mutate_mol with the parameter return_mol activated.

DrrDom commented 3 years ago

Yes, this should speed up on 30-40%, but there will be about 20-25% (in average) of duplicated molecules in the output. If you are ok with them you may remove generation of smiles and change the further code accordingly.

I'm still thinking how to improve the situation, but I do not see a way to avoid substantial changes. The worst thing is that it is difficult to estimate whether these changes will speed up whole generation or not.

andrewdchen commented 3 years ago

Would it be faster to use fingerprints to find duplicates instead of SMILE strings?

DrrDom commented 3 years ago

I do not know you may try to measure performance. But fingerprints have collisions and thus some molecules will be falsely identified as duplicates. So some compounds will be erroneously discardred. The number of such false duplicates depends on fingerprint type and structures. I never explored this issue.

DhanshreeA commented 1 year ago

Hi @DrrDom, I am interested in incorporating CReM within the Ersilia model hub and based on this thread I am curious to know if configuring the min_size/max_size parameters in MUTATE (or similarly min_atoms/max_atoms in GROW and LINK) would also affect speed?

DrrDom commented 1 year ago

Hi @DhanshreeA, yes, the speed will depend on these parameters also. I expect that the percentage of duplicates will be similar for different setups. So, if you generate more compounds, more duplicates should be discarded. Thus, if your parameters increases the number of generated compounds the time spent on detection of duplicates will rise proportionally (in average).

Example for GROW. min_atoms=1 and max_atoms=8 may generate 1000 molecules (and 200 duplicates will be discarded internally). min_atoms=9 and max_atoms=9 may generate 500 molecules (and 100 discarded duplicates). min_atoms=12 and max_atoms=12 may generate 2000 molecules (and 400 discarded duplicates). Everything depends on how many fragment replacements can be found in the database.

Final choice of parameters depends on the practical use case.

DrrDom / crem

Parameters relevant for speed in mutate_mol #4