MorganCThomas / MolScore

An automated scoring function to facilitate and standardize the evaluation of goal-directed generative models for de novo molecular design
MIT License

Draft: align MolOpt functions with GuacaMol implementation #47

Open AustinT opened 5 months ago

AustinT commented 5 months ago

My testing showed that MolScore's version of some of the MolOpt (GuacaMol) functions differs from that of the original package. This PR applies some (partial) fixes.

  1. Fingerprint size: the original GuacaMol functions use $2^{32}$-bit (sparse) fingerprints, whereas the implementations in MolScore use 1024-bit (dense) fingerprints. Folding into so few bits causes collisions, so all similarity values are slightly overestimated. I increased the fingerprint length to 16384 to reduce the overestimation.
  2. Fix Deco hop threshold: the implementation here used a threshold of 0.75, but the paper and code use 0.85.
  3. Align some MPO modifiers with GuacaMol's code in instances where paper and code differ. I didn't realize that GuacaMol's official implementation of their functions differs from the paper in several places. In these instances, I think it is better to side with their official code, because this is what prior works have used.
    • Fexofenadine_MPO: the paper specifies a standard deviation of 2 for the TPSA and logP modifiers, but the code uses 10 and 1 respectively. This causes scores to generally be lower in GuacaMol's implementation than in MolScore's.
    • Osimertinib_MPO: sigmas are also different (original code)
  4. Fixed SMARTS string in Scaffold hop (in response to #45)
  5. Fixed bug in isomer similarity calculation: the GuacaMol paper and code specify that the element-wise difference is taken with respect to elements in the target molecule only, while MolScore's implementation uses both the target and query molecules. A simple 1-line change fixes this.
  6. Fixed fingerprint radius in legacy QSAR: JNK3 and GSK3B both use radius 2 fingerprints, not radius 3.
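
To illustrate point 1, here is a minimal pure-Python sketch of why folding inflates Tanimoto similarity. It assumes folding works as a simple modulo over hashed feature IDs, and the feature IDs are made up for illustration. The helper names `fold` and `tanimoto` are hypothetical and are not part of MolScore, GuacaMol, or RDKit.

```python
def fold(feature_ids, n_bits):
    """Fold sparse feature IDs into n_bits positions (assumed modulo folding)."""
    return {fid % n_bits for fid in feature_ids}


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Made-up hashed feature IDs for two molecules (not real fingerprints).
mol_a = {5, 100, 2000}
mol_b = {5, 1124, 3000}

sparse_sim = tanimoto(mol_a, mol_b)                             # no folding
sim_1024 = tanimoto(fold(mol_a, 1024), fold(mol_b, 1024))       # 1124 collides with 100
sim_16384 = tanimoto(fold(mol_a, 16384), fold(mol_b, 16384))    # no collisions here

print(sparse_sim, sim_1024, sim_16384)
```

In this toy case the 1024-bit fold creates a spurious shared bit, while the 16384-bit fold reproduces the sparse value. Longer fingerprints reduce collisions but do not rule them out.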
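
For point 3, the effect of the sigma mismatch can be seen with a plain Gaussian modifier. This is only a sketch: `gaussian_modifier` is a hypothetical helper, the target value `mu=4.0` and the input `5.0` are arbitrary illustration values, and the real MPO objectives combine several modifiers.

```python
import math


def gaussian_modifier(x, mu, sigma):
    """Score in (0, 1] that decays as x moves away from mu."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


# Same deviation from the (arbitrary) target, three different sigmas.
x, mu = 5.0, 4.0
score_sigma_1 = gaussian_modifier(x, mu, sigma=1.0)
score_sigma_2 = gaussian_modifier(x, mu, sigma=2.0)
score_sigma_10 = gaussian_modifier(x, mu, sigma=10.0)

print(score_sigma_1, score_sigma_2, score_sigma_10)
```

For a fixed deviation, a smaller sigma always gives a lower score, so the choice between the paper's and the code's sigmas changes the reported values.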
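
The difference described in point 5 can be sketched as follows. This is a simplified illustration, not the GuacaMol or MolScore code: it scores only per-element counts with a Gaussian (sigma of 1 is an assumption here), omits any other terms of the real objective, and `isomer_score` is a hypothetical helper.

```python
import math


def gaussian(x, mu, sigma=1.0):
    """Gaussian score that equals 1 when x == mu."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def isomer_score(target, query, elements):
    """Geometric mean of per-element count scores over the given elements."""
    scores = [gaussian(query.get(e, 0), target.get(e, 0)) for e in elements]
    return math.prod(scores) ** (1.0 / len(scores))


# Element counts as plain dicts; the query has one extra nitrogen.
target = {"C": 2, "H": 6, "O": 1}
query = {"C": 2, "H": 6, "O": 1, "N": 1}

# Target elements only (the behaviour this PR aligns with).
score_target_only = isomer_score(target, query, sorted(target))

# Union of target and query elements (the previous behaviour, as described above).
score_union = isomer_score(target, query, sorted(set(target) | set(query)))

print(score_target_only, score_union)
```

With target-only elements the extra nitrogen is ignored by the element-wise terms, whereas the union version penalizes it.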

Also, I removed a spurious resource called molscore.configs.MolOpt-DF which does not exist in the repo (maybe it exists for you locally??)

Currently I have this PR marked as a draft. This is because

  1. The functions still don't seem to be 100% aligned (I still found small differences in Sitagliptin MPO, Valsartan SMARTS, and scaffold hop)
  2. If these fixes are accepted, they should also be applied in the other places in the repo where these functions are defined.