datamol-io / datamol

Molecular Processing Made Easy.
https://docs.datamol.io
Apache License 2.0

Add two new functions to remove salts and solvents from a molecule #201

Closed zhu0619 closed 1 year ago

zhu0619 commented 1 year ago

Changelogs

This PR added molecule processing functions

The salts and solvents are defined in SMARTS in datamol/data/salts_solvents.smi.



codecov[bot] commented 1 year ago

Codecov Report

Merging #201 (eb51b5b) into main (9bea940) will increase coverage by 0.61%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #201      +/-   ##
==========================================
+ Coverage   91.48%   92.09%   +0.61%     
==========================================
  Files          46       46              
  Lines        3664     3809     +145     
==========================================
+ Hits         3352     3508     +156     
+ Misses        312      301      -11     
Flag        Coverage Δ
unittests   92.09% <100.00%> (+0.61%)


Impacted Files   Coverage Δ
datamol/mol.py   97.00% <100.00%> (+0.06%)

... and 6 files with indirect coverage changes


hadim commented 1 year ago

@zhu0619 from now on, the title of the PR will be used to generate the changelog. I will rename it to make it more explicit.

Fransu86 commented 1 year ago

In general, salts in SMILES are identified as the smaller fragments in a multi-molecule string (each fragment/unit is separated by a "." in SMILES). For example, in c1ccccc1C[N+].Cl, benzylamine HCl salt, the salt is the smaller fragment. The usual workflow takes the "Cl" away (sometimes storing it in another field to remember that the parent molecule was a salt, especially if it comes from a commercial database), and the parent molecule is then protonated/neutralized/canonicalized, as required.

If you have multiple salt units, e.g. c1c([N+])cccc1[N+].Cl.Cl, bisaniline bis HCl, the process removes all the smaller fragments (the two Cl fragments) and leaves the parent aniline. The assumption here is that the salt units are smaller than the parent molecule, which is true for 99.9+% of cases when dealing with small drug-like molecules.

In conclusion, more than a list of salts, it is better to identify the fragments in the SMILES because that's a very safe and generalizable way to deal with salts. Does it make sense? We can discuss better in person on Monday if necessary.
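The size-based rule described above can be sketched at the string level. This is only an illustration under stated assumptions, not datamol's or RDKit's implementation: `heavy_atom_count` and `split_parent` are hypothetical names, atoms are counted with a simple regular expression rather than a real SMILES parser, and the input is assumed to have no ring-closure digits spanning a ".". The removed fragments are returned as well, so the salt information can be stored in a separate field.

```python
import re

# Tokenises atoms in a SMILES fragment: bracket atoms first, then the
# two-letter halogens, then the remaining organic-subset symbols.
ATOM_RE = re.compile(r"\[[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]")
# Pulls the element symbol out of a bracket atom, skipping isotope digits.
BRACKET_SYMBOL_RE = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]+)")


def heavy_atom_count(fragment):
    """Estimate the number of non-hydrogen atoms in one SMILES fragment."""
    count = 0
    for token in ATOM_RE.findall(fragment):
        if token.startswith("["):
            match = BRACKET_SYMBOL_RE.match(token)
            if match and match.group(1) == "H":
                continue  # [H], [H+], [2H] are hydrogens, not heavy atoms
        count += 1
    return count


def split_parent(smiles):
    """Return (parent, removed): the largest '.'-separated fragment and the rest."""
    fragments = smiles.split(".")
    parent = max(fragments, key=heavy_atom_count)
    removed = list(fragments)
    removed.remove(parent)  # drop a single occurrence of the parent
    return parent, removed


# The two examples from the comment above.
mono_salt = split_parent("c1ccccc1C[N+].Cl")
bis_salt = split_parent("c1c([N+])cccc1[N+].Cl.Cl")
```

In real code a cheminformatics toolkit would do the fragment splitting and atom counting; the thread refers to dm.keep_largest_fragment for that purpose.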

-- Ivan Franzoni, Ph.D. Senior Scientist, Computational Chemistry Valence Discovery Inc. https://www.valencediscovery.com/ 6666 St-Urbain, Suite 200 Montreal, QC H2Z 3H1 Canada

On Wed, Jun 21, 2023 at 8:05 PM Emmanuel Noutahi @.***> wrote:

@.**** commented on this pull request.

In datamol/data/salts.smi https://github.com/datamol-io/datamol/pull/201#discussion_r1237839109:

+SC#N Thiocyanic acid
+CI Methyl Iodide
+OS(=O)O Sulfurous Acid
+C1CCC(CC1)NC2CCCCC2 Dicyclohexylamine
+OS(=O)(=O)C(F)(F)F Triflate
+Cc1cc(C)c(c(C)c1)S(=O)(=O)O Mesitylene sulfonate
+OC(=O)CC(=O)O Malonic acid
+OS(=O)(=O)F Fluorosulfuric acid
+CC(=O)OS(=O)(=O)O Acetylsulfate
+[H] Proton
+[Rb] Rubidium
+[Cs] Cesium
+[Fr] Francium
+[Be] Beryllium
+[Ra] Radium
+C(=O)C(O)C(O)C(O)C(O)C(=O)O Glucuronate open form

Yes, we should make either the file or the list of SMARTS/SMILES an optional argument. I have some doubts about the salt/solvent file definition used here. Ping @Fransu86 https://github.com/Fransu86 for review.


zhu0619 commented 1 year ago

In conclusion, more than a list of salts, it is better to identify the fragments in the SMILES because that's a very safe and generalizable way to deal with salts. Does it make sense? We can discuss better in person on Monday if necessary.

That means that in most cases, users can simply use dm.keep_largest_fragment. But for special cases, users need to define the salts/solvents that need to be removed. In that case, they can use remove_salt_solvent.

Fransu86 commented 1 year ago

Exactly, salts and solvents are usually still removed based on relative size.


maclandrol commented 1 year ago

The fragment matching performed by most processing tools (including RDKit) is meant to avoid the edge case where your largest fragment is not your molecule but rather the salt/solvent (e.g. https://pubchem.ncbi.nlm.nih.gov/compound/3038222).
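A made-up example shows the edge case: when the counterion is larger than the parent, a "keep the largest fragment" rule keeps the wrong piece, while list-based matching removes it by identity. The sketch below is an assumption-laden illustration: `KNOWN_COUNTERIONS` and `strip_listed_fragments` are hypothetical names, the salt is illustrative and is not the PubChem entry linked above, and exact string comparison stands in for the SMARTS substructure matching that the PR's definition file implies.

```python
# Hypothetical mini list of counterions, written exactly as they appear in the
# input below. The PR keeps its real definitions as SMARTS in
# datamol/data/salts_solvents.smi; exact string comparison is only a stand-in.
KNOWN_COUNTERIONS = frozenset({"Cl", "[Cl-]", "[Na+]", "Cc1ccc(cc1)S(=O)(=O)[O-]"})


def strip_listed_fragments(smiles, known=KNOWN_COUNTERIONS):
    """Drop every '.'-separated fragment that appears in `known`."""
    fragments = smiles.split(".")
    kept = [frag for frag in fragments if frag not in known]
    # If every fragment is on the list, return the input rather than nothing.
    return ".".join(kept) if kept else smiles


# Methylammonium tosylate (illustrative): the tosylate is the larger fragment,
# so a size-based rule would keep the counterion instead of the parent.
salt = "C[NH3+].Cc1ccc(cc1)S(=O)(=O)[O-]"
largest_by_length = max(salt.split("."), key=len)  # crude size proxy
parent = strip_listed_fragments(salt)
```

Here `largest_by_length` is the tosylate, whereas `parent` is the methylammonium fragment. The guard for the case where everything matches mirrors the concern of not stripping a record down to an empty structure.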

Fransu86 commented 1 year ago

Those are still very rare cases. I'd be more inclined to go more general and "sacrifice" a handful of very, very rare cases rather than enumerate all the salts that have ever been reported in the literature/databases. Also, in many cases the same molecule may exist in different salt forms. In the example you shared, Dobutamine will still be saved from the free-base or smaller salt forms. Eventually, when you strip solvents and salts, you'll need to check for duplicates, or do you want to carry the salt information as well? If you want to correlate, e.g., biochemical activities or solubilities of different salt forms of the same molecule, then you need to include the salt explicitly; if not, then you don't really care.


zhu0619 commented 1 year ago

I have committed changes based on all of the above discussion. The use cases are explained in the docstring. Users can use either keep_largest_fragment OR remove_salts_solvents with the default salt/solvent file OR with their own salt/solvent data as an argument.

@Fransu86 I still need your help with reviewing the salts_solvents.smi file, in case some units are wrongly defined.

Fransu86 commented 1 year ago

File checked! It seems quite extensive, even though it is from 2006. I noticed that there is a mention of copyright in it - better check hehehe

I think this file to filter salts should work fine. My suggestion would be a small general workflow that does the following:
1) Check whether the SMILES contains more than one molecule.
2) Use the list to remove the salt/solvent molecules.
3) If one of the remaining fragments (when there is more than one) has no match in the list, just remove the smaller one.

Something like that should be the safest and most general approach to strip salts/solvents from databases.
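The suggested workflow can be sketched as a single function. This is a minimal illustration under assumptions, not the code merged in the PR: `strip_salts` is a hypothetical name, the salt/solvent list is matched by exact string comparison instead of SMARTS, and the default `size=len` ranks fragments by SMILES string length as a crude stand-in for a heavy-atom count from a cheminformatics toolkit.

```python
def strip_salts(smiles, known, size=len):
    """Strip salts/solvents: list-based removal first, size-based fallback second.

    `known` is a collection of fragment strings to remove (hypothetical format).
    `size` ranks fragments; string length is only a rough proxy for molecule size.
    """
    fragments = smiles.split(".")

    # Step 1: a single molecule has nothing to strip.
    if len(fragments) == 1:
        return smiles

    # Step 2: remove fragments found in the salt/solvent list. If everything
    # matched, fall back to the full fragment list so the result is never empty.
    kept = [frag for frag in fragments if frag not in known] or fragments

    # Step 3: if unlisted extra fragments remain, keep only the largest one.
    if len(kept) > 1:
        return max(kept, key=size)
    return kept[0]


known_units = {"Cl", "O"}  # illustrative list: HCl written as "Cl", water as "O"
listed_only = strip_salts("c1ccccc1C[N+].Cl", known_units)
with_unlisted = strip_salts("c1ccccc1C[N+].Cl.CC(=O)O", known_units)
```

In the second call the acetic acid fragment is not on the list, so step 3 removes it by size. Whether step 3 belongs inside the function (behind an argument) or stays a separate call is a design choice; keeping it behind a flag would let callers who need the salt information skip it.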


zhu0619 commented 1 year ago

Thanks @Fransu86. @hadim @maclandrol For step 3, we can add an argument and apply keep_largest_fragment for mismatches. Or is it better to keep it separate from the function?