Closed zhu0619 closed 1 year ago
Merging #201 (eb51b5b) into main (9bea940) will increase coverage by 0.61%. The diff coverage is 100.00%.
@@ Coverage Diff @@
## main #201 +/- ##
==========================================
+ Coverage 91.48% 92.09% +0.61%
==========================================
Files 46 46
Lines 3664 3809 +145
==========================================
+ Hits 3352 3508 +156
+ Misses 312 301 -11
Flag | Coverage Δ | |
---|---|---|
unittests | 92.09% <100.00%> (+0.61%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
Impacted Files | Coverage Δ | |
---|---|---|
datamol/mol.py | 97.00% <100.00%> (+0.06%) | :arrow_up: |
... and 6 files with indirect coverage changes
@zhu0619 from now on the title of the PR will be used for generating the changelogs. I will rename it to make it more explicit.
In general, salts in SMILES are identified as the smaller fragment in a multi-molecule string. E.g., in c1ccccc1C[N+].Cl, benzylamine HCl salt, the salt is identified as the smaller fragment in the SMILES (each fragment/unit is separated by a "." in SMILES). So here the workflow usually takes the "Cl" away (sometimes it is stored in another field to remember that the parent molecule was a salt, especially if coming from a commercial database) and the parent molecule is then protonated/neutralized/canonicalized, as required. If you have multiple salts, e.g. c1c([N+])cccc1[N+].Cl.Cl, bisaniline bis HCl, the process would remove all the smaller fragments (the two smaller Cl fragments) and leave the parent aniline. The assumption here is that the salt units are smaller than the parent molecule, and that's true for 99.9+% of the cases when dealing with small drug-like molecules.
In conclusion, more than a list of salts, it is better to identify the fragments in the SMILES because that's a very safe and generalizable way to deal with salts. Does it make sense? We can discuss better in person on Monday if necessary.
-- Ivan Franzoni, Ph.D. Senior Scientist, Computational Chemistry Valence Discovery Inc. https://www.valencediscovery.com/ 6666 St-Urbain, Suite 200 Montreal, QC H2Z 3H1 Canada
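The fragment-size rule described above can be sketched in plain Python. This is only an illustration under stated assumptions: it splits the SMILES string on "." and approximates fragment size with a regex-based heavy-atom count instead of parsing the molecule, and the names `heavy_atom_count` and `split_parent_and_salts` are hypothetical (they are not part of datamol or RDKit). A real pipeline would more likely rely on a cheminformatics toolkit.

```python
import re
from typing import List, Tuple

# Rough atom tokenizer: bracket atoms first, then two-letter halogens,
# then the one-letter organic subset (aliphatic and aromatic).
_ATOM = re.compile(r"\[[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]")
# Explicit hydrogens such as [H], [H+] or [2H] are not heavy atoms.
_HYDROGEN = re.compile(r"\[\d*H[+-]?\d*\]")


def heavy_atom_count(fragment: str) -> int:
    """Approximate the number of non-hydrogen atoms in one SMILES fragment."""
    tokens = _ATOM.findall(fragment)
    return sum(1 for token in tokens if not _HYDROGEN.fullmatch(token))


def split_parent_and_salts(smiles: str) -> Tuple[str, List[str]]:
    """Return (largest fragment, list of the removed smaller fragments).

    Assumption: fragments are separated by "." and no ring-closure digit
    links atoms across a dot, which holds for typical salt records.
    """
    fragments = smiles.split(".")
    parent = max(fragments, key=heavy_atom_count)
    removed = list(fragments)
    removed.remove(parent)  # drop only the first occurrence of the parent
    return parent, removed


parent, counterions = split_parent_and_salts("c1c([N+])cccc1[N+].Cl.Cl")
print(parent, counterions)
```

Returning the removed fragments alongside the parent makes it possible to store the counterions in a separate field, as suggested for records coming from commercial databases.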
On Wed, Jun 21, 2023 at 8:05 PM Emmanuel Noutahi @.***> wrote:
@.**** commented on this pull request.
In datamol/data/salts.smi https://github.com/datamol-io/datamol/pull/201#discussion_r1237839109:
+SC#N Thiocyanic acid
+CI Methyl Iodide
+OS(=O)O Sulfurous Acid
+C1CCC(CC1)NC2CCCCC2 Dicyclohexylamine
+OS(=O)(=O)C(F)(F)F Triflate
+Cc1cc(C)c(c(C)c1)S(=O)(=O)O Mesitylene sulfonate
+OC(=O)CC(=O)O Malonic acid
+OS(=O)(=O)F Fluorosulfuric acid
+CC(=O)OS(=O)(=O)O Acetylsulfate
+[H] Proton
+[Rb] Rubidium
+[Cs] Cesium
+[Fr] Francium
+[Be] Beryllium
+[Ra] Radium
+C(=O)C(O)C(O)C(O)C(O)C(=O)O Glucuronate open form
Yes, we should have either the file or list of smarts/smiles optional. I have some doubt about the salt/solvent file definition used here. Ping @Fransu86 https://github.com/Fransu86 for review.
> In conclusion, more than a list of salts, it is better to identify the fragments in the SMILES because that's a very safe and generalizable way to deal with salts. Does it make sense? We can discuss better in person on Monday if necessary.

That means that in most cases, users can simply use `dm.keep_largest_fragment`. But for special cases, users need to define the salts/solvents that need to be removed. In that case, they can use `remove_salt_solvent`.
Exactly, salts and solvents are usually still removed based on their relative size.
The fragment matching performed by most processing tools (including RDKit) is there to avoid the edge case where your largest fragment is not your molecule, but rather the salt/solvent (e.g., https://pubchem.ncbi.nlm.nih.gov/compound/3038222).
Those are still very rare cases. I'd be more inclined to go more general and "sacrifice" a handful of very, very rare cases rather than enumerate all the salts ever reported in the literature/databases. Also, in many cases the same molecule may exist in different salt forms. In the example you shared, Dobutamine will still be saved from the free-base or smaller salt forms. Eventually, when you strip solvents and salts, you'll need to check for duplicates. Or do you want to carry the salt information as well? If you want to correlate, e.g., biochemical activities or solubilities of different salt forms of the same molecule, then you need to include the salt explicitly; if not, then you don't really care.
I have committed changes based on all the above discussion. The use case is explained in the docstring.
Users can use either `keep_largest_fragment` OR `remove_salts_solvents` with the default salt/solvent file OR with their own salt/solvent data as an argument.
@Fransu86 I still need your help with reviewing the `salts_solvents.smi` file, in case some units are wrongly defined.
File checked! It seems quite extensive even though it dates from 2006. I noticed that there is a mention of copyright in it - better check hehehe
I think this file to filter salts should work fine. My suggestion would be to have a small general workflow that does the following:
1) Checks whether the SMILES contains more than one molecule.
2) Uses the list to remove the salt/solvent molecules.
3) If more than one fragment remains because some fragment has no match in the list, just removes the smaller one.
Something like that should be the safest and most general approach to strip salts/solvents from databases.
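The three steps above could look roughly like the following sketch. It is not the code from this PR: `strip_salts_solvents` here is a hypothetical helper, the salt/solvent list is a tiny made-up set of SMILES strings matched by exact string comparison (the file in the PR defines them in SMARTS), and fragment size is approximated by counting atom tokens with a regex (explicit hydrogens are counted too, for simplicity).

```python
import re
from typing import Iterable

# Rough atom tokenizer used only to compare fragment sizes.
_ATOM = re.compile(r"\[[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def _size(fragment: str) -> int:
    """Approximate atom count of one SMILES fragment."""
    return len(_ATOM.findall(fragment))


def strip_salts_solvents(smiles: str, known: Iterable[str]) -> str:
    """Strip salts/solvents with a list first, then fall back to size."""
    fragments = smiles.split(".")
    # Step 1: a single molecule has nothing to strip.
    if len(fragments) == 1:
        return smiles
    # Step 2: drop every fragment found in the salt/solvent list.
    known_set = set(known)
    kept = [frag for frag in fragments if frag not in known_set]
    # If everything matched the list, do not return an empty molecule.
    if not kept:
        kept = fragments
    # Step 3: unmatched extra fragments are resolved by size.
    return max(kept, key=_size)


# Hypothetical list; the triflate entry is taken from the reviewed file.
KNOWN = {"Cl", "O", "OS(=O)(=O)C(F)(F)F"}

# The counterion is larger than the parent, so size alone would fail here.
print(strip_salts_solvents("CN.OS(=O)(=O)C(F)(F)F", KNOWN))
# Acetic acid is not in the list, so the size fallback removes it.
print(strip_salts_solvents("c1ccccc1CN.CC(=O)O", KNOWN))
```

Running the list match before the size rule covers the rare case where the salt or solvent is the largest fragment, while the size fallback handles counterions that the list does not enumerate.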
Thanks @Fransu86.
@hadim @maclandrol For step 3, we can add an argument and apply `keep_largest_fragment` for mismatches. Or is it better to keep it separate from the function?
Changelogs
This PR added the molecule processing function `remove_salts` for removing all the salts and solvents in the molecule.
The salts and solvents are defined in SMARTS in `datamol/data/salts_solvents.smi`.
Checklist:
`feature`, `fix` or `test` (or ask a maintainer to do it for you).
discussion related to that PR