datamol-io / datamol

Molecular Processing Made Easy.
https://docs.datamol.io
Apache License 2.0
462 stars 48 forks source link

New feature. Users can now use some RDKit FilterCatalog features. #178

Closed Pakman450 closed 1 year ago

Pakman450 commented 1 year ago

Hello,

I was shocked to learn that Datamol doesn't have filtering feature to find molecules that have PAINS hits (unless I am REALLY blind and I can't find it. If you do please direct me). So I added a filtering mechanism to discover any hits in a list of SMILES. The user can additionally specify what kind of RDKit filtering catalogs to use, which is what the catalog_specifiers is in the new code. Below is available RDKit catalogs also shown in this doc page:

ALL = rdkit.Chem.rdfiltercatalog.FilterCatalogs.ALL
BRENK = rdkit.Chem.rdfiltercatalog.FilterCatalogs.BRENK
NIH = rdkit.Chem.rdfiltercatalog.FilterCatalogs.NIH
PAINS = rdkit.Chem.rdfiltercatalog.FilterCatalogs.PAINS
PAINS_A = rdkit.Chem.rdfiltercatalog.FilterCatalogs.PAINS_A
PAINS_B = rdkit.Chem.rdfiltercatalog.FilterCatalogs.PAINS_B
PAINS_C = rdkit.Chem.rdfiltercatalog.FilterCatalogs.PAINS_C
ZINC = rdkit.Chem.rdfiltercatalog.FilterCatalogs.ZINC

Now, ALL is all of them in that list. So to make sure there are no duplicates in a user-defined catalog_specifiers, I made sure to uniquify a that list in set_filter_params. Further, the reason why set_filter_params is not in __init__.py is because I didn't want the user to use that fn to simplify data processing, as per the philosophy of datamol. If users were to specify the catalogs, it will automatically do it for you.

All simple fns are in matches.py. The filter.py has the fns that uses various fns in matches.py to either return a python dictionary of the results, or a pandas dataframe. Also, I added number of filter catalog hits in the compute.py because I think the user would want to know how many catalog hits there are for each of their molecule, which can be debated let me know what your thoughts are.

Thanks, Steven Pak

Checklist:


codecov[bot] commented 1 year ago

Codecov Report

Merging #178 (4f96abc) into main (b8531fa) will increase coverage by 0.20%. The diff coverage is 98.93%.

@@            Coverage Diff             @@
##             main     #178      +/-   ##
==========================================
+ Coverage   91.13%   91.33%   +0.20%     
==========================================
  Files          47       50       +3     
  Lines        3529     3623      +94     
==========================================
+ Hits         3216     3309      +93     
- Misses        313      314       +1     
Flag Coverage Δ
unittests 91.33% <98.93%> (+0.20%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
datamol/descriptors/descriptors.py 90.90% <ø> (ø)
datamol/filter/matches.py 98.21% <98.21%> (ø)
datamol/__init__.py 100.00% <100.00%> (ø)
datamol/descriptors/__init__.py 100.00% <100.00%> (ø)
datamol/descriptors/compute.py 100.00% <100.00%> (ø)
datamol/filter/__init__.py 100.00% <100.00%> (ø)
datamol/filter/filter.py 100.00% <100.00%> (ø)

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

Pakman450 commented 1 year ago

@hadim, apologies, I should have brought this up in the gh issues for this repo according to your changes in the PULL_REQUEST_TEMPLATE.md. Do you want me to close this PR and submit an issue about this feature?

hadim commented 1 year ago

Hello @Pakman450 and thank you for spending time on this! Unfortunately, we have already planned to release such filtering features in the future. We don't know yet whether it will be in datamol or in a separate package under the datamol organization.

I apologize that we will likely not be able to review and merge that PR. It wasn't explicitly stated, but in general, it's a good practice to open an issue before investing time into a PR. I have now modified the PR template to add a comment related to that: https://github.com/datamol-io/datamol/commit/d4f5b4c45351a1d25fcd4fa2d8e22abecb932668

Thanks again for your interest in datamol and I hope we'll see you again soon!

hadim commented 1 year ago

@hadim, apologies, I should have brought this up in the gh issues for this repo according to your changes in the PULL_REQUEST_TEMPLATE.md. Do you want me to close this PR and submit an issue about this feature?

It looks like we were writing at the same time :-) Thank you for your understanding!

Pakman450 commented 1 year ago

@hadim I understand. I am really sorry about that. I will do that next time!