datamol-io / datamol

Molecular Processing Made Easy.
https://docs.datamol.io
Apache License 2.0
462 stars 48 forks source link

fuzzy_scaffold can now return a pandas dataframe that is more readable. #188

Closed Pakman450 closed 1 year ago

Pakman450 commented 1 year ago

Hello! This is for issue #114. So this edit is quite opinionated. I don't know if you guys would this kind of organized dataframe. But first I have to discuss where I added this change. I decided to add a optional flag for datamol users in the fuzzy_scaffolding function. The flag is termed if_df, which is defaulted to False. The flag is termed if_df, which is defaulted to True. Two separate pandas dataframe for each scf2infos and scf2groups will be returned. The rationale is not to confuse users on how fuzzy_scaffolding function would return previously.

NOTE: the output below is my best attempt to express the pandas dataframe. The output is just a df that has 3 columns and not 15 rows.

Let's start with scf2infos

The way scf2infos is structured is absolutely perfect to be pandas transposed. Every scaffold output has its corresponding rdkit mol list and its smarts pattern. This means every row will represent the scf and every scf will have its mol list.

output:

                                         scf  \
index                                          
0                      CC(CC1CCCCC1)CC1CCCC1   
1                     CC(CCC1CCCCC1)CC1CCCC1   
2                    CC(CCCC1CCCC1)CCC1CCCC1   
3                      CC(CC1CCCCC1)C1CCCCC1   
4      CC(CCC1CCC(C2CCC3CCCCC32)C1)CC1CCCCC1   

                                                    mols  \
index                                                      
0        [<rdkit.Chem.rdchem.Mol object at 0x1108a1a80>]   
1      [<rdkit.Chem.rdchem.Mol object at 0x1108a06d0>...   
2        [<rdkit.Chem.rdchem.Mol object at 0x1108a0200>]   
3        [<rdkit.Chem.rdchem.Mol object at 0x1108a2dc0>]   
4        [<rdkit.Chem.rdchem.Mol object at 0x1108a2c00>]   

                                                  smarts  
index                                                     
0      [#6]1:&@[#6]:&@[#6]:&@[#6](:&@[#6]:&@[#6]:&@1)...  
1      [#6]-&!@[#8]-&!@[#6]1:&@[#6]:&@[#6]:&@[#6](:&@...  
2      [#6](-&!@[#7]-&!@[#6](=&!@[#8])-&!@[#6]-&!@[#1...  
3      [#6]1-&@[#6]-&@[#6]-&@[#6]-&@[#6]-&@[#7]-&@1-&...  
4      [#15](=,-;!@[#8,#7])(-&!@[#8]-&!@[#6]-&!@[#6]1...  

What about scf2groups?

scf2groups is trickier because the core 'groups' are in their individual dictionaries contained in a list. This can be difficult to create a df due to these multi-valued attributes. Thankfully, Pandas can control for the multi-valued attributes by calling from the .from_dict() method and setting orient to 'index'. Further, this allows the df for each scf row to have None values if there are no results. So this df will dynamically create n_core_groups if there is an output with n number of core groups. So in the case below, scf index 1 has two core groups but the rest do not.

                                         scf  \
index                                          
0                      CC(CC1CCCCC1)CC1CCCC1   
1                     CC(CCC1CCCCC1)CC1CCCC1   
2                    CC(CCCC1CCCC1)CCC1CCCC1   
3                      CC(CC1CCCCC1)C1CCCCC1   
4      CC(CCC1CCC(C2CCC3CCCCC32)C1)CC1CCCCC1   

                                            0_core_group  \
index                                                      
0      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...   
1      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...   
2      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...   
3      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...   
4      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...   

                                            1_core_group  
index                                                     
0                                                   None  
1      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...  
2                                                   None  
3                                                   None  
4                                                   None  

Other than that. I hope you like the code. I also updated the test cases.

Thanks, Steven Pak

Checklist:


codecov[bot] commented 1 year ago

Codecov Report

Merging #188 (dc04ffa) into main (9bb6258) will increase coverage by 0.02%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #188      +/-   ##
==========================================
+ Coverage   91.45%   91.47%   +0.02%     
==========================================
  Files          46       46              
  Lines        3661     3672      +11     
==========================================
+ Hits         3348     3359      +11     
  Misses        313      313              
Flag Coverage Δ
unittests 91.47% <100.00%> (+0.02%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
datamol/scaffold/_fuzzy.py 90.44% <100.00%> (+0.71%) :arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

Pakman450 commented 1 year ago

It seems like I am having trouble dealing with conditional return typing for fuzzy_scaffolding . I am currently trying to figure out how to do this. I found a overload functionality for my issue, so I am gonna try that. IF you guys have any recommendations. That would be great.

maclandrol commented 1 year ago

It seems like I am having trouble dealing with conditional return typing for fuzzy_scaffolding . I am currently trying to figure out how to do this. I found a overload functionality for my issue, so I am gonna try that. IF you guys have any recommendations. That would be great.

@Pakman450, it's fine to break compatibility and return only the new structure here. This code has not been maintained for a while, and we should likely refactor most of it. See here too: https://github.com/datamol-io/datamol/issues/119

Pakman450 commented 1 year ago

It seems like I am having trouble dealing with conditional return typing for fuzzy_scaffolding . I am currently trying to figure out how to do this. I found a overload functionality for my issue, so I am gonna try that. IF you guys have any recommendations. That would be great.

@Pakman450, it's fine to break compatibility and return only the new structure here. This code has not been maintained for a while, and we should likely refactor most of it. See here too: #119

I changed the if_df defaulted to true, so it will always bring dataframes. Should I completely remove that flag and change to return type just to bring dataframes and no dictionaries?

hadim commented 1 year ago

@Pakman450 could you rebase this PR to main please as well? So we can make sure this PR is compatible with the latest rdkit.

Pakman450 commented 1 year ago

Let's put the refactoring of the function itself in a different PR.

@Pakman450 can you document the column name in the dataframes ?

ping @hadim for planning refactoring later.

@maclandrol Where do you want me to document them? In the comments?

Pakman450 commented 1 year ago

@Pakman450 could you rebase this PR to main please as well? So we can make sure this PR is compatible with the latest rdkit.

Sure. I give me a moment.

Pakman450 commented 1 year ago

@Pakman450 could you rebase this PR to main please as well? So we can make sure this PR is compatible with the latest rdkit.

Sure. I give me a moment.

rebase complete.

maclandrol commented 1 year ago

@maclandrol Where do you want me to document them? In the comments?

In the docstring of the function, so it's rendered in the documentation.

maclandrol commented 1 year ago

LGTM, thanks @Pakman450