ersilia-os / xai4chem

Basic explainable AI for QSAR chemistry models
GNU General Public License v3.0

Create a class to calculate molecular descriptors using `fit` and `transform` logic #3

Closed miquelduranfrigola closed 4 months ago

miquelduranfrigola commented 7 months ago

We should calculate molecular descriptors. The most important thing is that these descriptors are well known. I suggest that we use Datamol descriptors: https://docs.datamol.io/0.9.1/tutorials/Descriptors.html, in particular, this function:

import datamol as dm

# Batch compute many descriptors for a list of compounds (mols is a list of RDKit molecules)
df = dm.descriptors.batch_compute_many_descriptors(mols)

We then have to define a Descriptor class that has a fit and a transform method. At fit time, constant descriptors should be removed, missing values imputed and, most importantly, all values normalized or scaled in some way. The following code from another Ersilia project may help:

It would be nice to include a save and a load method as well. We can use joblib for this.
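As a rough illustration of the fit/transform/save/load logic described above, here is a minimal sketch. It assumes the raw descriptors are already available as a numeric table (for example, the DataFrame returned by datamol) and uses scikit-learn for imputation, constant-column removal and scaling. The class name DescriptorSketch, the median imputation and the RobustScaler choice are assumptions made for illustration; this is not the code from the other Ersilia project.

```python
import joblib
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler


class DescriptorSketch:
    """Hypothetical fit/transform wrapper around a raw descriptor table."""

    def __init__(self):
        # Assumed preprocessing choices: median imputation, removal of
        # zero-variance (constant) columns, then robust scaling.
        self.pipeline = Pipeline(
            [
                ("impute", SimpleImputer(strategy="median")),
                ("drop_constant", VarianceThreshold(threshold=0.0)),
                ("scale", RobustScaler()),
            ]
        )

    def fit(self, X):
        # Learn medians, the columns to keep and the scaling parameters.
        self.pipeline.fit(np.asarray(X, dtype=float))
        return self

    def transform(self, X):
        # Apply the parameters learned at fit time to new data.
        return self.pipeline.transform(np.asarray(X, dtype=float))

    def save(self, path):
        # Persist only the fitted scikit-learn pipeline.
        joblib.dump(self.pipeline, path)

    @classmethod
    def load(cls, path):
        obj = cls()
        obj.pipeline = joblib.load(path)
        return obj


# Toy table: the second column is constant, the third has a missing value.
X_raw = np.array(
    [
        [1.0, 5.0, 10.0],
        [2.0, 5.0, np.nan],
        [3.0, 5.0, 30.0],
        [4.0, 5.0, 40.0],
    ]
)
descriptor = DescriptorSketch().fit(X_raw)
X_scaled = descriptor.transform(X_raw)
```

Fitting on the training set only and reusing the fitted object at prediction time keeps the imputation and scaling parameters consistent between training and inference.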

HellenNamulinda commented 7 months ago

Hi @miquelduranfrigola, by default, datamol's compute_many_descriptors (single molecule) and batch_compute_many_descriptors (list of molecules) compute 22 opinionated molecular properties:

mw
fsp3
n_lipinski_hba
n_lipinski_hbd
n_rings
n_hetero_atoms
n_heavy_atoms
n_rotatable_bonds
n_radical_electrons
tpsa
qed
clogp
sas
n_aliphatic_carbocycles
n_aliphatic_heterocyles
n_aliphatic_rings
n_aromatic_carbocycles
n_aromatic_heterocyles
n_aromatic_rings
n_saturated_carbocycles
n_saturated_heterocyles
n_saturated_rings

These seem like a good number of features. But do you think they are enough, or should we add other properties to calculate? I can modify the Descriptor class (https://github.com/ersilia-os/xai4chem/pull/5) to cater for that in the pipeline.

miquelduranfrigola commented 7 months ago

Hi @HellenNamulinda, let's start with these 22 features. Most likely, they won't be enough to achieve the best possible models, but let's start here. As discussed, let's bring the pipeline to the end, and then we will come back to this point and do some feature engineering! But for now, let's work based on this.

HellenNamulinda commented 6 months ago

Just an update on this issue.

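We are first bringing the pipeline to an end, using the datamol descriptors (22 features). To improve the performance of the model, we will

  • Increase the number of features (properties calculated by the datamol descriptor class).
  • Try other descriptors such as Mordred descriptors.
  • And perhaps the Morgan fingerprints.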

That's why this issue will remain open until we have explored the different descriptors.

miquelduranfrigola commented 6 months ago

Thanks @HellenNamulinda - great summary.

HellenNamulinda commented 5 months ago

While it is possible to increase the number of features in datamol, it is not very handy when calculating more properties: it requires passing a dictionary (properties_fn) containing functions (from RDKit) for all the additional features/descriptors to be calculated.

So, we have decided to use RDKit instead to get all the descriptors. Mordred descriptors were also added. This makes it possible to choose which descriptors to use (Datamol, Mordred or the RDKit classical descriptors).
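A minimal sketch of computing the full set of classical RDKit descriptors is shown below. It relies on RDKit's Descriptors.descList, a list of (name, function) pairs; the helper name rdkit_descriptor_table is hypothetical and this is not necessarily how the xai4chem class is implemented.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors


def rdkit_descriptor_table(smiles_list):
    """Hypothetical helper: one row per molecule, one column per RDKit descriptor."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            raise ValueError(f"Could not parse SMILES: {smi}")
        # descList holds (descriptor name, function) pairs.
        rows.append({name: fn(mol) for name, fn in Descriptors.descList})
    return pd.DataFrame(rows, index=smiles_list)


table = rdkit_descriptor_table(["CCO", "c1ccccc1O"])
```

The exact number of columns depends on the installed RDKit version, so downstream code should not hard-code it.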

With more features from RDKit and Mordred, FeatureWiz is now used to select the top features.
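To illustrate the general idea of keeping only the top features, here is a small stand-in using scikit-learn's SelectKBest on synthetic data. This is not FeatureWiz and does not reproduce its algorithm; the regression setting, the f_regression score and k=10 are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 50 candidate features, only columns 0 and 7 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = 3.0 * X[:, 0] - 2.0 * X[:, 7] + rng.normal(scale=0.1, size=200)

# Keep the 10 features with the highest univariate F-score against y.
selector = SelectKBest(score_func=f_regression, k=10)
X_selected = selector.fit_transform(X, y)

# Indices of the retained columns, useful for mapping back to descriptor names.
kept = selector.get_support(indices=True)
```

As with the descriptor preprocessing, the selector should be fitted on training data only and then reused unchanged at prediction time.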

PR: https://github.com/ersilia-os/xai4chem/pull/9

miquelduranfrigola commented 5 months ago

This sounds good, @HellenNamulinda, thanks so much. This gives us a good starting point with three different types of descriptors:

HellenNamulinda commented 4 months ago

Just an update on this issue.

We are first bringing the pipeline to an end, using the datamol descriptors (22 features). To improve the performance of the model, we will

  • [x] Increase the number of features (properties calculated by the datamol descriptor class).
  • [x] Try other descriptors such as Mordred descriptors.
  • [x] And perhaps the Morgan fingerprints.

That's why this issue will remain open until we have explored the different descriptors.

The Morgan fingerprint features were integrated in this commit (https://github.com/ersilia-os/xai4chem/pull/11/commits/d2ff5800b5db308fa8ab56be4ea9ae081b4e91e6).
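For readers unfamiliar with them, Morgan fingerprints can be computed with RDKit roughly as follows. This is a sketch and not the code from the linked commit: the radius, the fingerprint size, the use of a bit vector rather than counts, and the helper name morgan_matrix are all assumptions.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def morgan_matrix(smiles_list, radius=2, n_bits=2048):
    """Hypothetical helper: one binary Morgan fingerprint per row."""
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            raise ValueError(f"Could not parse SMILES: {smi}")
        # Each fingerprint is returned as a 1D NumPy array of 0/1 values.
        rows.append(generator.GetFingerprintAsNumPy(mol))
    return np.vstack(rows)


fingerprints = morgan_matrix(["CCO", "c1ccccc1O"])
```

Unlike the physicochemical descriptors, each column here corresponds to a hashed substructure, so explanations based on these features need to be mapped back to atoms or fragments.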