IBM / pychemex

Python library for Cheminformatics ML model explainability
Apache License 2.0

Calculate a lot of features for the sample dataset #5

Closed JJanowiak closed 3 years ago

JJanowiak commented 3 years ago

Which features to calculate is still to be decided. We will need to look at MP prediction papers and at what's freely and easily available. Try grouping the features into categories based on what they intend to capture, so that specific categories of features could be "turned off" to show the functionality of the library.

Will probably split this into multiple issues later.

Alex-AMC commented 3 years ago

Suggested categories for features:

JJanowiak commented 3 years ago

2 CSVs:

  1. Feature calculations: one SMILES per row, one feature per column

  2. Feature categories: one feature per row, one category per column
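A rough sketch of how the two CSVs could work together to "turn off" a category of features. The feature names (`feat_a`, ...), the category names, the values, and the 1/0 membership flags are all placeholders, and the helpers `load_category_map` and `drop_categories` are hypothetical, not part of the library. It also assumes the category CSV has a `feature` column followed by one column per category.

```python
import csv
import io

# Placeholder data only: layout 1 is one SMILES per row, one feature per column.
FEATURES_CSV = """SMILES,feat_a,feat_b,feat_c
CCO,1.0,2.0,3.0
c1ccccc1,4.0,5.0,6.0
"""

# Layout 2 is one feature per row, one category per column (1 = member).
CATEGORIES_CSV = """feature,size,polarity
feat_a,1,0
feat_b,0,1
feat_c,0,1
"""


def load_category_map(categories_csv):
    """Map each category name to the set of features flagged with 1."""
    reader = csv.DictReader(io.StringIO(categories_csv))
    categories = {}
    for row in reader:
        feature = row.pop("feature")
        for category, flag in row.items():
            if flag.strip() == "1":
                categories.setdefault(category, set()).add(feature)
    return categories


def drop_categories(features_csv, categories, turned_off):
    """Return the feature rows without any column in a turned-off category."""
    excluded = set()
    for category in turned_off:
        excluded |= categories.get(category, set())
    reader = csv.DictReader(io.StringIO(features_csv))
    return [
        {name: value for name, value in row.items() if name not in excluded}
        for row in reader
    ]


categories = load_category_map(CATEGORIES_CSV)
rows = drop_categories(FEATURES_CSV, categories, turned_off=["polarity"])
```

Keeping the category membership in its own file means a category can be switched off without recalculating or rewriting the feature table.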

JJanowiak commented 3 years ago

Contains a very poor image of potential categories: http://datascience.unm.edu/biomed505/Course/Cheminformatics/basic/descs_fingers/molec_descs_fingerprints.htm

categories are:

JJanowiak commented 3 years ago

https://www.rdkit.org/docs/source/rdkit.Chem.QED.html
https://www.rdkit.org/docs/source/rdkit.Chem.rdMolDescriptors.html
https://www.rdkit.org/docs/source/rdkit.Chem.Lipinski.html

Alex-AMC commented 3 years ago

We could use CDK with a Python wrapper to calculate a set of descriptors.

Alex-AMC commented 3 years ago

SCINE - Molassembler can handle molecular graphs and a bunch of other descriptors. It is written in C++, but Python bindings seem to be available?

Alex-AMC commented 3 years ago

Descriptors to be saved as .CSV

Alex-AMC commented 3 years ago

Calculated descriptors using the built-in class for calculating all descriptors: rdkit.ML.Descriptors

Left the computer running overnight to calculate 208 descriptors for 274,978 SMILES.
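For reference, a minimal sketch of what such a run could look like. This is not necessarily the script that was used here. It assumes the calculator class is `MolecularDescriptorCalculator` from `rdkit.ML.Descriptors.MoleculeDescriptors` and that the descriptor names come from `Descriptors._descList`, whose length varies between RDKit versions. The helper names `rdkit_descriptor_fn` and `write_descriptor_csv` are hypothetical. The RDKit import is deferred so the CSV-writing part can be tried without RDKit installed.

```python
import csv


def rdkit_descriptor_fn():
    """Return (names, compute) backed by RDKit; RDKit is imported lazily."""
    from rdkit import Chem
    from rdkit.Chem import Descriptors
    from rdkit.ML.Descriptors import MoleculeDescriptors

    # _descList holds (name, function) pairs for the descriptors RDKit ships.
    names = [name for name, _ in Descriptors._descList]
    calc = MoleculeDescriptors.MolecularDescriptorCalculator(names)

    def compute(smiles):
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # SMILES could not be parsed
            return None
        return calc.CalcDescriptors(mol)

    return names, compute


def write_descriptor_csv(smiles_iter, names, compute, out):
    """Write one SMILES per row, one descriptor per column; return skipped SMILES."""
    writer = csv.writer(out)
    writer.writerow(["SMILES", *names])
    skipped = []
    for smiles in smiles_iter:
        values = compute(smiles)
        if values is None:
            skipped.append(smiles)
            continue
        writer.writerow([smiles, *values])
    return skipped
```

Usage would be along the lines of `names, compute = rdkit_descriptor_fn()` followed by `write_descriptor_csv(smiles_list, names, compute, open("descriptors.csv", "w", newline=""))`, where `smiles_list` and the output filename are placeholders. Rows are written as they are computed, so an interrupted overnight run still leaves a partial CSV behind.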