OpenSourceMalaria / Series4_PredictiveModel

Can we Predict Active Compounds in OSM Series 4?

Raymond_Lui_OSM_submission #15

Closed luiraym closed 5 years ago

luiraym commented 5 years ago

Hello, here is my submission for the second round of the OSM Series 4 Prediction Challenge. Here is my methodology:

A collaboration was formed with Davy Guan to share the work of data preparation for descriptor calculation and of model interpretation. The training dataset comprised 340 unique molecules after curation, in which duplicate entries were merged by averaging their potency values. The SMILES structures were further curated in ChemAxon Standardizer by removing solvents and salts, neutralising any charged fragments, and adding explicit hydrogens. Three-dimensional geometries were initially constructed using the UFF/MMFF94S force field, then further optimised using the PM7 method in the gas phase. For the optimised structures, 1,825 two- and three-dimensional physicochemical descriptors were calculated using the Mordred descriptor calculator package. A further 21 three-dimensional electronic descriptors were calculated using the CPCM continuum solvation model after additional HF-3c geometry optimisation. Permutation feature importance was used to select the 50 physicochemical descriptors and 9 electronic descriptors judged most relevant to modelling and predicting PfATP4 potency. QSAR models mapping the 59 descriptors to the concentration values were developed in TPOT, a genetic-algorithm-based method for optimising model hyperparameters. Six models were developed and ensembled by averaging their predictions.
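To make three of the steps above concrete (merging duplicates by averaging potency, scoring a descriptor by permutation importance, and ensembling by averaging predictions), here is a minimal standard-library sketch. It is not the code used for this submission: the function names, the use of mean absolute error as the permutation score, and the representation of a model as a plain callable that takes a list of descriptor values are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from statistics import mean


def average_duplicates(records):
    """Collapse repeated SMILES into one entry holding the mean potency.

    `records` is an iterable of (smiles, potency) pairs (assumed layout).
    """
    grouped = defaultdict(list)
    for smiles, potency in records:
        grouped[smiles].append(potency)
    return {smiles: mean(values) for smiles, values in grouped.items()}


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def permutation_importance(predict, X, y, column, n_repeats=10, seed=0):
    """Mean increase in MAE when one descriptor column is shuffled.

    `predict` is any callable mapping a descriptor row (a list) to a value.
    A larger result means the model relies more on that descriptor.
    """
    rng = random.Random(seed)
    baseline = mean_absolute_error(y, [predict(row) for row in X])
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        # Rebuild each row with only the chosen column replaced.
        permuted = [
            row[:column] + [value] + row[column + 1:]
            for row, value in zip(X, shuffled)
        ]
        score = mean_absolute_error(y, [predict(row) for row in permuted])
        increases.append(score - baseline)
    return mean(increases)


def ensemble_predict(models, row):
    """Average the predictions of several models for one descriptor row."""
    return mean(model(row) for model in models)
```

In this sketch, descriptors would be ranked by calling `permutation_importance` once per column and keeping the highest-scoring ones; the final prediction for a molecule is then `ensemble_predict(models, row)` over the fitted models.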

Each model in the final ensemble had a mean absolute error between 0.33 and 0.39 in 10-fold cross-validation.
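For readers unfamiliar with the metric, the following is a minimal sketch of how a per-fold mean absolute error can be computed in 10-fold cross-validation. It uses only the standard library and a hypothetical mean-predicting baseline in place of the TPOT pipelines; the function names and the shuffled round-robin fold assignment are assumptions, not the procedure used for the figures above.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_mae(fit, X, y, k=10, seed=0):
    """Return the mean absolute error on each held-out fold.

    `fit(X_train, y_train)` must return a callable that predicts one row.
    Assumes len(X) >= k so that no fold is empty.
    """
    fold_errors = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        errors = [abs(y[i] - predict(X[i])) for i in test_idx]
        fold_errors.append(mean(errors))
    return fold_errors


def fit_mean_baseline(X_train, y_train):
    """Hypothetical stand-in model: always predicts the training mean."""
    centre = mean(y_train)
    return lambda row: centre
```

Averaging the list returned by `cross_validated_mae` gives a single cross-validated MAE for a model, which is the kind of number that can be compared across candidate models.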