A library to build QSAR models quickly
git clone https://github.com/ersilia-os/lazy-qsar.git
cd lazy-qsar
python -m pip install -e .
You can find example data in the fantastic Therapeutic Data Commons portal.
from tdc.single_pred import Tox
data = Tox(name = 'hERG')
split = data.get_split()
Here we select the hERG blockade toxicity dataset. Let's reshape the data for convenience.
# reshape the fetched data into plain lists of SMILES strings and labels
smiles_train = list(split["train"]["Drug"])
y_train = list(split["train"]["Y"])
smiles_valid = list(split["valid"]["Drug"])
y_valid = list(split["valid"]["Y"])
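The four lines above repeat the same pattern for each partition, so a small helper can wrap it. This is only a sketch: `split_to_lists` is a hypothetical name, not part of lazy-qsar or TDC, and it assumes every partition exposes `Drug` and `Y` columns as in the snippet above. A plain dictionary with made-up values stands in for the fetched data here.

```python
def split_to_lists(split, part):
    """Return (smiles, y) as plain lists for one partition of a split.

    Hypothetical helper: assumes split[part] has "Drug" and "Y" columns.
    """
    table = split[part]
    return list(table["Drug"]), list(table["Y"])


# Toy stand-in for a fetched split (made-up values, for illustration only)
toy_split = {
    "train": {"Drug": ["CCO", "c1ccccc1"], "Y": [0, 1]},
    "valid": {"Drug": ["CC(=O)O"], "Y": [0]},
}

toy_smiles_train, toy_y_train = split_to_lists(toy_split, "train")
toy_smiles_valid, toy_y_valid = split_to_lists(toy_split, "valid")
```

Because the helper only indexes by column name and calls `list`, it should work the same way on any table-like object that supports those two operations.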
Now we can train a model based on Morgan fingerprints.
import lazyqsar as lq
model = lq.MorganBinaryClassifier()
# time_budget (in seconds) and estimator_list can be passed as parameters of the classifier. They default to 20 s and to all the estimators available in FLAML.
model.fit(smiles_train, y_train)
from sklearn.metrics import roc_curve, auc
y_hat = model.predict_proba(smiles_valid)[:,1]
fpr, tpr, _ = roc_curve(y_valid, y_hat)
print("AUROC", auc(fpr, tpr))
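To make the reported number easier to interpret, the sketch below computes AUROC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive sample receives the higher score. It is a plain-Python illustration on made-up toy values, not the way scikit-learn or lazy-qsar compute the metric internally, and `auroc_by_pairs` is a hypothetical name.

```python
def auroc_by_pairs(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair. The loop is quadratic in the
    number of samples, so this is meant for understanding, not for speed.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores (made up): 3 of the 4 pairs are ranked correctly
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print("AUROC", auroc_by_pairs(toy_labels, toy_scores))  # AUROC 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.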
Currently, only Morgan descriptors and Ersilia embeddings are available for regression models.
You can find example data in the fantastic Therapeutic Data Commons portal.
from tdc.single_pred import Tox
data = Tox(name = 'LD50_Zhu')
split = data.get_split()
Here we select the Acute Toxicity dataset. Let's reshape the data for convenience.
# reshape the fetched data into plain lists of SMILES strings and target values
smiles_train = list(split["train"]["Drug"])
y_train = list(split["train"]["Y"])
smiles_valid = list(split["valid"]["Drug"])
y_valid = list(split["valid"]["Y"])
Now we can train a model based on Morgan fingerprints.
import lazyqsar as lq
model = lq.MorganRegressor()
# time_budget (in seconds) and estimator_list can be passed as parameters of the regressor. They default to 20 s and to all the estimators available in FLAML.
model.fit(smiles_train, y_train)
from sklearn.metrics import mean_absolute_error, r2_score
y_hat = model.predict(smiles_valid)
mae = mean_absolute_error(y_valid, y_hat)
r2 = r2_score(y_valid, y_hat)
print("MAE", mae, "R2", r2)
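As a reminder of what these two numbers measure, here is a plain-Python sketch of both metrics on made-up toy values. MAE is the average absolute difference between prediction and truth; R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The function names are hypothetical and chosen to avoid clashing with the variables above.

```python
def mean_abs_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined when y_true is constant (SS_tot is zero).
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


# Toy values (made up), not results from the LD50_Zhu dataset
toy_true = [3.0, -0.5, 2.0, 7.0]
toy_pred = [2.5, 0.0, 2.0, 8.0]
print("MAE", mean_abs_error(toy_true, toy_pred))
print("R2", round(r_squared(toy_true, toy_pred), 3))
```

Lower MAE is better, while R2 approaches 1 for a good fit and can be negative when the model does worse than predicting the mean.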
The pipeline has been validated using the Therapeutic Data Commons ADMET datasets. The results can be found in the /benchmark folder.
This library is intended only for quick-and-dirty QSAR modeling. For more complete automated QSAR modeling, please refer to Zaira Chem.
Learn about the Ersilia Open Source Initiative!