🦠 Model Request: Growth inhibitors of Neisseria gonorrhoeae

Richiio commented 10 months ago

Model Name

Growth Inhibitors of Neisseria gonorrhoeae

Model Description

The authors curated a dataset of 282 compounds from ChEMBL, of which 160 (56.7%) were labelled as active N. gonorrhoeae inhibitor compounds. They used this dataset to build a naïve Bayesian model and used it to screen a commercial library. With this method, they identified and validated two hits. We have used the dataset to build a model using Ersilia’s set of AI modelling tools.

Slug

ngonorrhoeae-inh

Tag

Antimicrobial activity, ChEMBL

Publication

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8274436/

Source Code

https://github.com/ersilia-os/lazy-qsar

License

GPL-3.0

GemmaTuron commented 10 months ago

Hi @Richiio I've updated the metadata, have a look! To work on this, the best will be to use a notebook - you can add it in the same repository so we can discuss model performances. I suggest looking at our training material to understand how to get started, we have many examples of the lazy-qsar package application

GemmaTuron commented 10 months ago

/approve

github-actions[bot] commented 10 months ago

New Model Repository Created! 🎉

@Richiio ersilia model respository has been successfully created and is available at:

🔗 ersilia-os/eos5cl7

Next Steps ⭐

Now that your new model respository has been created, you are ready to start contributing to it!

Here are some brief starter steps for contributing to your new model repository:

Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository

🍴 Get started by creating a fork of your new model repository - docs
👯 Clone your forked repository - docs
✏️ Make edits to your new forked model repository - docs - Edits might include:
- Updating the README.md file to accurately describe your model
- Add source code for your model
- Adding documentation for your model
🚀 Open a Pull Request from your forked repository to the original repository. This will allow you to bring your local changes into the new ersilia model repository that was just created! - docs

Additional Resources 📚

If you have any questions, please feel free to open an issue and get support from the community!

Richiio commented 10 months ago

Hi @GemmaTuron @HellenNamulinda Using the publication link, the authors didn't include the datasets used via the publication link given. They did however, provide how they got their datasets which was from ChemBL and the process they went through in order to get their training dataset which can be seen from this gotten from their publication

The model training set was extracted from ChEMBL (17) (https://www.ebi.ac.uk/chembl/) by searching for whole-cell growth inhibition data versus N. gonorrhoeae with the query keywords “Neisseria gonorrhoeae” and “MIC”. The curation process consisted of manually inspecting the dataset to remove duplicates, selecting a conservative MIC value of 8.0 μg/mL, converting as necessary MIC units from μM to μg/mL, as well as removing false positives, which are compounds that were incorrectly deposited in ChEMBL as actives against N. gonorrhoeae. Subsequently, a manual inspection of the 2D structures of all active compounds was performed to detect reactive and Pan Assay INterference compoundS (PAINS) (18, 19), which we then excluded from the training set in a process called “structural pruning” – an extension of the data pruning approach described in our previous work (20, 21). The final dataset contains 282 compounds, of which 160 (56.7%) were labelled as active compounds

The supplementary data they provided is shown below:

NIHMS1715097-supplement-Supplemental_file.docx

This file contains analysis they performed and explanations to enable us understand what they did. However, it contains no dataset file, although we were meant to have datasets saved as sdf files here

I was able to find some of the datasets on another site where they released their publication. This can be found here: https://link.springer.com/article/10.1007/s11095-020-02876-y. But I was expecting 6 dataset files as found in doc file provided above but I got five which wasn't named correctly. However, when I carried out some data analysis of the files, I have an inituition which file was used for data analysis based on the publication. The notebook can be found here

https://colab.research.google.com/drive/1Q8N8eam513A3NViTfiVqQypZJqOVE93q?usp=sharing

GemmaTuron commented 10 months ago

Hi @Richiio !

Indeed, that is the original publisher website, pubmed is just an aggregated of the publications indexed - so you got the data correctly https://link.springer.com/article/10.1007/s11095-020-02876-y#Sec6 The work so far is good, could you please move the notebook to the github repo, and add the files there as well under a folder /data? so we can have everything in the same place.

Next you should clean up the training file to end up with the list of compounds (smiles) and their activity for N.gonorreahe. With that, we can start training the model. But best to do everything on github directly.

Richiio commented 10 months ago

Hi @GemmaTuron @HellenNamulinda I have updated my repo on tasks done so far. A brief summary:

I was able to get the SMILES of all compounds from their IUPAC names
A processed data file containing SMILES and Activity
A brief data analysis of the file
Checked to see there is an almost equal distribution of Active and Inactive so our model performs well for both predictions
Using the notebook for lazy qsar training. I ran all cells in Ersilia's m3_building_a_model file which performed poorly on my dataset. The notebook and it's result is contained there

To-do

A new notebook to be updated to improve model performance (the example used there was for a regression task with a cut-off, where we ended up having higher and lower values with 1 and 0 respectively). Our dataset provided us with that result
I currently have 265 rows of data which was split for training and testing. The number is quite poor but that was what was given. Maybe check Chembl for more data

Suggestions

The authors reported the Naive Bayes classifier as what gave them the best results but we are using xgboost. I haven't compared performances between both yet

GemmaTuron commented 10 months ago

Thanks for the update and explanations @Richiio, can you share the links to the notebooks?

Richiio commented 10 months ago

Hi @GemmaTuron Apologies for that, I thought I included that in the repo:

https://colab.research.google.com/drive/1a28Kro03Bfy2SPzTnh6H-Ymkeq-l5u-P?usp=sharing

GemmaTuron commented 10 months ago

Hi @Richiio

Please update all code and data to the repository and then let's start by going back to the data processing. How do you obtain the SMILES from the IUPAC names? I see you are trying on one hand the STOUT package and on the other to get the SMILES from ChEMBL? training_data['SMILES'] = training_data['Molecule Name'].apply(lambda x: get_smiles_from_chembl(x)) I am seeing the same smiles all over, not different ones. Please clean up the data processing notebook so that is easier to follow. Focus only on the first file, not the rest.

Richiio commented 10 months ago

Hi @GemmaTuron I initially wanted to get the SMILES directly from CheMBL using an API, but as you rightly mentioned I was getting the same results all over, which wasn't the output I wanted.

So, instead of doing that I decided to use the SMILES2IUPAC package to get my SMILES, initially, there were some IUPAC names, it couldn't provide accurate SMILES for, specifically those that has curly brackets: "{" and not the normal square brackets "[". It threw an error for this. As seen below: N-{4-[2-(pyridin-3-yl)-1,3-thiazol-4-yl]phenyl}acetamide I handled this by looking through the data as it was small and changing the { to [ . However, if there were still some inputs I had overlooked, I asked it to input a different value for this such as None and checked for those manually on CheMBL, then added it to my dataset.

We also had the case of an input such as this as the IUPAC name: (1R,3S,5R,8R,10R,11S,12S,13R,14S)-8,12,14-trihydroxy-5-methyl-11,13-bis(methylamino)-2,4,9-trioxatricyclo[8.4.0.0³,⁸]tetradecan-7-one

For these inputs, I checked for their SMILES directly from CheMBL

I just observed something I had missed previously: Not all IUPAC are loaded when I ran: training_data = PandasTools.LoadSDF(File) The log is in the notebook

GemmaTuron commented 10 months ago

Hi @Richiio

Good, as always please provide numbers - how many names are not loaded from the sdf file? how many needed to be manually fetched? If there are things you tried and did not work please do not keep them in the final notebooks as it makes difficult to follow the process. Update all the code and data to github (and make sure to point to the right paths for loading the data etc) so we can have a closer look

Richiio commented 10 months ago

Hi @GemmaTuron 18 of these molecules were not loaded as I got an error for 18 of these rows. This checks out the missing values in the dataset. I have 264 rows of data but the publication mentioned they trained on 282 rows of data. Here is the error log:

training_data = PandasTools.LoadSDF(File) #Not all my molecules are parsed. The error I referenced in my issue.

[09:42:47] Warning: molecule is tagged as 2D, but at least one Z coordinate is not zero. Marking the mol as 3D.
[09:42:47] Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13064
[09:42:47] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 1 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13231
[09:42:47] ERROR: Explicit valence for atom # 1 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13320
[09:42:47] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13419
[09:42:47] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13507
[09:42:47] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 1 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13592
[09:42:47] ERROR: Explicit valence for atom # 1 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 8 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13678
[09:42:47] ERROR: Explicit valence for atom # 8 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 11 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13764
[09:42:47] ERROR: Explicit valence for atom # 11 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 8 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13848
[09:42:47] ERROR: Explicit valence for atom # 8 N, 4, is greater than permitted
[09:42:47] Warning: ambiguous stereochemistry - zero final chiral volume - at atom 20 ignored
[09:42:47] Explicit valence for atom # 10 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 18250
[09:42:47] ERROR: Explicit valence for atom # 10 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 21494
[09:42:47] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 21574
[09:42:47] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 8 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 23764
[09:42:47] ERROR: Explicit valence for atom # 8 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 9 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 23844
[09:42:47] ERROR: Explicit valence for atom # 9 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 18 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 23962
[09:42:47] ERROR: Explicit valence for atom # 18 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 5 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 31515
[09:42:47] ERROR: Explicit valence for atom # 5 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 17 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 32931
[09:42:47] ERROR: Explicit valence for atom # 17 N, 4, is greater than permitted

Richiio commented 10 months ago

For the notebooks: My first train, although it had a better AUC-ROC curve score (0.83), it was only actively predicting active molecules and inaccurately predicting inactive molecules as can be seen in the confusion matrix here: https://github.com/Richiio/eos5cl7/blob/main/Notebooks/m3_building_a_model.ipynb We had this result initially even though the active and inactive balance was preserved. I hadn't tried other hyperparameters here

This was my best so far in the four notebooks I ran. I had an AUC-ROC curve score of 0.73, it was able to predict 15 active and inactive molecules correctly. Here is the result when I calculated for accuracy, precision, f1-score and recall: Accuracy: 66.67% Precision: 68.18% Recall: 65.22% F1-Score: 66.55% Which was the best it has done so far. https://github.com/Richiio/eos5cl7/blob/main/Notebooks/2nd_train_no_change.ipynb

The rest of the notebooks are in my repo. https://github.com/Richiio/eos5cl7/tree/main/Notebooks

Since, I used SMILES2IUPAC to have my conversion, I am not confident about the accuracy of the SMILES it brought up given the inconsistency in the IUPAC representation of the data.

Richiio commented 10 months ago

The model for the second notebook has been saved pending improvements

GemmaTuron commented 10 months ago

Hi @Richiio

Thanks for this. We will go over them live in tomorrow's meeting if we have time!

HellenNamulinda commented 10 months ago

Hello, @Richiio After a further look at the data columns, we don't have to manually look up for SMILES. The data includes a column named ROMol (Read-Only Molecule), which is an instance of the ROMol class in RDKit. This column can be used directly to obtain SMILES representations.

If you specify the SMILES column name using the smilesName parameter (in this case, 'SMILES') when loading the file, the PandasTools will automatically add the SMILES column to the DataFrame.

raw_data = PandasTools.LoadSDF(File, smilesName='SMILES')# Create a pandas DataFrame with 'SMILES' column.
#If Smiles Column isn't specified while loading, it can be added by raw_data['SMILES'] = data['ROMol'].apply(lambda mol: Chem.MolToSmiles(mol) if mol is not None else None)

Meanwhile, their naive Bayesian model trained using ECFP6 fingerprints showed somewhat better results with AUC ROC = 0.8470, recall = 0.7736, precision = 0.8200, specificity = 0.7769, F1 score = 0.7961, CK = 0.5456, and MCC = 0.5467. This performance difference is to be attributed to the dataset size discrepancy.

The warning about the molecule being tagged as 2D, but with at least one Z coordinate not zero, might have led to the failure to load the 18 molecules. Their paper mentions a manual inspection of 2D structures to detect reactive and PAINS compounds, which were then excluded.

So, we have to treat the training data provided as raw data and do follow the steps they did, including removal of salts.

HellenNamulinda commented 10 months ago

@GemmaTuron, I understand using the dataset used in the original paper is necessary for direct comparison and validation.

But incase it's permissible, the chembltools retrieved assays for Neisseria gonorrhoeae from the ChEMBL database(chembl_33). It can be an opportunity to incorporate additional data into the training set, considering that updates may have occurred post-2021, when the original paper was published.

In the original paper, the authors manually inspected the dataset, selected a conservative MIC value of 8.0μg/mL( with unit conversions). We could consider a similar approach with the newly acquired data if it is permitted to extend the dataset. The problem might be with the removal of false positives(compounds incorrectly labeled as actives against N. gonorrhoeae in ChEMBL)

Richiio commented 10 months ago

@HellenNamulinda @GemmaTuron Using the command @HellenNamulinda suggested, I was able to get the SMILES column, which has validated the accuracy of our SMILES. This error still persisted though which was to be expected as seen below:

[21:52:45] Warning: molecule is tagged as 2D, but at least one Z coordinate is not zero. Marking the mol as 3D.
[21:52:45] Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13064
[21:52:45] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 1 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13231
[21:52:45] ERROR: Explicit valence for atom # 1 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13320
[21:52:45] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13419
[21:52:45] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13507
[21:52:45] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 1 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13592
[21:52:45] ERROR: Explicit valence for atom # 1 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 8 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13678
[21:52:45] ERROR: Explicit valence for atom # 8 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 11 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13764
[21:52:45] ERROR: Explicit valence for atom # 11 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 8 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13848
[21:52:45] ERROR: Explicit valence for atom # 8 N, 4, is greater than permitted
[21:52:45] Warning: ambiguous stereochemistry - zero final chiral volume - at atom 20 ignored
[21:52:45] Explicit valence for atom # 10 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 18250
[21:52:45] ERROR: Explicit valence for atom # 10 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 21494
[21:52:45] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 21574
[21:52:45] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 8 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 23764
[21:52:45] ERROR: Explicit valence for atom # 8 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 9 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 23844
[21:52:45] ERROR: Explicit valence for atom # 9 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 18 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 23962
[21:52:45] ERROR: Explicit valence for atom # 18 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 5 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 31515
[21:52:45] ERROR: Explicit valence for atom # 5 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 17 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 32931
[21:52:45] ERROR: Explicit valence for atom # 17 N, 4, is greater than permitted

So we can conclude that we would be working with only 264 molecules against the 282 used for training by the original authors

Richiio commented 10 months ago

New results from model training with the new dataset

AUC-ROC curve: 0.81

The following results were gotten: Accuracy: 79.25% Precision: 75.86% Recall: 84.62% F1-Score: 79.89%

Richiio commented 10 months ago

@GemmaTuron @HellenNamulinda This is much better than the results I got previously.

The model has been saved

GemmaTuron commented 10 months ago

@HellenNamulinda that was a very good suggestion many thanks! @Richiio now that you are obtaining more consistent results, try to reformat your code as we were discussing today, add the folders under the /framework and have one notebook per task (data processing, model training) you can save the images of the ROC curves for example (5-fold cross validated) in a /figures folder for easy review.

Richiio commented 10 months ago

Hi @GemmaTuron I've reformated based on the suggestions you made in the meeting. Currently working on resolving the error I am getting when incorporating to the hub

TypeError: No registered converter was able to produce a C++ rvalue of type std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > from this Python object of type ExplicitBitVect

Richiio commented 10 months ago

I have opened a pull request for the model.

GemmaTuron commented 10 months ago

Hi @Richiio

I've merged the PR but before I work on your code to provide feedback please:

Make sure to clean up the notebooks so they are easier to follow, add comments and explain your rationale, particularly in the model training notebook
I do not see the 5 fold cross validation I've been asking for the last days. Please work on that
Make sure to direct the paths to the right folders, as they are now not well indicated. I suggest using the os package and specifying the DATAPATH
Please do not use !pip install in the notebooks. Let me know once these changes are made so me and @HellenNamulinda can review the work.

Richiio commented 10 months ago

Hi @GemmaTuron @HellenNamulinda I got an error when retraining with the 5fold cross-validation. This error is the same that was produced by the Ersilia's model hub when I made a pull request on the model.

The notebook with the error can be found on my github profile here https://github.com/Richiio/eos5cl7/blob/main/model/framework/code/02_Model_training.ipynb

GemmaTuron commented 10 months ago

Hi @Richiio The error you are getting is related to RDKIT processing of the smiles, so I suggest you look closer at what you are inputting to rdkit. It is not related to the cross validation.

ArgumentError: Python argument types in
    rdkit.Chem.rdMolDescriptors.GetHashedMorganFingerprint(NoneType)
did not match C++ signature:
    GetHashedMorganFingerprint(RDKit::ROMol mol, unsigned int radius, unsigned int nBits=2048, boost::python::api::object invariants=[], boost::python::api::object fromAtoms=[], bool useChirality=False, bool useBondTypes=True, bool useFeatures=False, boost::python::api::object bitInfo=None, bool includeRedundantEnvironments=False)

In your notebook, I don't understand this bit of code:

        # Convert Morgan fingerprints to SMILES strings
        smiles_train = ["".join(map(str, x)) for x in X_train]
        smiles_test = ["".join(map(str, x)) for x in X_test]

Why do you convert back Morgan fps to smiles? This seems to be the issue as the smiles_train that you are passing does not seem to be in the correct format. Make sure to use print statements to see what you are passing in the function. Also remember that 60 seconds is not enough for an autoML algorithm and we would like to see what happens with longer times, and other descriptors as well, not only morgan. Work on these changes and let me know the outcome

Richiio commented 10 months ago

@GemmaTuron When I had multiple notebooks, I was training with Morgan and ersilia embeddings and on each train iteration, Morgan performed better than ersilia's embeddings. I decided to work strictly with Morgan but I would test with ersilia for the 5fold and get back to you

GemmaTuron commented 10 months ago

Hi @Richiio We are exploring the space and trying to find the best model, so do not delete any until we decide what works best. Please write a good summary here of the different results once you complete the different trainings with the improvements I suggest above. Include things like how many tests have you done, the performance of each model, which model would you choose as best, how we could validate it...

Richiio commented 10 months ago

Hi @GemmaTuron I've been unable to train any models due to an error with rdkit when trying to get molfromSmiles. The original maintainer decribed a bug which can be found here: "https://github.com/rdkit/rdkit/issues/7036"

When just running the initial notebook from Ersilia. The error I described above persisted

ArgumentError: Python argument types in
    rdkit.Chem.rdMolDescriptors.GetMorganFingerprintAsBitVect(NoneType, int, int)
did not match C++ signature:
    GetMorganFingerprintAsBitVect(RDKit::ROMol mol, unsigned int radius, unsigned int nBits=2048, boost::python::api::object invariants=[], boost::python::api::object fromAtoms=[], bool useChirality=False, bool useBondTypes=True, bool useFeatures=False, boost::python::api::object bitInfo=None, bool includeRedundantEnvironments=False)

HellenNamulinda commented 10 months ago

Hello @Richiio, Just like @GemmaTuron asked, what exactly is this code doing?

   # Convert Morgan fingerprints to SMILES strings
  smiles_train = ["".join(map(str, x)) for x in X_train]
  smiles_test = ["".join(map(str, x)) for x in X_test]

Remember when you convert to Morgan fingerprints, you end up with binary vectors(0s and 1s). So, that code will give you smiles as strings of 0s and 1s(which is wrong)

I don't understand why you are generating morgan fingerprints( and then convert them to smiles) since you are using lazy-qsar. Lazy qsar expects smiles inputs and it will generate the corresponding descriptors based on what you choose forexample lq.MorganBinaryClassifier(for Morgan Fingerprints), lq.MordredBinaryClassifier(for Mordred descriptors) or lq.ErsiliaBinaryClassifier(for ersilia embeddings).

Check your previous notebook, you were doing something like below

# Filter out rows with None or invalid SMILES
train = train.dropna(subset=['SMILES'])
train = train[train['SMILES'].apply(lambda x: Chem.MolFromSmiles(x) is not None)]

# Separate features (SMILES) and target variable (EXP)
smiles_train = train[SMILES]
y_train = train[EXP]

# Initialize and fit the model
model = lq.MorganBinaryClassifier(time_budget_sec=60, estimator_list=["rf", "lgbm", "xgboost"])
model.fit(smiles_train, y_train)

The fingerprints you are generating with the function preprocess_and_generate_fingerprints can be used to train a model from scratch(say naive bayes), but not as inputs to lazy-qsar model.

Richiio commented 10 months ago

Hi @HellenNamulinda That was an error with my comments. That is just a check to ensure that the Smiles I am passing to the convert to Morgan fingerprints are in strings.

This error I showed above still occurs when I am just running the example notebook that was provided. The error has something to do with the rdkit model. An issue was raised by someone on Friday concerning that

In the function cross_validation, I am converting to Morgan fingerprints by calling the convert_to_morgan_fingerprints, then performing mod training

Richiio commented 10 months ago

Hi @HellenNamulinda That was an error with my comments. That is just a check to ensure that the Smiles I am passing to the converted to Morgan fingerprints are in strings.

This error I showed above still occurs when I am just running the example notebook that was provided. The error has something to do with the rdkit model. An issue was raised by someone on Friday concerning that

In the function cross_validation, I am converting to Morgan fingerprints by calling the convert_to_morgan_fingerprints, then performing mod training

I am getting the error from this line Chem.MolFromSmiles(x). It doesn't generate 0's and 1's as it should

GemmaTuron commented 10 months ago

Hi @Richiio

The issue you are pointing in RDKIT has nothing to do with your error, or none that I can see directly. Can you please describe why you are pointing to this bug as the source of your error?

The function MolFromSmiles(smiles) generates a molecule object in RDKIT and requires that you are passing a SMILES string. The GetMorganFingerprintAsBitVect requires a molecule Please revise the code and make sure the x you are passing is a SMILES string and then a molecule to the next function. There is plenty of erro information if you search google.

The line that should give you the error tip is: rdkit.Chem.rdMolDescriptors.GetMorganFingerprintAsBitVect(NoneType, int, int)

Richiio commented 10 months ago

@GemmaTuron I am getting this error because I have a nonetype in mol when I am meant to be having a vector as in [[0, 1, 0], [1,0,0]]. Something like this but instead I am passing in a None as my smiles wasn't converted to a mol by the rdkit function.

I had the same error when I just ran the example notebook provided by Ersilia. Not my personal notebook.

GemmaTuron commented 10 months ago

@Richiio

The bug report in RDKIT does not have anything to do with the issue you describe, from what I understand. Please explain that to me so I can provide further comments. You need to modify the code to make sure you are not passing a None object. Do not trust the code as it is, work on it to make sure it works.

Richiio commented 10 months ago

@GemmaTuron @HellenNamulinda Model Training results with a 5Fold cross-validation

Model Training	Accuracy	ROC-AUC score	Precision	Recall
Morgan Fingerprints, Train time (600)	79.55	0.81	78.26	81.82
Morgan Fingerprints, Train time (60)	84.09	0.85	77.78	95.45
Ersilia Embeddings, Train time (60)	70.45	0.83	68.00	77.27
Ersilia Embeddings, Train time (600)	77.27	0.83	75.0	81.82

Richiio commented 10 months ago

@GemmaTuron

I was getting the NoneType error as I didn't handle cases where we can't get the mol for a given SMILE. I did that in my previous notebook as valid_mols.

Which iteration should I incorporate to Ersilia?

Richiio commented 10 months ago

@GemmaTuron It's been updated

GemmaTuron commented 10 months ago

Hi @Richiio sorry but I don't see the updates in the repo

Richiio commented 10 months ago

Accuracies are updated in the table above. Repo would be updated by 7pm. My laptop is currently dead.

GemmaTuron commented 10 months ago

Hi @Richiio

The notebooks are looking much better now, thanks for the work. What I cannot yet see is the 5 fold cross validation. If you do not understand what I mean by that please just ask. Use the random split function to save 5 random splits of the data, and then train 5 models per each of the descriptor+method combination, and plot the ROC curves as the average of the 5 folds - you can ask @leilayesufu for help in this line because she also did it for her model.

As we discussed in the meeting, there is no need to save yet the models because we will only save the final model trained with all the data, so you can delete the ones trained on train/test splits.

The next steps would be:

Prepare the 5 train/test split sets
Train the 5 models and plot the AUROCs as the average +/- sd
Decide which model is the best and use that combination of descriptors and method to train a final model with all the data
Incorporate the model in the hub

Richiio commented 9 months ago

Good evening @GemmaTuron I've made a pull request with all the corrections above. Waiting for review

GemmaTuron commented 9 months ago

Hi @Richiio

Good, but before merging the code, can you summarise the results, and what is your final choice? Take into account the AUROC+- stdev Also, please consider redoing the work with the fixed train and test sets, as we were discussing in the meeting and I pointed on the issue above, save 5 train_test split files so that all the models are built on the same train test folds, and please also save the aurocs in the figures folder so it is easy to check the results. I am unsure what the last part of the notebook is doing, after the 5 fold model training. What would be your next steps now?

Richiio commented 9 months ago

@GemmaTuron What the last notebook is doing? Our dataset is quite small, so the last part of the notebook, we trained our model with the whole dataset, that is without the train-test split for the different models. For testing, it wasn't necessary introduced to new data it hadn't seen.

Next step:

Save 5 train-test split, and use this files for the 5Fold cross validation
Retrain the different models with this split
Based on accuracy and ROC-AUC curve, recommend the model to use and why?

GemmaTuron commented 9 months ago

Hi @Richiio

Thanks, the plan sounds good. For the last part of the notebook: we only need to do it with the best model, the one we want to save for use. It does not make sense to evaluate this model performance unless we have an external dataset (I think @HellenNamulinda pointed to some datasets in ChEMBL) The last step, let's not forget about it will be to actually incorporate the model, once is in the checkpoints folder we need to modify the code in main.py to make sure we are loading it properly, and revise the dependencies in Dockerfile.

Richiio commented 9 months ago

Alright, I'm working on that now

GemmaTuron commented 9 months ago

Hi @Richiio

I have a few comments on your notebooks, before incorporating the model please help me understand:

What are the train_sets and testsets lists? You are appending the same train test more than once inside each for loop
What is this function doing: tprs.append(np.interp(mean_fpr, fpr, tpr)) - where are you getting the mean_fpr from inside the for loop? I am particularly concerned for this point since you are using the tprs list down to calculate the mean auc
What are you using the roc_curve for? fpr, tpr, thresholds = roc_curve(y[test], y_hat_proba)
Where are the all_true_labels being used?
The Contingency table is done just using an example fold? If so, which one?

I think you have legacy code in each cell that is not performing any function. Please clean it up and make sure that you are averaging the right numbers (you can work on a single example of 60 seconds and once we are happy with it we extend it to the rest of the validations).

Richiio commented 9 months ago

@GemmaTuron

In the current notebook, the train_sets and test_sets have no effect (Cleaning that up), it was for a previous iteration
The mean_fpr is gotten from this code defined outside the function np.linspace(0, 1, 100). It represents a range of false positive rates (fpr) that will be used to interpolate the true positive rates (tpr) obtained from the ROC curve for each fold. For the 5fold, we have different pairs of false positive rate and true positive rate. So the code np.interp(mean_fpr, fpr, tpr) is used to interpolate the true positive rates (tpr) at the points specified by mean_fpr. This ensures that you have TPR values at the same false positive rate points across these folds.
The function takes the true labels (y[test]) and the predicted probabilities (y_hat_proba) as input and returns the False Positive Rate (fpr), True Positive Rate (tpr), and decision thresholds (thresholds). This was later used to plot the ROC curve and calculate the Area Under the Curve (AUC).
I used it in order to calculate the average classification report across all the folds. It is used to get all the true predictions for each fold.
The contigency table was for just one fold. An oversight on my part.

GemmaTuron commented 9 months ago

@Richiio

I've done a bit of code refactoring:

The data for model training should be inside processed_data folder, I've put it there under training_set.csv, so it is clear.
The paths to the train_test split folder were not correct, the files were being saved in the notebook folder. I've moved them to the corresponding folder
I've eliminated the contingency tables, since they depend on the cut off selected, they are not that informative, we will base our decision on the roc curves
I've done some clean up on the model training, please look at it to see and understand the changes. There were variables not being used, instances for example of cv being initialized when it had already been calculated ... it is important to always revise the code to make sure we clean up
For the final model, we do not need to predict the values, as we will predict on the same set as the training data, it won't be really informative. I don't understand why you were using a for loop in the final model training, since we use all the data this won't be necessary I hope this is helpful, please have a look and let me know!

Richiio commented 9 months ago

Thanks @GemmaTuron I've seen the clean-up you did. I had issues pointing to the train_test files I created and decided on keeping them with the notebooks. This was very helpful!

ersilia-os / ersilia