Closed Richiio closed 9 months ago
Hi @Richiio, I've updated the metadata, have a look! The best way to work on this is in a notebook; you can add it to the same repository so we can discuss model performance. I suggest looking at our training material to understand how to get started; it has many examples of how the lazy-qsar package is applied.
/approve
@Richiio ersilia model repository has been successfully created and is available at:
Now that your new model repository has been created, you are ready to start contributing to it!
Here are some brief starter steps for contributing to your new model repository:
Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository
README.md file to accurately describe your model
If you have any questions, please feel free to open an issue and get support from the community!
Hi @GemmaTuron @HellenNamulinda, the authors did not include their datasets at the publication link given. They did, however, describe how they obtained the data from ChEMBL and the process they followed to build the training set, as quoted from the publication below:
The model training set was extracted from ChEMBL (17) (https://www.ebi.ac.uk/chembl/) by searching for whole-cell growth inhibition data versus N. gonorrhoeae with the query keywords “Neisseria gonorrhoeae” and “MIC”. The curation process consisted of manually inspecting the dataset to remove duplicates, selecting a conservative MIC value of 8.0 μg/mL, converting as necessary MIC units from μM to μg/mL, as well as removing false positives, which are compounds that were incorrectly deposited in ChEMBL as actives against N. gonorrhoeae. Subsequently, a manual inspection of the 2D structures of all active compounds was performed to detect reactive and Pan Assay INterference compoundS (PAINS) (18, 19), which we then excluded from the training set in a process called “structural pruning” – an extension of the data pruning approach described in our previous work (20, 21). The final dataset contains 282 compounds, of which 160 (56.7%) were labelled as active compounds
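The unit conversion and thresholding described in the quote can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and treating a compound as active when its MIC is at or below 8.0 μg/mL is an assumption about how the cutoff was applied.

```python
# Sketch only: function names are hypothetical, and the "active if
# MIC <= cutoff" rule is an assumption, not taken from the paper's files.
MIC_CUTOFF_UG_PER_ML = 8.0  # threshold quoted in the publication


def um_to_ug_per_ml(mic_um: float, mol_weight: float) -> float:
    """Convert a MIC from micromolar to ug/mL given molecular weight in g/mol."""
    # 1 uM = 1e-6 mol/L, so uM * MW gives ug/L; divide by 1000 for ug/mL.
    return mic_um * mol_weight / 1000.0


def label_active(mic_value: float, unit: str, mol_weight: float) -> int:
    """Return 1 if the MIC (in ug/mL) is at or below the cutoff, else 0."""
    if unit == "ug/mL":
        mic = mic_value
    elif unit == "uM":
        mic = um_to_ug_per_ml(mic_value, mol_weight)
    else:
        raise ValueError(f"Unsupported MIC unit: {unit!r}")
    return int(mic <= MIC_CUTOFF_UG_PER_ML)


# A 10 uM MIC for a 400 g/mol compound is 4 ug/mL, so it counts as active.
print(um_to_ug_per_ml(10.0, 400.0), label_active(10.0, "uM", 400.0))
```

The molecular weight would have to come from the structure itself, so this step only makes sense once each record has a valid molecule.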
The supplementary data they provided is shown below:
NIHMS1715097-supplement-Supplemental_file.docx
This file contains the analyses they performed and explanations that help us understand what they did. However, it contains no dataset file, although the datasets were meant to be saved as SDF files here.
I was able to find some of the datasets on another site where the publication was released: https://link.springer.com/article/10.1007/s11095-020-02876-y. I was expecting six dataset files, as listed in the .docx file above, but I found only five, and they were not named correctly. However, after some data analysis of the files, I have an intuition about which file was used for the analysis in the publication. The notebook can be found here:
https://colab.research.google.com/drive/1Q8N8eam513A3NViTfiVqQypZJqOVE93q?usp=sharing
Hi @Richiio !
Indeed, that is the original publisher's website; PubMed is just an aggregator of the indexed publications, so you got the data correctly: https://link.springer.com/article/10.1007/s11095-020-02876-y#Sec6 The work so far is good. Could you please move the notebook to the GitHub repo and add the files there as well, under a /data folder, so we can have everything in the same place?
Next, you should clean up the training file to end up with the list of compounds (SMILES) and their activity against N. gonorrhoeae. With that, we can start training the model. But it is best to do everything on GitHub directly.
Hi @GemmaTuron @HellenNamulinda I have updated my repo on tasks done so far. A brief summary:
To-do
Suggestions
Thanks for the update and explanations @Richiio, can you share the links to the notebooks?
Hi @GemmaTuron Apologies for that, I thought I included that in the repo:
https://colab.research.google.com/drive/1a28Kro03Bfy2SPzTnh6H-Ymkeq-l5u-P?usp=sharing
Hi @Richiio
Please update all code and data to the repository and then let's start by going back to the data processing. How do you obtain the SMILES from the IUPAC names? I see you are trying on one hand the STOUT package and on the other to get the SMILES from ChEMBL?
training_data['SMILES'] = training_data['Molecule Name'].apply(lambda x: get_smiles_from_chembl(x))
I am seeing the same smiles all over, not different ones.
Please clean up the data processing notebook so that it is easier to follow. Focus only on the first file, not the rest.
Hi @GemmaTuron, I initially wanted to get the SMILES directly from ChEMBL using an API but, as you rightly mentioned, I was getting the same result for every row, which wasn't the output I wanted.
So instead I decided to use the SMILES2IUPAC package to get my SMILES. Initially, there were some IUPAC names it couldn't convert to accurate SMILES, specifically those that use curly brackets "{" rather than the usual square brackets "[". It threw an error for these, for example:
N-{4-[2-(pyridin-3-yl)-1,3-thiazol-4-yl]phenyl}acetamide
I handled this by looking through the data, as it was small, and changing the curly brackets to square brackets.
For any inputs I had still overlooked, I had the conversion return a placeholder such as None, looked those compounds up manually on ChEMBL, and added the SMILES to my dataset.
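The bracket fix and the None fallback described above could be automated along these lines. This is a sketch under assumptions: `convert` is a hypothetical stand-in for whichever name-to-SMILES tool is used, passed in as a plain callable, so nothing here reflects a real package's API.

```python
from typing import Callable, Optional

# Map curly braces to square brackets, mirroring the manual fix above.
_BRACE_TABLE = str.maketrans({"{": "[", "}": "]"})


def normalize_brackets(iupac_name: str) -> str:
    """Replace curly braces in an IUPAC name with square brackets."""
    return iupac_name.translate(_BRACE_TABLE)


def name_to_smiles(iupac_name: str, convert: Callable[[str], str]) -> Optional[str]:
    """Run the (hypothetical) converter on the normalized name.

    Returns None when the converter raises, so the row can be flagged
    for a manual ChEMBL lookup instead of stopping the whole run.
    """
    try:
        return convert(normalize_brackets(iupac_name))
    except Exception:
        return None


print(normalize_brackets("N-{4-[2-(pyridin-3-yl)-1,3-thiazol-4-yl]phenyl}acetamide"))
```

Counting the rows that come back as None gives the number of names that had to be fetched manually.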
We also had the case of an input such as this as the IUPAC name:
(1R,3S,5R,8R,10R,11S,12S,13R,14S)-8,12,14-trihydroxy-5-methyl-11,13-bis(methylamino)-2,4,9-trioxatricyclo[8.4.0.0³,⁸]tetradecan-7-one
For these inputs, I took the SMILES directly from ChEMBL.
I just observed something I had missed previously:
Not all IUPAC names were loaded when I ran:
training_data = PandasTools.LoadSDF(File)
The log is in the notebook
Hi @Richiio
Good. As always, please provide numbers: how many names are not loaded from the SDF file? How many needed to be fetched manually? If there are things you tried that did not work, please do not keep them in the final notebooks, as they make the process difficult to follow. Upload all the code and data to GitHub (and make sure to point to the right paths for loading the data, etc.) so we can have a closer look.
Hi @GemmaTuron, 18 of the molecules were not loaded because RDKit raised an error for those rows. This accounts for the missing values in the dataset: I have 264 rows of data, but the publication mentions training on 282. Here is the error log:
training_data = PandasTools.LoadSDF(File) #Not all my molecules are parsed. The error I referenced in my issue.
[09:42:47] Warning: molecule is tagged as 2D, but at least one Z coordinate is not zero. Marking the mol as 3D.
[09:42:47] Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13064
[09:42:47] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 1 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13231
[09:42:47] ERROR: Explicit valence for atom # 1 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13320
[09:42:47] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13419
[09:42:47] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13507
[09:42:47] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 1 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13592
[09:42:47] ERROR: Explicit valence for atom # 1 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 8 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13678
[09:42:47] ERROR: Explicit valence for atom # 8 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 11 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13764
[09:42:47] ERROR: Explicit valence for atom # 11 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 8 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13848
[09:42:47] ERROR: Explicit valence for atom # 8 N, 4, is greater than permitted
[09:42:47] Warning: ambiguous stereochemistry - zero final chiral volume - at atom 20 ignored
[09:42:47] Explicit valence for atom # 10 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 18250
[09:42:47] ERROR: Explicit valence for atom # 10 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 21494
[09:42:47] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 21574
[09:42:47] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 8 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 23764
[09:42:47] ERROR: Explicit valence for atom # 8 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 9 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 23844
[09:42:47] ERROR: Explicit valence for atom # 9 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 18 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 23962
[09:42:47] ERROR: Explicit valence for atom # 18 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 5 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 31515
[09:42:47] ERROR: Explicit valence for atom # 5 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 17 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 32931
[09:42:47] ERROR: Explicit valence for atom # 17 N, 4, is greater than permitted
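Rather than counting failures by eye, the log can be parsed. The sketch below uses only a short excerpt of the log above and a regular expression on the "Could not sanitize" lines; the function name is hypothetical. Run over the full log as pasted, this pattern should match 17 distinct records, one fewer than the 18 implied by 282 minus 264, so it may be worth checking whether an entry is missing from the paste.

```python
import re

# Short excerpt of the RDKit log above; paste the full log to count every failure.
LOG_EXCERPT = """\
[09:42:47] Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13064
[09:42:47] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[09:42:47] Explicit valence for atom # 1 N, 4, is greater than permitted
[09:42:47] ERROR: Could not sanitize molecule ending on line 13231
[09:42:47] ERROR: Explicit valence for atom # 1 N, 4, is greater than permitted
"""

SANITIZE_ERROR = re.compile(r"ERROR: Could not sanitize molecule ending on line (\d+)")


def failed_record_lines(log_text: str) -> list[int]:
    """Return the sorted, de-duplicated SDF line numbers of rejected records."""
    return sorted({int(n) for n in SANITIZE_ERROR.findall(log_text)})


failed = failed_record_lines(LOG_EXCERPT)
print(f"{len(failed)} records failed sanitization: {failed}")
```

The line numbers point at the end of each rejected record in the SDF file, which makes it easier to open the file and inspect the offending nitrogen atoms.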
For the notebooks: my first training run had a better ROC-AUC score (0.83), but it predicted active molecules well while misclassifying inactive molecules, as can be seen in the confusion matrix here: https://github.com/Richiio/eos5cl7/blob/main/Notebooks/m3_building_a_model.ipynb We had this result even though the active/inactive balance was preserved. I hadn't tried other hyperparameters here.
This was my best result so far in the four notebooks I ran. It had a ROC-AUC score of 0.73 and predicted 15 active and inactive molecules correctly. Here are the accuracy, precision, recall and F1-score: Accuracy: 66.67% Precision: 68.18% Recall: 65.22% F1-Score: 66.55% https://github.com/Richiio/eos5cl7/blob/main/Notebooks/2nd_train_no_change.ipynb
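For reference, these metrics follow directly from a confusion matrix. The counts below (TP=15, FP=7, FN=8, TN=15) are an inference from the quoted percentages, not values read from the notebook, so treat them as an assumption. They reproduce the reported accuracy, precision and recall; the plain binary F1 they give is 66.67%, so the reported 66.55% may come from a different averaging scheme.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute accuracy, precision, recall and F1 for the positive class.

    No zero-division guard: this assumes at least one predicted and one
    actual positive.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts inferred from the percentages quoted above (an assumption).
metrics = binary_metrics(tp=15, fp=7, fn=8, tn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Reporting the raw counts next to the percentages makes it easier to see whether a model is leaning towards one class.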
The rest of the notebooks are in my repo. https://github.com/Richiio/eos5cl7/tree/main/Notebooks
Since I used SMILES2IUPAC for the conversion, I am not confident about the accuracy of the SMILES it produced, given the inconsistent IUPAC representations in the data.
The model for the second notebook has been saved pending improvements
Hi @Richiio
Thanks for this. We will go over them live in tomorrow's meeting if we have time!
Hello @Richiio, after a further look at the data columns, we don't have to look up the SMILES manually. The data includes a column named ROMol (Read-Only Molecule), which holds instances of RDKit's ROMol class. This column can be used directly to obtain SMILES representations.
If you specify the SMILES column name using the smilesName parameter (in this case, 'SMILES') when loading the file, PandasTools will automatically add the SMILES column to the DataFrame.
raw_data = PandasTools.LoadSDF(File, smilesName='SMILES')  # Create a pandas DataFrame with a 'SMILES' column.
# If the SMILES column isn't specified while loading, it can be added afterwards:
# raw_data['SMILES'] = raw_data['ROMol'].apply(lambda mol: Chem.MolToSmiles(mol) if mol is not None else None)
Meanwhile, their naive Bayesian model trained using ECFP6 fingerprints showed somewhat better results with AUC ROC = 0.8470, recall = 0.7736, precision = 0.8200, specificity = 0.7769, F1 score = 0.7961, CK = 0.5456, and MCC = 0.5467.
This performance difference is likely attributable to the dataset size discrepancy.
The warning about the molecule being tagged as 2D, but with at least one Z coordinate not zero, might have led to the failure to load the 18 molecules. Their paper mentions a manual inspection of 2D structures to detect reactive and PAINS compounds, which were then excluded.
So we have to treat the training data provided as raw data and follow the steps they did, including removal of salts.
@GemmaTuron, I understand using the dataset used in the original paper is necessary for direct comparison and validation.
But in case it's permissible, chembltools retrieved assays for Neisseria gonorrhoeae from the ChEMBL database (chembl_33). This could be an opportunity to incorporate additional data into the training set, considering that updates may have occurred since 2021, when the original paper was published.
In the original paper, the authors manually inspected the dataset and selected a conservative MIC value of 8.0 μg/mL (with unit conversions). We could consider a similar approach with the newly acquired data if we are permitted to extend the dataset. The difficulty might be the removal of false positives (compounds incorrectly labelled as active against N. gonorrhoeae in ChEMBL).
@HellenNamulinda @GemmaTuron Using the command @HellenNamulinda suggested, I was able to get the SMILES column, which validates the accuracy of our SMILES. As expected, the loading error still persisted, as seen below:
[21:52:45] Warning: molecule is tagged as 2D, but at least one Z coordinate is not zero. Marking the mol as 3D.
[21:52:45] Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13064
[21:52:45] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 1 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13231
[21:52:45] ERROR: Explicit valence for atom # 1 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13320
[21:52:45] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13419
[21:52:45] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13507
[21:52:45] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 1 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13592
[21:52:45] ERROR: Explicit valence for atom # 1 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 8 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13678
[21:52:45] ERROR: Explicit valence for atom # 8 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 11 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13764
[21:52:45] ERROR: Explicit valence for atom # 11 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 8 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 13848
[21:52:45] ERROR: Explicit valence for atom # 8 N, 4, is greater than permitted
[21:52:45] Warning: ambiguous stereochemistry - zero final chiral volume - at atom 20 ignored
[21:52:45] Explicit valence for atom # 10 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 18250
[21:52:45] ERROR: Explicit valence for atom # 10 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 21494
[21:52:45] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 21574
[21:52:45] ERROR: Explicit valence for atom # 7 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 8 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 23764
[21:52:45] ERROR: Explicit valence for atom # 8 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 9 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 23844
[21:52:45] ERROR: Explicit valence for atom # 9 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 18 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 23962
[21:52:45] ERROR: Explicit valence for atom # 18 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 5 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 31515
[21:52:45] ERROR: Explicit valence for atom # 5 N, 4, is greater than permitted
[21:52:45] Explicit valence for atom # 17 N, 4, is greater than permitted
[21:52:45] ERROR: Could not sanitize molecule ending on line 32931
[21:52:45] ERROR: Explicit valence for atom # 17 N, 4, is greater than permitted
So we can conclude that we would be working with only 264 molecules against the 282 used for training by the original authors
New results from model training with the new dataset
AUC-ROC curve: 0.81
The following results were obtained: Accuracy: 79.25% Precision: 75.86% Recall: 84.62% F1-Score: 79.89%
@GemmaTuron @HellenNamulinda This is much better than the results I got previously.
The model has been saved
@HellenNamulinda that was a very good suggestion, many thanks! @Richiio, now that you are obtaining more consistent results, try to reformat your code as we discussed today: add the folders under /framework and have one notebook per task (data processing, model training). You can save images, for example the 5-fold cross-validated ROC curves, in a /figures folder for easy review.
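To make the 5-fold cross-validation concrete, here is a dependency-free sketch of stratified fold assignment, which keeps the active/inactive ratio similar in every fold. The function names and the toy labels are illustrative assumptions; in practice scikit-learn's StratifiedKFold does the same job and would be the usual choice.

```python
import random
from collections import defaultdict


def stratified_folds(labels: list[int], n_splits: int = 5, seed: int = 0) -> list[list[int]]:
    """Assign sample indices to n_splits folds with similar class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds: list[list[int]] = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of the class.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


def train_test_indices(folds: list[list[int]], test_fold: int) -> tuple[list[int], list[int]]:
    """Use one fold as the test set and the remaining folds as the training set."""
    test = sorted(folds[test_fold])
    train = sorted(i for k, fold in enumerate(folds) if k != test_fold for i in fold)
    return train, test


# Toy labels (10 actives, 10 inactives), purely for illustration.
labels = [1] * 10 + [0] * 10
folds = stratified_folds(labels)
for k in range(len(folds)):
    train, test = train_test_indices(folds, k)
    print(f"fold {k}: {len(train)} train, {len(test)} test")
```

Each fold's test indices can then be used to select the SMILES and labels for one train/evaluate round, and the per-fold ROC curves saved to /figures.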
Hi @GemmaTuron, I've reformatted the code based on the suggestions you made in the meeting. I am currently working on resolving the error I get when incorporating the model into the hub:
TypeError: No registered converter was able to produce a C++ rvalue of type std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > from this Python object of type ExplicitBitVect
I have opened a pull request for the model.
Hi @Richiio
I've merged the PR but before I work on your code to provide feedback please:
Hi @GemmaTuron @HellenNamulinda, I got an error when retraining with 5-fold cross-validation. It is the same error that the Ersilia Model Hub produced when I made a pull request for the model.
The notebook with the error can be found on my github profile here https://github.com/Richiio/eos5cl7/blob/main/model/framework/code/02_Model_training.ipynb
Hi @Richiio The error you are getting is related to RDKIT processing of the smiles, so I suggest you look closer at what you are inputting to rdkit. It is not related to the cross validation.
ArgumentError: Python argument types in
rdkit.Chem.rdMolDescriptors.GetHashedMorganFingerprint(NoneType)
did not match C++ signature:
GetHashedMorganFingerprint(RDKit::ROMol mol, unsigned int radius, unsigned int nBits=2048, boost::python::api::object invariants=[], boost::python::api::object fromAtoms=[], bool useChirality=False, bool useBondTypes=True, bool useFeatures=False, boost::python::api::object bitInfo=None, bool includeRedundantEnvironments=False)
In your notebook, I don't understand this bit of code:
# Convert Morgan fingerprints to SMILES strings
smiles_train = ["".join(map(str, x)) for x in X_train]
smiles_test = ["".join(map(str, x)) for x in X_test]
Why do you convert the Morgan fingerprints back to SMILES? This seems to be the issue, as the smiles_train you are passing does not seem to be in the correct format. Make sure to use print statements to see what you are passing into the function. Also remember that 60 seconds is not enough for an AutoML algorithm; we would like to see what happens with longer times, and with other descriptors as well, not only Morgan. Work on these changes and let me know the outcome.
@GemmaTuron When I had multiple notebooks, I was training with both Morgan fingerprints and Ersilia embeddings, and on each training iteration Morgan performed better than the Ersilia embeddings. I decided to work strictly with Morgan, but I will test the Ersilia embeddings with 5-fold cross-validation and get back to you.
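A quick way to see what is actually being passed is to summarize the inputs before fitting. The helper below is a hypothetical diagnostic, not part of lazy-qsar or RDKit: it reports the Python types present and flags strings made only of 0s and 1s, which is what joining fingerprint bits produces.

```python
from collections import Counter


def summarize_inputs(values) -> dict:
    """Report the types present and how many strings contain only 0s and 1s."""
    type_counts = Counter(type(v).__name__ for v in values)
    bitstring_like = sum(
        1 for v in values if isinstance(v, str) and v and set(v) <= {"0", "1"}
    )
    return {"types": dict(type_counts), "bitstring_like": bitstring_like}


# Joining fingerprint bits, as in the quoted snippet, yields strings of 0s and 1s.
fingerprints = [[0, 1, 0, 0], [1, 0, 0, 1]]
joined = ["".join(map(str, bits)) for bits in fingerprints]
print(joined)
print(summarize_inputs(joined))

# Real SMILES (plus a missing value) look different.
print(summarize_inputs(["CCO", None, "c1ccccc1"]))
```

If bitstring_like is non-zero, or the types include anything other than str, the list is not a list of SMILES.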
Hi @Richiio We are exploring the space and trying to find the best model, so do not delete any until we decide what works best. Please write a good summary here of the different results once you complete the different trainings with the improvements I suggest above. Include things like how many tests have you done, the performance of each model, which model would you choose as best, how we could validate it...
Hi @GemmaTuron, I've been unable to train any models due to an error with RDKit when calling MolFromSmiles. The original maintainer described a bug, which can be found here: "https://github.com/rdkit/rdkit/issues/7036"
Even when just running the initial notebook from Ersilia, the error I described above persisted:
ArgumentError: Python argument types in
rdkit.Chem.rdMolDescriptors.GetMorganFingerprintAsBitVect(NoneType, int, int)
did not match C++ signature:
GetMorganFingerprintAsBitVect(RDKit::ROMol mol, unsigned int radius, unsigned int nBits=2048, boost::python::api::object invariants=[], boost::python::api::object fromAtoms=[], bool useChirality=False, bool useBondTypes=True, bool useFeatures=False, boost::python::api::object bitInfo=None, bool includeRedundantEnvironments=False)
Hello @Richiio, Just like @GemmaTuron asked, what exactly is this code doing?
# Convert Morgan fingerprints to SMILES strings
smiles_train = ["".join(map(str, x)) for x in X_train]
smiles_test = ["".join(map(str, x)) for x in X_test]
Remember that when you convert to Morgan fingerprints, you end up with binary vectors (0s and 1s). So that code will give you "SMILES" that are strings of 0s and 1s, which is wrong.
I don't understand why you are generating Morgan fingerprints (and then converting them to SMILES) since you are using lazy-qsar. lazy-qsar expects SMILES inputs and generates the corresponding descriptors based on the classifier you choose, for example lq.MorganBinaryClassifier (Morgan fingerprints), lq.MordredBinaryClassifier (Mordred descriptors) or lq.ErsiliaBinaryClassifier (Ersilia embeddings).
Check your previous notebook, you were doing something like below
# Filter out rows with None or invalid SMILES
train = train.dropna(subset=['SMILES'])
train = train[train['SMILES'].apply(lambda x: Chem.MolFromSmiles(x) is not None)]
# Separate features (SMILES) and target variable (EXP)
smiles_train = train[SMILES]
y_train = train[EXP]
# Initialize and fit the model
model = lq.MorganBinaryClassifier(time_budget_sec=60, estimator_list=["rf", "lgbm", "xgboost"])
model.fit(smiles_train, y_train)
The fingerprints you are generating with the function preprocess_and_generate_fingerprints can be used to train a model from scratch (say, naive Bayes), but not as inputs to a lazy-qsar model.
Hi @HellenNamulinda, that was an error in my comments. That code is just a check to ensure that the SMILES I am passing to the Morgan fingerprint conversion are strings.
The error I showed above still occurs when I just run the example notebook that was provided. The error has something to do with RDKit; someone raised an issue about it on Friday.
In the function cross_validation, I convert to Morgan fingerprints by calling convert_to_morgan_fingerprints and then perform model training.
I am getting the error from this line Chem.MolFromSmiles(x). It doesn't generate 0's and 1's as it should
Hi @Richiio
The issue you are pointing to in RDKit has nothing to do with your error, or at least not in any way that I can see directly. Can you please describe why you are pointing to this bug as the source of your error?
The function MolFromSmiles(smiles) generates an RDKit molecule object and requires that you pass a SMILES string. GetMorganFingerprintAsBitVect requires a molecule. Please revise the code and make sure the x you are passing is a SMILES string, and that what you pass to the next function is a molecule. There is plenty of information about this error if you search Google.
The line that should give you the error tip is:
rdkit.Chem.rdMolDescriptors.GetMorganFingerprintAsBitVect(NoneType, int, int)
@GemmaTuron I am getting this error because I have a nonetype in mol when I am meant to be having a vector as in [[0, 1, 0], [1,0,0]]. Something like this but instead I am passing in a None as my smiles wasn't converted to a mol by the rdkit function.
I had the same error when I just ran the example notebook provided by Ersilia. Not my personal notebook.
@Richiio
The bug report in RDKIT does not have anything to do with the issue you describe, from what I understand. Please explain that to me so I can provide further comments. You need to modify the code to make sure you are not passing a None object. Do not trust the code as it is, work on it to make sure it works.
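One way to make sure a None object never reaches the fingerprint function is to check the result of MolFromSmiles before using it. The sketch below is an assumption about how such a guard could look, not the notebook's code: the function name, radius and bit length are illustrative. It uses GetMorganFingerprintAsBitVect because that is the call in the traceback; newer RDKit releases recommend rdFingerprintGenerator instead.

```python
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors


def morgan_fingerprints(smiles_list, radius=2, n_bits=2048):
    """Return (fingerprints, failed), where failed lists inputs RDKit could not parse."""
    fingerprints, failed = [], []
    for smiles in smiles_list:
        # MolFromSmiles returns None (it does not raise) when parsing fails.
        mol = Chem.MolFromSmiles(smiles) if isinstance(smiles, str) else None
        if mol is None:
            failed.append(smiles)
            continue
        fingerprints.append(
            rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        )
    return fingerprints, failed


# "0100" is a joined bit string rather than a SMILES, and None is a missing value.
fps, failed = morgan_fingerprints(["CCO", "0100", None, "c1ccccc1"])
print(len(fps), "fingerprints;", "could not parse:", failed)
```

Printing the failed list shows exactly which inputs produced None, which is usually enough to trace the problem back to the data preparation step.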
@GemmaTuron @HellenNamulinda Model Training results with a 5Fold cross-validation
Model Training | Accuracy (%) | ROC-AUC score | Precision (%) | Recall (%) |
---|---|---|---|---|
Morgan Fingerprints, train time 600 s | 79.55 | 0.81 | 78.26 | 81.82 |
Morgan Fingerprints, train time 60 s | 84.09 | 0.85 | 77.78 | 95.45 |
Ersilia Embeddings, train time 60 s | 70.45 | 0.83 | 68.00 | 77.27 |
Ersilia Embeddings, train time 600 s | 77.27 | 0.83 | 75.00 | 81.82 |
Model Name
Growth Inhibitors of Neisseria gonorrhoeae
Model Description
The authors curated a dataset of 282 compounds from ChEMBL, of which 160 (56.7%) were labelled as active N. gonorrhoeae inhibitor compounds. They used this dataset to build a naïve Bayesian model and used it to screen a commercial library. With this method, they identified and validated two hits. We have used the dataset to build a model using Ersilia’s set of AI modelling tools.
Slug
ngonorrhoeae-inh
Tag
Antimicrobial activity, ChEMBL
Publication
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8274436/
Source Code
https://github.com/ersilia-os/lazy-qsar
License
GPL-3.0