Clean UP & Dockerization eos30gr

febielin commented 1 year ago

I see that the model test workflow did not pass, and the upload to dockerhub and upload to s3 workflows did not run.

The error that appears on the model test log concerns rdkit, so I altered the Dockerfile from a conda install rdkit --> pip install rdkit. This seems to have resolved this error.

Now, this is the next error that I am encountering on github codespaces: eos30gr_error.log. I need to look more into tensorflow to understand what is happening.

[UPDATE] This discussion suggests that this tensorflow message is not an error after all.

Fixed this error by changes 'Lists' --> 'List' in service.py

Now, the problem seems to be an empty output error.

febielin commented 1 year ago

Surprisingly, the model works on codespaces when running compound_singles.csv: eos30gr_test.csv

On the other hand, this is the error that I receive when trying the eml_canonical file:

(eos30gr) root@codespaces-e54246:/workspaces/ersilia/eos30gr/model/framework# bash run.sh . ../../../notebooks/eml_canonical.csv ../../../data/eos30gr_test_output.csv
[21:44:45] SMILES Parse Error: syntax error while parsing: abacavir
[21:44:45] SMILES Parse Error: Failed parsing SMILES 'abacavir' for input: 'abacavir'
abacavir
None
Traceback (most recent call last):
  File "./code/main.py", line 31, in <module>
    mol = standardise.run(mol)
  File "/root/miniconda3/envs/eos30gr/lib/python3.7/site-packages/standardiser/utils.py", line 99, in wrapper
    result = func(*args, **kwargs)
  File "/root/miniconda3/envs/eos30gr/lib/python3.7/site-packages/standardiser/standardise.py", line 79, in run
    mol = Chem.MolFromMolBlock(input_mol)
TypeError: No registered converter was able to produce a C++ rvalue of type std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > from this Python object of type NoneType
(eos30gr) root@codespaces-e54246:/workspaces/ersilia/eos30gr/model/framework#

FYI, there are print statements that I added in main.py, and that's why you see 'abacavir' and 'None' Screenshot 2023-07-12 at 2 53 17 PM . It's strange that abacavir is coming out null after .MolFromSmiles().

GemmaTuron commented 1 year ago

Hi @febielin

Good job on the printing statements. If you are printing the SMILES, why do we get "abacavir" instead of the actual SMILES? It will fail because we are not passing a SMILES

febielin commented 1 year ago

@GemmaTuron

I just realized–

Is it the case that using the eml_canonical file with bash run.sh will never work? Because it doesn't go through Ersilia, and therefore, it does not get parsed for only the smiles strings?

This would explain a lot.

febielin commented 1 year ago

Since I can't run it anywhere else, I've pulled the ersiliaos/shell image on Docker, and I'm doing all my testing and edits there.

I initially ran into an error with a 'libxrender' library. I resolved that by adding an extra line to the Dockerfile: RUN apt-get install -y libxrender1

Now, I am getting this error. An online search tells me that it could be related to compatibility issues of 'gensim'.

GemmaTuron commented 1 year ago

Hi @febielin !

To be sure, I'd suggest having a file with only one column, the SMILES, not the names of the drugs. For the error you are getting, could it be we are not downloading the model file correctly (or it is corrupted?). Just to be sure, the model.joblib in this case is the actual model, not a dummy file This model is already re-trained by us since the original checkpoints were not provided, see the framework/train folder. If we are running into issues with the actual model, we could just re-train it once more. Have a look at the files and let me know what you think! You should be able to run this models in a MacOS unless the Mol2Vec has some kind of incompatibility

GemmaTuron commented 1 year ago

Also, I really don't like the Mol2Vec since it has many dependencies. If we go for retraining, we might think of using a different package @miquelduranfrigola ?

febielin commented 1 year ago

I am following the troubleshooting models guide via CLI. I have created a python 3.7 env and followed the installation instructions specified in the Dockerfile. These all downloaded fine without any problems. FYI, I am using my updated Dockerfile, where everything is installed with pip and rdkit version is upgraded to 2022.3.3.

When I use bash run.sh to test the model, I get this error: eos30gr_error.log. This seems to suggest something about regenerating the protobuf code for the model. I am unfamiliar with this, but I did some research and it recommends that I use the protoc compiler to regenerate code from the .proto file(s). However, nowhere in the model repository are there .proto files. This confuses me, but I would like to investigate this further to see if this is what's causing the empty output error. Do you have any knowledge / advice about this? @GemmaTuron

[UPDATE] I've resolved this error by running the following two lines: pip uninstall protobuf and pip install protobuf==3.19. This gave me a new error, an Assertion Error. The log file is here. There are extra items printed to the terminal because of these print statement that I added to the main.py file.

Now, I need to figure out what is causing the assertion statement to fail.

febielin commented 1 year ago

Now, if I try eml_canonical using this same approach (CLI, bash run.sh), I've found a list of molecules that trigger a standardiser error. The error usually looks something like this:

Traceback (most recent call last):
  File "./code/main.py", line 30, in <module>
    mol = standardise.run(mol)
  File "/Users/febielin/miniconda3/envs/eos30gr/lib/python3.7/site-packages/standardiser/utils.py", line 99, in wrapper
    result = func(*args, **kwargs)
  File "/Users/febielin/miniconda3/envs/eos30gr/lib/python3.7/site-packages/standardiser/standardise.py", line 144, in run
    raise StandardiseException("no_non_salt")
standardiser.utils.StandardiseException: No non-salt/solvate components

and on occasion, like this:

[15:30:39] WARNING: not removing hydrogen atom without neighbors
Traceback (most recent call last):
  File "./code/main.py", line 30, in <module>
    mol = standardise.run(mol)
  File "/Users/febielin/miniconda3/envs/eos30gr/lib/python3.7/site-packages/standardiser/utils.py", line 99, in wrapper
    result = func(*args, **kwargs)
  File "/Users/febielin/miniconda3/envs/eos30gr/lib/python3.7/site-packages/standardiser/standardise.py", line 148, in run
    raise StandardiseException("multi_component")
standardiser.utils.StandardiseException: Multiple non-salt/solvate components

Once I got rid of all removed these molecules that trigger the StandardiseException from the input file, the next error I get is this index error: eos30gr_indexerror.log

The eml_canonical file doesn't make it as far as the compound_singles file.

GemmaTuron commented 1 year ago

Hi @febielin Thanks for this work. I think we should simply retrain this model again with updated packages. The standardiser error is normal, we need to ensure the model can deal with these molecules that cannot be processed. We can talk about it today in the 1:1

febielin commented 1 year ago

@GemmaTuron

I followed the instructions in the README file.

This is the error that appears after I run the script: eos30gr_lazyqsar.log. Despite this, I see that a new herg.tab file appear with data fileld in. Github does not support this file type, so I've copied the contents here for your reference.

I don't quite understand what this data represents. Is this the trained data? If so, how can I further process it such that this model outputs according to the newly trained data?

GemmaTuron commented 1 year ago

Hi @febielin Two separate things

Installation of LazyQSAR: did building the conda env give any problems? In my case for MacOS I had to get a newer version of FLAML, not the 1.0.2. Let's make sure this is working. Once the conda env is created, you can simply run
```
python
import flaml
```
to make sure everything is working
Let's go step by step, you can use a notebook if it is easier, and print the results of each step. Are you getting the data correctly? you can print the train, test and validation sets to be sure. Then, FLAML uses an automated framework to select a ML framework, you can pin it down to "rf" (random forest) only, to start easy: mdl = MorganBinaryClassifier(estimator_list=["rf"])

febielin commented 1 year ago

Hi @GemmaTuron,

(1) Maybe I am installing incorrectly. Based on the README.md instructions, these were the installation steps that I followed:

git clone https://github.com/ersilia-os/lazy-qsar.git
cd lazy-qsar
python -m pip install -e .

Where does the conda env build step come in? Please point me to any documentation / specify instructions if I am missing steps!

(2) Okay, I will get started on setting up a notebook for this. I think being able to print the sets will help me understand each step better. Just to be clear, the data is coming from TDC? Or am I using the Table_S4.xlsx file that is already present in the model?

GemmaTuron commented 1 year ago

oh I am sorry! You need to build a conda env where you want it installed, otherwise is gonna install it in your base env only. The same way you did for ersilia, you just first do:

conda create -n lazyqsar python=3.10
conda activate lazyqsar
git clone https://github.com/ersilia-os/lazy-qsar.git
cd lazy-qsar
python -m pip install -e .

GemmaTuron commented 1 year ago

I am not sure which table you point to. Just install TDC (pip install) in the same conda environment and then you can retrieve any dataset you want. You can do a colab notebook and share it here (make sure to give me edit access) and I can help you out.

GemmaTuron commented 1 year ago

so, if you successfully create the conda environment, something like this should work:

from tdc.single_pred import Tox
from sklearn.metrics import roc_curve, auc
import lazyqsar as lq

data = Tox(name = 'hERG')
print(data)

split = data.get_split()
print(split)

print(split["train"])
smiles_train = list(split["train"]["Drug"])
y_train = list(split["train"]["Y"])
print(split["valid"])
smiles_valid = list(split["valid"]["Drug"])
y_valid = list(split["valid"]["Y"])

print(y_train)
print(y_valid)

model = lq.MorganBinaryClassifier(estimator_list=["rf"]) 
model.fit(smiles_train, y_train)

preds = model.predict_proba(smiles_valid)
proba1 = preds[:,1]
print(proba1)

fpr, tpr, _ = roc_curve(y_valid, proba1)
print("AUROC", auc(fpr, tpr))

of course this has many prints, go step by step

febielin commented 1 year ago

Hi @febielin Two separate things

Installation of LazyQSAR: did building the conda env give any problems? In my case for MacOS I had to get a newer version of FLAML, not the 1.0.2. Let's make sure this is working. Once the conda env is created, you can simply run
python
import flaml
to make sure everything is working 2. Let's go step by step, you can use a notebook if it is easier, and print the results of each step. Are you getting the data correctly? you can print the train, test and validation sets to be sure. Then, FLAML uses an automated framework to select a ML framework, you can pin it down to "rf" (random forest) only, to start easy: mdl = MorganBinaryClassifier(estimator_list=["rf"])

Yes, thank you for pointing this out. When I do python import flaml, I get the following error:

(lazyqsar) febielin@Febies-MacBook-Air Ersilia % python
Python 3.10.12 (main, Jul  5 2023, 15:34:07) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import flaml
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/febielin/miniconda3/envs/lazyqsar/lib/python3.10/site-packages/flaml/__init__.py", line 2, in <module>
    from flaml.automl import AutoML, logger_formatter
  File "/Users/febielin/miniconda3/envs/lazyqsar/lib/python3.10/site-packages/flaml/automl.py", line 53, in <module>
    from flaml.default.suggest import suggest_learner
  File "/Users/febielin/miniconda3/envs/lazyqsar/lib/python3.10/site-packages/flaml/default/__init__.py", line 8, in <module>
    from .estimator import (
  File "/Users/febielin/miniconda3/envs/lazyqsar/lib/python3.10/site-packages/flaml/default/estimator.py", line 167, in <module>
    import lightgbm
  File "/Users/febielin/miniconda3/envs/lazyqsar/lib/python3.10/site-packages/lightgbm/__init__.py", line 8, in <module>
    from .basic import Booster, Dataset, Sequence, register_logger
  File "/Users/febielin/miniconda3/envs/lazyqsar/lib/python3.10/site-packages/lightgbm/basic.py", line 221, in <module>
    _LIB = _load_lib()
  File "/Users/febielin/miniconda3/envs/lazyqsar/lib/python3.10/site-packages/lightgbm/basic.py", line 206, in _load_lib
    lib = ctypes.cdll.LoadLibrary(lib_path[0])
  File "/Users/febielin/miniconda3/envs/lazyqsar/lib/python3.10/ctypes/__init__.py", line 452, in LoadLibrary
    return self._dlltype(name)
  File "/Users/febielin/miniconda3/envs/lazyqsar/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: dlopen(/Users/febielin/miniconda3/envs/lazyqsar/lib/python3.10/site-packages/lightgbm/lib/lib_lightgbm.so, 0x0006): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
  Referenced from: <F47D69E4-1594-3171-ABCC-7156C3E263E1> /Users/febielin/miniconda3/envs/lazyqsar/lib/python3.10/site-packages/lightgbm/lib/lib_lightgbm.so
  Reason: tried: '/usr/local/opt/libomp/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/usr/local/opt/libomp/lib/libomp.dylib' (no such file), '/usr/local/opt/libomp/lib/libomp.dylib' (no such file), '/usr/local/lib/libomp.dylib' (no such file), '/usr/lib/libomp.dylib' (no such file, not in dyld cache)

I will try to import a newer version of flaml

febielin commented 1 year ago

Hmm, I removed the version 1.0.2 and reinstalled lazy-qsar. setup.py was run again just flaml instead of flaml==1.0.2 but this error still presents itself.

Perhaps is this error unrelated to flaml? Or maybe is there another version that I should be specifying?

febielin commented 1 year ago

Hi @GemmaTuron,

I am still not able to find the source of this error when doing the python import flaml, but I've set up a Google Colab notebook, as you've suggested. Hopefully you have no problem accessing it. This notebook seems to work fine. The AUROC value is 0.8220424671385237.

Once again, I get the herg.tab at this path: /content/lazy-qsar/data/herg.tab. Does this herg.tab represent the newly trained data?

GemmaTuron commented 1 year ago

@febielin The notebook looks good, thanks! The remaining issues: herg.tab - I don't have this file, what is the /content folder you are mentioning? TDC downloads the data so that might be just that? I usually get a .zip file when downloading from tDC, but I get may datasets at the same time. Just a clarificationn - newly trained data does not make sense, we can have a newly trained model

About the flaml error, if you look at it it is the lightgbm, one of the ML frameworks used by FLAML, that fails. On a quick search I found this StackOverflow that could be useful.

febielin commented 1 year ago

@GemmaTuron

Sorry, I should clarify. That /content folder that I mentioned was a file in the notebook that I linked to you. Maybe this screenshot will help you understand where I am pointing you? It may also be that you have to run the notebook once to see it appear.

Also, thank you for the StackOverflow link–that was very helpful. Now, when I do the python import flaml, I no longer get the error. I take it that it works now? I'll try it out again.

(lazyqsar) febielin@Febies-MacBook-Air Ersilia % brew install libomp

Running `brew update --auto-update`...
==> Homebrew collects anonymous analytics.
Read the analytics documentation (and how to opt-out) here:
  https://docs.brew.sh/Analytics
No analytics have been recorded yet (nor will be during this `brew` run).

==> Downloading https://formulae.brew.sh/api/formula_tap_migrations.jws.json
######################################################################### 100.0%
==> Auto-updated Homebrew!
Updated 2 taps (homebrew/core and homebrew/cask).
==> New Formulae
adb-enhanced               fw                         quictls
alass                      go-feature-flag            rio
bacon                      govulncheck                ruff-lsp
bbot                       hex                        runme
bfs                        hoverfly                   scrapy
bilix                      ittapi                     sh4d0wup
blades                     libdivsufsort              sickchill
blink                      libversion                 solr@8.11
botan@2                    lr                         sqlpage
cargo-binstall             mailpit                    strip-nondeterminism
cbonsai                    mvt                        trufflehog
couchbase-shell            neonctl                    trzsz-ssh
cppinsights                ntbtls                     tzdiff
crabz                      pgrok                      webpod
crystalline                pixi                       wget2
cycode                     plog                       woof-doom
driftwood                  pop                        wpscan
dysk                       proxygen                   xe
erg                        pylyzer                    yyjson
erlang@25                  pypy3.10                   zchunk
espflash                   pypy3.9
fuc                        python-toml
==> New Casks
chatall                    glaze                      score
drata-agent                herd                       screen-studio
elektron-overbridge        maa                        showmeyourhotkeys
ente                       mycard                     ths
fedistar                   picoscope                  whisky
flexoptix                  poe

You have 2 outdated formulae installed.
==> Downloading https://formulae.brew.sh/api/cask_tap_migrations.jws.json
######################################################################### 100.0%

==> Fetching libomp
==> Downloading https://ghcr.io/v2/homebrew/core/libomp/manifests/16.0.6
######################################################################### 100.0%
==> Downloading https://ghcr.io/v2/homebrew/core/libomp/blobs/sha256:30cb6ac784e
######################################################################### 100.0%
==> Pouring libomp--16.0.6.ventura.bottle.tar.gz
==> Caveats
libomp is keg-only, which means it was not symlinked into /usr/local,
because it can override GCC headers and result in broken builds.

For compilers to find libomp you may need to set:
  export LDFLAGS="-L/usr/local/opt/libomp/lib"
  export CPPFLAGS="-I/usr/local/opt/libomp/include"
==> Summary
🍺  /usr/local/Cellar/libomp/16.0.6: 7 files, 1.7MB
==> Running `brew cleanup libomp`...
Disable this behaviour by setting HOMEBREW_NO_INSTALL_CLEANUP.
Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).
==> `brew cleanup` has not been run in the last 30 days, running now...
Disable this behaviour by setting HOMEBREW_NO_INSTALL_CLEANUP.
Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).
Removing: /Users/febielin/Library/Caches/Homebrew/gh--2.31.0... (10.0MB)
Removing: /Users/febielin/Library/Caches/Homebrew/mpfr--4.2.0-p9... (1MB)
Removing: /Users/febielin/Library/Logs/Homebrew/gmp... (64B)
Removing: /Users/febielin/Library/Logs/Homebrew/mpfr... (64B)
Removing: /Users/febielin/Library/Logs/Homebrew/lz4... (64B)
Removing: /Users/febielin/Library/Logs/Homebrew/gcc... (64B)
Removing: /Users/febielin/Library/Logs/Homebrew/xz... (64B)
Removing: /Users/febielin/Library/Logs/Homebrew/zstd... (64B)
Removing: /Users/febielin/Library/Logs/Homebrew/isl... (64B)
Removing: /Users/febielin/Library/Logs/Homebrew/libmpc... (64B)
(lazyqsar) febielin@Febies-MacBook-Air Ersilia % python
Python 3.10.12 (main, Jul  5 2023, 15:34:07) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import flaml
>>>

GemmaTuron commented 1 year ago

ah I see now, this must be the dataset that is downloaded from TDC. About flaml, indeed it should work now

GemmaTuron commented 1 year ago

@febielin Have you been able to run the lazy qsar package?

febielin commented 1 year ago

Hi @GemmaTuron,

Once my flaml was working, I was able to successfully run these few lines of code with an .ipynb file. Again, I get an AUROC ~0.807.

Is this what you mean by 'run the lazy qsar package'?

GemmaTuron commented 1 year ago

Hi @febielin

Sorry if I was not clear, seems Lazy QSAR is working and the model for testing gave a good AUROC, so we can apply the same method to the eos30gr data and see what performance we get if we use a train test split? For that, you can use the sklearn package, use the train_test split function and train one model following the same rationale you did for the example data. Let me know if you need help in that!

febielin commented 1 year ago

Hi @GemmaTuron,

I've created another notebook in the lazyqsar conda environment. I want to replicate this process but with the eos30gr data. From what I can gather, the data is coming from a very large excel sheet called Table_S4.xlsx in model/framework/train/data. The existing KerasTuners files did processing on these files to gather training, validation, and testing files as .sdf files, and additional featurization into .csv files. The approach I have taken is to use these .csv files that have already been split.

Here is the notebook that I have so far:

import pandas as pd
import numpy as np
import joblib
import shutil
import lazyqsar as lq
from rdkit import Chem
from sklearn.metrics import roc_curve, auc

def process_file(file_name):
    column = "activity80"
    df = pd.read_csv(file_name)
    cols = [c for c in list(df.columns) if c.startswith("mol2vec-")]
    X = np.array(df[cols])
    y = np.array(df[column]).reshape(-1, 1) 
    print("DataFrame shape:", df.shape)  
    print("X shape:", X.shape)  
    return X, y

Xtr, ytr = process_file("output_file/trainingset_mol2vec.csv")
Xvl, yvl = process_file("output_file/validationset_mol2vec.csv")
Xte, yte = process_file("output_file/testset_mol2vec.csv")

X_train = np.vstack([Xtr, Xvl])
y_train = np.vstack([ytr, yvl])
X_test = Xte
y_test = yte

model = lq.MorganBinaryClassifier(estimator_list=["rf"])
model.fit(X_train, y_train)

preds = model.predict_proba(smiles_valid)
proba1 = preds[:,1]
print(proba1)

fpr, tpr, _ = roc_curve(y_valid, proba1)
print("AUROC", auc(fpr, tpr))

This is the current error that I am encountering. The error is specifically in the mol = Chem.MolFromSmiles(smi) line. I already installed rdkit into this conda environment and imported it, but it's still giving me this error.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[5], line 2
      1 model = lq.MorganBinaryClassifier(estimator_list=["rf"])
----> 2 model.fit(X_train, y_train)

File [~/Desktop/Ersilia/lazy-qsar/lazyqsar/binary/morgan.py:58](https://file+.vscode-resource.vscode-cdn.net/Users/febielin/Desktop/Ersilia/eos30gr/model/framework/train/~/Desktop/Ersilia/lazy-qsar/lazyqsar/binary/morgan.py:58), in MorganBinaryClassifier.fit(self, smiles, y)
     56 def fit(self, smiles, y):
     57     if self._automl:
---> 58         self.fit_automl(smiles, y)
     59     else:
     60         self.fit_default(smiles, y)

File [~/Desktop/Ersilia/lazy-qsar/lazyqsar/binary/morgan.py:27](https://file+.vscode-resource.vscode-cdn.net/Users/febielin/Desktop/Ersilia/eos30gr/model/framework/train/~/Desktop/Ersilia/lazy-qsar/lazyqsar/binary/morgan.py:27), in MorganBinaryClassifier.fit_automl(self, smiles, y)
     25 def fit_automl(self, smiles, y):
     26     model = AutoML(task="classification", time_budget=self.time_budget_sec)
---> 27     X = np.array(self.descriptor.fit(smiles))
     28     y = np.array(y)
     29     if self._reduced:

File [~/Desktop/Ersilia/lazy-qsar/lazyqsar/descriptors/descriptors.py:372](https://file+.vscode-resource.vscode-cdn.net/Users/febielin/Desktop/Ersilia/eos30gr/model/framework/train/~/Desktop/Ersilia/lazy-qsar/lazyqsar/descriptors/descriptors.py:372), in MorganDescriptor.fit(self, smiles)
    371 def fit(self, smiles):
--> 372     X = morgan_featurizer(smiles)
    373     self.features = ["fp-{0}".format(i) for i in range(X.shape[1])]
    374     return pd.DataFrame(X, columns=self.features)

File [~/Desktop/Ersilia/lazy-qsar/lazyqsar/descriptors/descriptors.py:361](https://file+.vscode-resource.vscode-cdn.net/Users/febielin/Desktop/Ersilia/eos30gr/model/framework/train/~/Desktop/Ersilia/lazy-qsar/lazyqsar/descriptors/descriptors.py:361), in morgan_featurizer(smiles)
    359 X = np.zeros((len(smiles), NBITS))
    360 for i, smi in enumerate(smiles):
--> 361     mol = Chem.MolFromSmiles(smi)
    362     X[i,:] = d.calc(mol)
    363 return X

TypeError: No registered converter was able to produce a C++ rvalue of type std::__1::basic_string, std::__1::allocator> from this Python object of type numpy.ndarray

GemmaTuron commented 1 year ago

Hi @febielin

I would not take the .csv files, since these are the output_files of running the mol2vec.py and contain the featurization of mol2vec, which can be confusing, since we are going to use another featurization. I would convert the .sdf files in input_files to .csv files (there is an RdKIT function for that) and use those as input, if that makes sense? therefore we will remove any need for mol2vec.

febielin commented 1 year ago

Updates:

I've written a python script to process the .sdf files into .csv files. This is what worked (code pasted below since .py file not supported):

from rdkit import Chem
from rdkit.Chem.ChemUtils import SDFToCSV

# input file format: *.sdf
# output file format: *.csv

# test
test_out = open('output_file/testset.csv', 'w' )
test_in = Chem.SDMolSupplier('input_file/testset.sdf')
SDFToCSV.Convert(test_in, test_out, keyCol=None, stopAfter=- 1, includeChirality=False, smilesFrom='')
test_out.close()
# validate
vldt_out = open('output_file/validationset.csv', 'w' )
vldt_in = Chem.SDMolSupplier('input_file/validationset.sdf')
SDFToCSV.Convert(vldt_in, vldt_out, keyCol=None, stopAfter=- 1, includeChirality=False, smilesFrom='')
vldt_out.close()
# train
train_out = open('output_file/trainingset.csv', 'w')
train_in = Chem.SDMolSupplier('input_file/trainingset.sdf')
SDFToCSV.Convert(train_in, train_out, keyCol=None, stopAfter=- 1, includeChirality=False, smilesFrom='')
train_out.close()

These are the resulting .csv files: testset.csv, trainingset.csv, and validationset.csv. Since we are no longer using mol2vec featurization, these csv files looks a little different. (In case you'd like to compare, here were the old mol2vec .csv files: testset_mol2vec.csv, trainingset_mol2vec.csv, validationset_mol2vec.csv.)

This means the kerastuner process_file function doesn't quite apply anymore (shown below), as the mol2vec- columns are gone:

def process_file(file_name):
    column = "activity80"
    df = pd.read_csv(file_name)
    df[df[column].isnull()] = 0
    cols = [c for c in list(df.columns) if c.startswith("mol2vec-")]
    X = np.array(df[cols])
    y = np.array(df[column])
    return X, y.reshape(-1,1)

Based on my understanding, X is the input feature columns generated by mol2vec and y is the column titled "activity80." In this case, with the current .csv files which lack mol2vec features, what would be "X"? Am I missing a step here, such as an alternative featurization step? Or, in the case of the example lazy-qsar model which we trained with TDC data, the input was the SMILES. Would the SMILES be "X"?

Similarly, I am wondering if the target variable here is activity80? If so, why is this the column that is being chosen? I don't think I have a firm grasp on this concept.

Here is a very rough Colab notebook which I've been trying to figure out all day. At this point, I think it would be helpful for me to take a step back and understand:

what X and Y are / should be
what the activity values represent (and why activity80 is the important variable in this case)
how to deal with NaN values, what do they represent, and how do I resolve them (by overwriting them with 0's)

Sorry for the long thread, hope this explains my current progress well enough.

GemmaTuron commented 1 year ago

Hi @febielin ,

Let me answer step by step:

Activity values We need to go back to the original publication to check exactly what are the different activity thresholds. There was a slight misinterpretation I think of the values used to retrain the model. Please read the paper carefully and check the supplementary data, but, in summary, the authors use IC50<10uM as cut-off and then take several different negative values: all that are => 10 uM, => 20uM, =>60uM, =>80uM and => 100uM. By making the distance between actives and inactives bigger, in principle you facilitate model training, because you don't use molecules that are close to the cut-off and therefore could be confusing. Sup Table 3 is helpful to understand that, also why there are so many Nan values in some columns. I would just go with the first cut-off, where positive hERG blockers (1) are those with IC50<10uM. and negative hERG blockers (0) are those with IC50=>10uM. Hence, the first column, activity 10

X values We always need to featurize molecules, and there are different ways of doing so. This step is embedded within LazyQSAR for simplicity, if you look into the code of Lazy QSAR, you will see we have different featurizers, which one did you use for the TDC test? -- I'll let you revise the code before answering that.

febielin commented 1 year ago

Since the TDC code uses MorganBinaryClassifer, I believe the featurizer I use is MorganDescriptor. Same goes for this current model that I am working on, for which I am also using lq.MorganBinaryClassifier().

Thank you for the clarification! This really helped me understand the significance behind the different activity values. It is a good thing we are retraining the model to now use activity10. I have since followed the same steps that I used for the TDC data, and it seems to work well for this model. I was able to get AUROC 0.9794817164338503 by the end. Please review this Colab Notebook.

GemmaTuron commented 1 year ago

Hi @febielin !

Indeed, we are using Morgan at this step. Thanks for sharing the code, it looks good! A few pointers:

The autoML time_budget is set to very low by default, you can augment it to 10 minutes for example (600 seconds), when instantiating it lq.MorganBinaryClassifier(estimator_list=["rf"], time_budget=600). This will hopefully remove the warning:

[flaml.automl: 07-31 04:21:28] {2604} WARNING - Time taken to find the best model is 76% of the provided time budget and not all estimators' hyperparameter search converged. Consider increasing the time budget.
WARNING:flaml.automl:Time taken to find the best model is 76% of the provided time budget and not all estimators' hyperparameter search converged. Consider increasing the time budget

We should also check the performance on the test set, not only the validation
We could use another descriptor, for example ErsiliaCompoundEmbeddings, to compare the performance

If all this works fine, we can collate all the data and train a final model, save it and add it to the repo!

febielin commented 1 year ago

Thanks @GemmaTuron,

I am still working on this with both the Morgan and the Ersilia Compound Embedding. I just wanted to share that even with 600 seconds as the time budget, I get this warning:

[flaml.automl: 07-31 15:00:02] {2604} WARNING - Time taken to find the best model is 96% of the provided time budget and not all estimators' hyperparameter search converged. Consider increasing the time budget.
WARNING:flaml.automl:Time taken to find the best model is 96% of the provided time budget and not all estimators' hyperparameter search converged. Consider increasing the time budget.

I will continue playing around with the time budget parameter to hopefully get rid of that warning.

febielin commented 1 year ago

Update:

Morgan: https://colab.research.google.com/drive/1eeVFRrQzxd1SA5Q59C-8B6_DdZUW2Jbq#scrollTo=Mu41yIk5MG3t AUROC 0.9957695224059123 Time taken to find the best model: 463.1529507637024
Ersilia Compound Embedding: https://colab.research.google.com/drive/1eimXejWN3APYwxMjYhxIMMd64pNTLXlx?usp=sharing AUROC 0.9908008137895681 Time taken to find the best model: 1552.318109512329

Both descriptors work and give good AUROC values, but the Morgan Descriptor has far better performance.

GemmaTuron commented 1 year ago

Hi @febielin Good start. Just to be more precise,model perfomance refers usually to AUROC values, not the fact that it takes longer to train the model. I think both models will be equally fast in making predictions. These AUROCS are very high, which might indicate overfitting, but before going into that, a couple of comments:

Can you explain me this line of code? y_train = [0 if value != value else value for value in y_train] - please revise what are the NaN values in the dataset, and how we should deal with them And calculate the AUROCS for the test set as well to compare results

febielin commented 1 year ago

Hi Gemma,

I have made the following changes to the Colab notebook:

added performance for the test set
removed all rows from test, training, and validation sets that include NaN values in the activity10 column

Please review and let me know if this follows what we discussed in today's meeting.

GemmaTuron commented 1 year ago

Hi @febielin !

Great, yes that is what I mean. I suggest, when removing rows, to print, instead of the whole dataframe, just its shape, so you can see how many molecules we dropped:

train = pd.read_csv('trainingset.csv')
print(train.shape)
train = train[~train["activity10"].isna()]
print(train.shape)

Now we can train a full model - collating the three datasets, train, test and valid, and saving the model as a joblib file so we can load it and make predictions

febielin commented 1 year ago

Okay, I'm doing some investigative work into the training, validation, and test set. I've identified one smiles O=C1CN(CCc2ccc(F)cc2)CCN1C1CCc2cc(CN3CCS(=O)(=O)CC3)ccc2C1 that is present in both the training and validation set. This makes me wonder if there is more overlap between the three sets. I will spend some more time looking into this.

[UPDATE] Currently, I have this Colab notebook to identify duplicates . I am trying to filter for duplicate smiles between the three sets, as well as within the same set.

Between the sets: As mentioned above, I have identified a smiles that is present in both the training and validation set. A simple CTRL-f shows that they are both in the 'SMILES' column, yet they have different 'Smiles' values. By definition, is this considered the same or different entries? Also, what is the difference between the 'SMILES' and the 'Smiles' column?

Within the sets: There are several duplicates that could be seen in the colab notebook, but I am noticing that the activity values differ. For example, there are multiple entries of O=C(NS(=O)(=O)c1ccc(cc1)OCCCN1CCCC1)Nc1ccc(cc1)C(=O)C:

SMILES	No.	Smiles	activity10	activity20	activity40	activity60	activity80	activity100
CC(=O)c1ccc(NC(=O)NS(=O)(=O)c2ccc(OCCCN3CCCC3)cc2)cc1	D3196	O=C(NS(=O)(=O)c1ccc(cc1)OCCCN1CCCC1)Nc1ccc(cc1)C(=O)C	nan	0.0	nan	nan	nan	nan
CC(=O)c1ccc(NC(=O)NS(=O)(=O)c2ccc(OCCCN3CCCC3)cc2)cc1	D3196	O=C(NS(=O)(=O)c1ccc(cc1)OCCCN1CCCC1)Nc1ccc(cc1)C(=O)C	nan	nan	0.0	nan	nan	nan
CC(=O)c1ccc(NC(=O)NS(=O)(=O)c2ccc(OCCCN3CCCC3)cc2)cc1	D3196	O=C(NS(=O)(=O)c1ccc(cc1)OCCCN1CCCC1)Nc1ccc(cc1)C(=O)C	nan	nan	nan	0.0	nan	nan
CC(=O)c1ccc(NC(=O)NS(=O)(=O)c2ccc(OCCCN3CCCC3)cc2)cc1	D3196	O=C(NS(=O)(=O)c1ccc(cc1)OCCCN1CCCC1)Nc1ccc(cc1)C(=O)C	nan	nan	nan	nan	0.0	nan

What is the purpose of entries like this? Should these considered duplicates?

GemmaTuron commented 1 year ago

Hi @febielin

Since we are only considering the column activity10, these molecules won't in fact be in the training sets at all, since they won't be taken into acocunt. It seems they collated individual rows for each molecule if it has data on the different cut off values. The SMILES columns is likely due to smiles cleaning, i.e converting them to canonical form. They both represent the same molecule, which you can check if you draw them with any software. I think this was done by us already, or are these datasets coming from the publication directly?

febielin commented 1 year ago

Hi @GemmaTuron,

Alright, since they are the same molecule, I assume I could use either column and it would work the same. This data I obtained from the current model repo. All I did was convert it from .sdf to .csv with RDKit.

Also, you make a good point that these SMILES wouldn't even make it into the dataframe. I've edited the Colab notebook to make sure that the sets we're using only include the non-null activity10 rows. And according to my notebook, it would seem to confirm that there are not duplicates within or between sets!

I believe at this point, we are ready to do one final training and save the model as a joblib file. What exactly does collating the data entail? Does it mean storing the training, validation, and test sets in the
checkpoints folder?

GemmaTuron commented 1 year ago

Hi @febielin Yes, I think we can dot he final model training. For the final model training we need to use ALL available data, so put in a single df the test, validation and train sets and use all of them for the training of the final model, which should be saved as a .joblib. You can use any SMILES to test the saved model once finished the training

febielin commented 1 year ago

@GemmaTuron

[UPDATE] I have done the final model retraining by concatenating all of the test, validation, and train sets and training the model via the Morgan Binary Classifier. I saved it as a .joblib file in the checkpoints folder. The .sdf and .csv files are available in the framework/train folder.

Before I can test with any SMILES, I have to rewrite the main.py file because it still references mol2vec as the featurizer. In this case, I need to restructure it to use MorganDescriptor.

febielin commented 1 year ago

@GemmaTuron

I am trying to modify the main.py file to fit our newly trained model with lazy-qsar. The issue that I am running into is that all of my predictions are coming out to 0.0. I tried all of the eml SMILES–all zeros. When I modify the file as I have below,

mdl_ckpt = os.path.join(root, "..", "..", "checkpoints", "model.joblib")
model = joblib.load(mdl_ckpt)

...

new_smiles = ["CCN", "CCO"]
predictions = model.predict(new_smiles)
probabilities = model.predict_proba(new_smiles)

print("Predictions:", predictions)
print("Probabilities:", probabilities)

I get the following:

Predictions: [0. 0.]
Probabilities: [[0.93267327 0.06732673]
 [0.9210396  0.0789604 ]]

It seems that all my predictions are 0, but I am getting probabilities.

After doing some research, I realize it's because these SMILES are not being featurized. I know we used Morgan.py to featurize our molecules, but this was embedded within the lazy-qsar code. Since we already have our model checkpointed, how can we isolate just the featurization step of Morgan and then use the trained model to make predictions?

This is a little snippet of what ChatGPT recommends:

mdl_ckpt = os.path.join(root, "..", "..", "checkpoints", "model.joblib")
model = joblib.load(mdl_ckpt)

classifier.model = model.model
classifier._n_pos = model._n_pos
classifier._n_neg = model._n_neg
classifier._auroc = model._auroc
classifier.meta = model.meta

with open(os.path.join(tmp_dir, "input.csv"), 'r') as csvfile:
    df = pd.read_csv(os.path.join(tmp_dir, "input.csv"), skiprows=[0])

X_featurized = classifier.descriptor.fit(list(df))
print(X_featurized)

y = model.predict(X_featurized)

but this provided solution is giving me the following error:

(eos30gr) febielin@Febies-MacBook-Air framework % bash run.sh . ~/Desktop/Ersilia/partial_eml/eml_smiles.csv -o ~/Desktop/Ersilia/eos30gr_test.csv
[12:13:06] Can't kekulize mol.  Unkekulized atoms: 0 1 3 4 7 9
[12:13:07] Can't kekulize mol.  Unkekulized atoms: 0 1 3 5 8 10
   fp-0  fp-1  fp-2  fp-3  fp-4  ...  fp-2043  fp-2044  fp-2045  fp-2046  fp-2047
0   0.0   0.0   0.0   0.0   0.0  ...      0.0      0.0      0.0      0.0      0.0

[1 rows x 2048 columns]
[12:13:09] SMILES Parse Error: syntax error while parsing: fp-0
[12:13:09] SMILES Parse Error: Failed parsing SMILES 'fp-0' for input: 'fp-0'
Traceback (most recent call last):
  File "./code/main.py", line 67, in <module>
    y = model.predict(X_featurized)
  File "/Users/febielin/miniconda3/envs/eos30gr/lib/python3.7/site-packages/lazyqsar/binary/morgan.py", line 72, in predict
    X = np.array(self.descriptor.transform(smiles))
  File "/Users/febielin/miniconda3/envs/eos30gr/lib/python3.7/site-packages/lazyqsar/descriptors/descriptors.py", line 377, in transform
    X = morgan_featurizer(smiles)
  File "/Users/febielin/miniconda3/envs/eos30gr/lib/python3.7/site-packages/lazyqsar/descriptors/descriptors.py", line 362, in morgan_featurizer
    X[i,:] = d.calc(mol)
  File "/Users/febielin/miniconda3/envs/eos30gr/lib/python3.7/site-packages/lazyqsar/descriptors/descriptors.py", line 353, in calc
    v = rd.GetHashedMorganFingerprint(mol, radius=self.radius, nBits=self.nbits)
Boost.Python.ArgumentError: Python argument types in
    rdkit.Chem.rdMolDescriptors.GetHashedMorganFingerprint(NoneType)
did not match C++ signature:
    GetHashedMorganFingerprint(RDKit::ROMol mol, unsigned int radius, unsigned int nBits=2048, boost::python::api::object invariants=[], boost::python::api::object fromAtoms=[], bool useChirality=False, bool useBondTypes=True, bool useFeatures=False, boost::python::api::object bitInfo=None, bool includeRedundantEnvironments=False)

febielin commented 1 year ago

Hi @GemmaTuron,

I followed your advice from today's meeting, specifically these steps:

mdl.fit
mdl.save
mdl.predict_proba
y = y[:,1]

I did some tests with bash run.sh on CLI, and here are the results. Some files I ran twice simply to confirm consistent results:

compound_single.csv: output
compound_singles.csv: output1, output2
eml_smiles.csv: output1, output2 codespaces_output

Note: For the eml runs, output1 and output2 use the modified list of eml strings (only the ones that can be processed by standardiser), while codespaces_output uses the entire eml_file. That's the reason why there are a lot more outputs that are 0.

Also, I get this message:

(eos30gr) febielin@Febies-MacBook-Air framework % bash run.sh . ~/Desktop/Ersilia/partial_eml/eml_smiles.csv ~/Desktop/Ersilia/eos30gr_eml_2.csv
[10:28:25] Can't kekulize mol.  Unkekulized atoms: 0 1 3 4 7 9
[10:28:26] Can't kekulize mol.  Unkekulized atoms: 0 1 3 5 8 10

Should this raise concern?

[UPDATE] I will be submitting a PR for this today.

GemmaTuron commented 1 year ago

Hi @febielin

Good job! The unkekulized warning is simply a message from the standardiser while trying to sanitise molecules. If it can't, it will simply leave them as none, that is why you get some empty results. I think it is good to go!

GemmaTuron commented 1 year ago

Hi @febielin The changes are merged, as you will see I've modified the metadata to more accurately reflect what we did, and added you as the model contributor. One last piece that I think is not needed in the main.py is the conversion to sdf files, since this was only because the training set was provided in this format, but model users will simply pass -csv files already, do you agree? I'm referring to this bit:

mols = []
for i, smi in enumerate(smiles):
    mol = Chem.MolFromSmiles(smi)
    mol = standardise.run(mol)
    if mol is not None:
        smi = Chem.MolToSmiles(mol)
        mol = Chem.MolFromSmiles(smi)
        mol.SetProp("MoleculeIdentifier", "id-{0}".format(i))

    mols += [mol]

sdfile = os.path.join(tmp_dir, "input.sdf")
with open(sdfile, 'w') as file:
    writer = Chem.SDWriter(sdfile)
    for mol in mols:
        if mol is not None:
            writer.write(mol)
    writer.close()

# convert from SDF to CSV file
sdfile_in = Chem.SDMolSupplier(sdfile)
csvfile = open(os.path.join(tmp_dir, "input.csv"), 'w')
SDFToCSV.Convert(sdfile_in, csvfile, keyCol=None, stopAfter=- 1, includeChirality=False, smilesFrom='')
csvfile.close()

febielin commented 1 year ago

@GemmaTuron

Yes, I agree! Let me make the edits to main.py and I will submit a final PR. Thank you for pointing this out.

[UPDATE] PR: https://github.com/ersilia-os/eos30gr/pull/5

febielin commented 1 year ago

All the actions have passed! This model is ready for testing.

GemmaTuron commented 1 year ago

I've tested the model and it works fine, good job! I'll close this as completed

ersilia-os / eos30gr

Clean UP & Dockerization eos30gr #1