ersilia-os / eos9sa2

Drug-likeness prediction based on Bayesian neural networks
GNU General Public License v3.0
0 stars 0 forks source link

Clean UP & Dockerization of eos9sa2 #1

Closed GemmaTuron closed 1 year ago

GemmaTuron commented 1 year ago

Hi @ZakiaYahya

Please check that the model is up to date with the new version of the template and all the workflows are in place.

ZakiaYahya commented 1 year ago

Hello @GemmaTuron, I have forked and cloned the repo, did necessary changes and when i tried to run run.sh i got an error. In order to check whether the model is working before the changes, i cloned it again and run run_predict.sh and got the same error i.e.

Using TensorFlow backend. /home/zakia/miniconda3/envs/model2/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint8 = np.dtype([("qint8", np.int8, 1)]) /home/zakia/miniconda3/envs/model2/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint8 = np.dtype([("quint8", np.uint8, 1)]) /home/zakia/miniconda3/envs/model2/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint16 = np.dtype([("qint16", np.int16, 1)]) /home/zakia/miniconda3/envs/model2/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) /home/zakia/miniconda3/envs/model2/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint32 = np.dtype([("qint32", np.int32, 1)]) /home/zakia/miniconda3/envs/model2/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. np_resource = np.dtype([("resource", np.ubyte, 1)]) processed array of 5 compounds in 0.083587 seconds 208 Traceback (most recent call last): File "/home/zakia/ersilia/eos9sa2/model/framework//code/main.py", line 40, in <module> x = _normalize(data, idx, True) File "/home/zakia/ersilia/eos9sa2/model/framework/model/data_preprocess.py", line 110, in _normalize x=np.nan_to_num((x-mu)/std) ValueError: operands could not be broadcast together with shapes (5,123) (208,)

I passes the input_file of 5 smiles string but the code also take some data from framework/model/data/ folder i.e. zinc15_nondrugs_sample_rdkit_mu.npz and zinc15_nondrugs_sample_rdkit_std. The smiles string after vectorization(converted into it's molecular representation using rdkit) have len(5,123) but the zinc data mean and std values have len(208,). So, due to conflict in lens the following operation gave me above error x=np.nan_to_num((x-mu)/std)

GemmaTuron commented 1 year ago

Hi @ZakiaYahya

In order to solve this, you need to understand why the functions are giving different outputs. the best is for you to print the intermediate steps and try to understand the _normalize function for example, and identify where is the bug coming from. The shape of 208,0 a priori does not make sense if it is a vectorization of several molecules. Have you checked the datasets manually and tried to reproduce the steps? I suggest to try to get as much information as possible to show it to your mentor @DhanshreeA in tomorrow's meeting

ZakiaYahya commented 1 year ago

Hello @GemmaTuron,

Yes, i use print commands a lot to check how data molecules are converted into descriptors. the descriptor len is (5,123) because i'm passing five smiles as an input. The (208,) is not the len of descriptors, It's the mean and standard deviation vector length of already trained Zinc drugs+nondrugs dataset. They are using means and standard deviation of those already trained dataset to normalize the input data that we are passing to the model for prediction. I'm further digging into the code by seeking help from it's original github code and paper. I'll discuss it with @DhanshreeA in tomorrow's 1:1 meeting if the issue still resides. Thanks.

ZakiaYahya commented 1 year ago

I'm also trying to fetch model eos9sa2 from Ersilia Model Hub but the fetch fails there as well. I'm working on it. Thanks.

DhanshreeA commented 1 year ago

Hey @ZakiaYahya can you post the error log here for the failed fetching from ersilia?

ZakiaYahya commented 1 year ago

Hello @miquelduranfrigola @GemmaTuron @DhanshreeA I'm running into a problem of mismatched length while performing normalization of descriptors using mean and standard deviation values coming from already trained ZINC dataset. The length of mean and standard deviation is (208, ) while the length of smiles converted to descriptors using rdkit is (5,123) because i'm passing 5 smiles strings as an input. Now the question is that different rdkit has different descriptor lengths which I and @DhanshreeA finds out during 1:1 meeting today

(1) Original repo uses Rdkit 2017 which is now absolete and no longer available.

(2) Ersilia forked repo uses Rdkit 2021. Now in dockerfile it uses conda install -c conda-forge rdkit=2021.03.4 command to install rdkit but i've tried it a lot and it hangs at "collecting metadata". So, i installed the same version using Rdkit pypi. And that version has descriptor length are 123.

(3) Tried Edkit latest 2023 version and the length of descriptors are 209.

Problem: But i want descriptor len of 208 so it aligns with mean and standard deviation vector length for performing normalization. Should i tried varios versions and take one which has 208 len or we can do some changes in code that reduces or resize the len of mean and standard deviation according to smiles descriptor length. What is more appropriate to do?

DhanshreeA commented 1 year ago

Hey @ZakiaYahya like we discussed today, let's go through the RDKit documentation/release notes to figure out which version might have 208 descriptors. Also, can you confirm that when you fetch the model from ersilia it's failing with the same error?

ZakiaYahya commented 1 year ago

Right @DhanshreeA, i'll look into it. No, it's not failing at the same error , it fails due to some dependancy clashes. I'm reproducing the error as i didn't save it in a log file. Once it done, i'll share the log file with you.

ZakiaYahya commented 1 year ago

@DhanshreeA
I tried fetching it from ersilia again, but it got hangs at somewhere in between, here's the output log

eos9sa2_log.log

If i reproduced the error, i'll share it with you.

GemmaTuron commented 1 year ago

Hi @ZakiaYahya,

Please can you explain what you mean by it got hangs at somewhere in between? Identify the step that is failing and try to trace back the origin of the error.

To help more, the model fetch in my MacBook failed, here I am attaching the error log, please check if you are getting the same error. I'll try to reproduce it in my Ubuntu machine as well out.log

ZakiaYahya commented 1 year ago

@GemmaTuron I try to fetch it multiple times and it didn't proceed after Running bash /tmp/ersilia-3i_ms40b/script.sh > /tmp/ersilia-ovftkhab/command_outputs.log 2>&1 although i leave it overnight as well. It seems like it stucks while installing dependancies. Now, i'm trying to fetch it on Colab, let's see what error it produces on Colab, will let you know.

GemmaTuron commented 1 year ago

Hi @ZakiaYahya

I suggest opening the temporal file to see what is going on. The temporal files are deleted after usage, so the best is while it is running, to open a second terminal and: vim /tmp/ersilia-ovftkhab/command_outputs.log #the tmp file name that is being created this will allow you to see in real time the error log as it is being created

GemmaTuron commented 1 year ago

Something to try as well is to manually reproduce the steps, completely outside Ersilia:

GemmaTuron commented 1 year ago

Hi @ZakiaYahya

A few pointers. I have manually cloned the model repository, and created a conda environment with python 3.7. I have manually installed the requirements (changing the rdkit version because that old one was giving me problems). Before making any changes anywhere I try to run the model from the framework folder bash run_predict.sh . ~/test.csv ~/out.csv

Please reproduce this and take it from there

ZakiaYahya commented 1 year ago

Hello @GemmaTuron, I did the following steps to get the results (1) Yes, the old RDkit version causes probem so i installed it using pip pypi. (2) I have created conda environment and installed required packages as well. (3) But the code has some flaws, like in main.py it took means and standard deviation of zinc15_nondrugs_sample_rdkit but it should take the means and standard deviation vectors of smiles that we passed. I go through the original repo and they clearly said that we used zinc15_nondrugs_sample_rdkit means and standard deviations as an example. And these files are hardcoded inside the code not passed an argument. For your info i'm attaching that piece of screenshot

Image

(4) After passing the generated means and standard deviation vectors of the smiles that i'm passing as an input both using the original repo bash run_predict.sh . ~/test.csv ~/out.csv as well as using inside ersilia ersilia -v fetch eos9sa2 --repo_path /home/zakia/ersilia/eos9sa2 , the model is working. Here's the input and output results

output.csv input.csv

ZakiaYahya commented 1 year ago

@GemmaTuron I have a meeting with @DhanshreeA in a half an hour, once i discuss with her all the things, after that i'll do any changes. Thanks

ZakiaYahya commented 1 year ago

Hi @GemmaTuron How can we check or evalutae that the output producing from the model is correct? is there anyway? I want to check whether the output i'm getting is correct or not

GemmaTuron commented 1 year ago

Hi @ZakiaYahya

Please, before merging these changes let's clarify about the normalization step. The normalization should always be done against the same parameters, so that the results of the different runs of the model are comparable. If I understand it correctly, you changed the normalization to the values of the smiles you are passing. I think we should use the normalization for the sample molecules that is provided in the original repository.

ZakiaYahya commented 1 year ago

Hello @GemmaTuron, @DhanshreeA

(1) Yeh right, I had made changes just to check whether the model is working or not. In normalization step, they're using the mean and standard deviation of another dataset, i just use the mean and std of the samples i'm passing in normalization step.
(2) I'm now working on RDkit sample size issue as well to find out which RDkit version actually returns (208, ) size descriptor vector meanwhile waiting for @miquelduranfrigola because he said he knows how to tackle that issue. (3) Once I make sure whether is it RDkit issue or actually wrong passing of mean and std vector, i'll then make the pull request but before that i'll let you know Gemma. Thanks.

GemmaTuron commented 1 year ago

Hi @ZakiaYahya About the rdkit version, I am confused since above you show that the model works, so is this solved? which rdkit version are you using?

ZakiaYahya commented 1 year ago

Hello @GemmaTuron Yes, i have found the version of RDkit which actually returns the descriptor length of (208, ). Without doing any changes in the code, with version of RDkit==2022.3.1b1 the code is working perfectly fine both locally and with ersilia --repo_path as well. Do i open the PR now?? I have a short meeting with @miquelduranfrigola today on the RDkit different versions issue. I'll discuss with him too.

GemmaTuron commented 1 year ago

Sure, it would be great if you can prepare a short explanation about the RDkit issue for your presentation on Thursday. Will be of great help for the other interns.

Regarding the normalization step, as I said earlier the normalisation MUST be done against the same parameters always, so the original code where they use the zinc_nondrugs dataset is correct, do not change it. Open the PR after meeting with Miquel and making sure all is working as expected.

ZakiaYahya commented 1 year ago

Yeh sure, i'll surely add some slides on that issue for sure. Yes Gemma, with this version of RDkit, i didn't made any change in the code not even a single line, the code is intact as i forked from ersilia. I just changes the version of RDkit in the dockerfile because the previous mentioned version is not working. Yes, once i discussed it with @miquelduranfrigola i'll open the PR for this model too because it's working on my side. Let's see.

ZakiaYahya commented 1 year ago

Hello @GemmaTuron @DhanshreeA I had a meeting with @miquelduranfrigola and he suggested me to do following things, (1) Try to install RDkit 2017 in python 3.6 because it could install in python 3.6 version and then compare the descriptors from both versions (the one my code is working with that gave me length of 208). (2) Then take descriptor keys from RDkit 2017 and stores it in a text file. (3) Pass that saved descriptors keys to the model instead of taking it from newer version of RDkit.

I've tried a lot but didn't able to install it even in python 3.6, i have tried installing it with conda-forge and even with tar.gz but it is not installing. I'll try it again to install RDkit 2017 version.

Specifically with RDkit 2022 version my model is working perfectly fine.

GemmaTuron commented 1 year ago

Hi @ZakiaYahya

We need to compare that the RDKit descriptors are the same between the 2017 and 2022 versions before modifying the model otherwise we might be running into major issues. Can you maybe check if another version (2018 for example) is working and use that for comparison?

ZakiaYahya commented 1 year ago

Hi @GemmaTuron Yeh i'm working on it. Trying to install RDkit 2017 on my system, not successfull yet , i'll try 2018 as well. Once the package is install i'll do the next steps suggested by @miquelduranfrigola. Thanks.

DhanshreeA commented 1 year ago

Hi @ZakiaYahya I'll share my investigation of this issue:

  1. Fetching the model from ersilia indeed fails as well for me both normally and while fetching through repo_path. I've tried fetching it multiple times with the following results:
  1. The paper mentions generating 200 descriptors from RDKit (page 2, para 2):

    Five different molecular representations were examined. The first four assigned a vector of descriptors to each molecule: (1) a set of 200 descriptors from the RDKit library[19], (2) a set of 777 Mold2 descriptors20; (3) a 2,048-bit ECFP4 fingerprint and (4) a binary vector, in which each bit denotes the presence or absence of one of 3,000 maximum common substructures (MCS) appearing most frequently in the drugs and non-drugs datasets.

These 200 descriptors might have come from RDKit 2017, however at the time this model was incorporated, it was retrained with RDKit 2021 version, which seems to have increased the number of descriptors. Interestingly, the 2023 version now has 123 descriptors. Now why this inconsistency exists between descriptors, I'm not quite sure, since the number of interesting physical properties of a small molecule shouldn't really vary. However it seems we are not the only ones facing this issue: https://github.com/rdkit/rdkit/discussions/6151 I think another thing to try out are the steps in this discussion @ZakiaYahya . I would recommend trying out the steps in this discussion before trying to install 2017 version because it's already causing problems.

  1. As for actually installing RDKit, if you do it with pip there's no way to install it with just pip for versions before 2020 as those don't seem to be indexed on PyPI. A tarball would definitely be the way to go. Another way you could try installing it would be to use the rdkit channel instead of conda-forge. Or use the tarballs from these conda channels and install via the conda command. I see that the rdkit channel is running quite compared to conda-forge.

Finally, some notes on comparing descriptors if you manage to get these versions installed. Check the kind of descriptors available across versions, and the actual values returned.

DhanshreeA commented 1 year ago

Finally, I have a concern with the description of this model, and this is addressed to both @GemmaTuron and @miquelduranfrigola. In the paper it seems the authors used five different featurizers: Mold2, RDKit, MCS, EXFP4, Mol2Vec and different models: variational autoencoders, graph conv nets, and simple logistic regression. We seem to have only retrained the VAE with rdkit descriptors, however this is not mentioned in the description explicitly.

To define drug-likeness, a set of 2136 approved drugs from DrugBank was taken as drug-like, and three negative datasets were selected from ZINC15 (19M), the Network of Organic Chemistry (6M) and ligands from the Protein Data Bank (13k), respectively. The drug dataset was combined with an equal subsampling of the negative dataset for each experiment, using five different molecular representations (Mold2, RDKit, MCS, EXFP4, Mol2Vec). We have re-trained it following the author’s specifications.

Should we change this too?

ZakiaYahya commented 1 year ago

Hello @DhanshreeA

Thanks for your detailed explanation. Okay, i'm working on it and following steps i'll take in this regard; (1) I'll surely check the link you shared and try it out as well. Maybe it works. (2) Trying to install version 2017 using conda-forge and somehow managed to install rdkit in python 3.6 environment. But when i tries to import the package, the python is not able to recognize the installed rdkit as it throughs an error "No module named rdkit". Although, i verify installation of RDkit 2017 by listing packages using "conda list". I'm trying to figure it out whether some path causes the problem or what. (3) If the above two solutions didn't workout, i'll try to install the required RDkit 2017 package in Colab and get the descriptor keys list. (4) In case, if all the above solutions fails then i'll go and seek help from some fellow intern to install the RDkit 2017 and provide me descriptor keys in .txt format, But i hope, i'll manage to make it work by myself.

Thanks.

ZakiaYahya commented 1 year ago

Hello @GemmaTuron @miquelduranfrigola @DhanshreeA Here's are few findings on RDkit issue

(1) In original Repo/Paper: they were using RDkit==2017.09.1 using conda-forge. Firstly it is absolete now, not offered by pip anymore but one can install it with conda-forge in python 3.6. We have checked that version and it's giving descriptor length = 200

(2) In Ersilia Forked Version: the contributor amna were somehow used RDkit==2021.03.4 instead of version 2017 using conda-forge, what i understand that she did training on Zinc data from scratch and get mean and std vectors of length 208 instead of 200 as mentioned in original paper/repo.

(3) What i did: Initially i tried to install the same version that Ersilia contributor used i.e. RDkit==2021.03.4 using conda-forge but i didn't able to install it with python 3.7. So, i installed a closer version of RDkit==2021.03.1 using pip-pypi. And it gave me descriptor length = 123 which varies from Ersilia forked repo as well as original repo. That means different RDkit versions have different length of descriptors.

After testing Rdkit==2017 version and it came out 200. So, i again tried to install the version which is mentioned in Ersilia Repo but this time i tried it installing in environment of python 3.6 instead of python 3.7 and i successfully installed the RDkit==2021.03.4 using conda-forge.
Now, i have installed the same version of RDkit from which the ersilia contributor actually got Zinc dataset means and std. So, the question of comparison of RDkit descriptors resolved. I'm installing other dependancies in the same environment that required for model eos9sa2 and once it done i'll run the model both locally and with --repo-path and if it is successfully done i'll open the PR.

@GemmaTuron , @DhanshreeA @miquelduranfrigola if anyone of you have still some suggestions regarding this RDkit matter, kindly let me know. I'll try to work on it. Many Thanks.

ZakiaYahya commented 1 year ago

Hello @GemmaTuron and @miquelduranfrigola The model is working with the RDkit version used by ersilia contributor, should i open the PR?

DhanshreeA commented 1 year ago

(3) What i did: Initially i tried to install the same version that Ersilia contributor used i.e. RDkit==2021.03.4 using conda-forge but i didn't able to install it with python 3.7. So, i installed a closer version of RDkit==2021.03.1 using pip-pypi. And it gave me descriptor length = 123 which varies from Ersilia forked repo as well as original repo. That means different RDkit versions have different length of descriptors.

Hey @ZakiaYahya regarding point 3, could you check out this issue? https://github.com/rdkit/rdkit/discussions/6151 It seems that getting only 123 descriptors is an installation issue. However I am also okay with not spending more time on it if it's working right now.

Have you tested it with ersilia CLI and repo_path flag? If yes and everything looks great, please go ahead and create a PR. :hugs:

ZakiaYahya commented 1 year ago

Hi @DhanshreeA Yes i have checked that issue as well and it suggest to check th RDkit with this command python -c 'from rdkit import RDConfig;print(RDConfig.RDDataDir);from rdkit.Chem import Descriptors;print(len(Descriptors._descList))' and it exactly shows me the right path and same descriptors while in that forum the persons rdkit path is different.

Yes, it is working locally and inside Ersilia with --repo-path. Okay i'll commit the changes and made PR on it . Thanks

ZakiaYahya commented 1 year ago

Hi @GemmaTuron @miquelduranfrigola @DhanshreeA Previously i ran eos9sa2 inside ersilia but i forgot to edit dockerfile which i edited before for another version of rdkit and it havepip install rdkit-pypi==2022.3.1b1 in it, that's why its running within ersilia. Now while making final commits i realized that i didn't changepip install rdkit-pypi==2022.3.1b1 toconda install -c conda-forge rdkit=2021.03.4 and when i changed this , it is running successfully locally in python 3.6 environment but when i tries to run it within Ersilia it fails No module named "rdkit" because ersilia environment is running on python 3.7 and conda install -c conda-forge rdkit=2021.03.4 is not working in python 3.7

Any Suggestions on how to define it in dockerfile because the whole model is working in python 3.6 but not in python 3.7, hence failing within ersilia.

One solution would be the one that @miquelduranfrigola suggested, i'll store the descriptor keys from python 3.6 conda install -c conda-forge rdkit=2021.03.4 and used it in python 3.7 and check whether is it working or not but the question is then which version i will mention in dockerfile?? Because although we pass the keys but we still need RDkit dependancy.

Second solution would be that we compare the descriptor keys from pip install rdkit-pypi==2022.3.1b1 and conda install -c conda-forge rdkit=2021.03.4 if both are same, then i'll mention pip install rdkit-pypi==2022.3.1b1 in dockerfile as it is working with python 3.7.

Let me work on these solutions, i'll update on this soon

ZakiaYahya commented 1 year ago

I've compared the descriptor keys of both pip install rdkit-pypi==2022.3.1b1 and conda install -c conda-forge rdkit=2021.03.4 and it comes out same with keys as well as descriptor length i.e. 208. So, i'm adding pip install rdkit-pypi==2022.3.1b1 in dockerfile as it is compatible with python 3.7 and successfully installing within ersilia as well. Doing final commits now and then will made PR on it.

For reference i'm attaching the descriptor keys from both versions.

RDkit_conda.txt RDkit_pypi.txt

GemmaTuron commented 1 year ago

Fantastic @ZakiaYahya

Great work on getting this right. Let me know when the PR is ready for merging!

ZakiaYahya commented 1 year ago

Hi @GemmaTuron I have pushed the changes and made the PR on it. Kindly check it.

ZakiaYahya commented 1 year ago

Hello @GemmaTuron @DhanshreeA The ARM64 build is failing at "ERROR: No matching distribution found for tensorflow==1.13.2" in github Actions while AMD64 build is successful. I'm working on it why it is failing with tensorflow.

ZakiaYahya commented 1 year ago

Hello @GemmaTuron As upload to docker.yml is updated and working as expected, should i change the workflow file and made PR on this model again?? So, that its image is uploaded to DockerHub?? What you say?

GemmaTuron commented 1 year ago

Hi @ZakiaYahya

We will discuss that in the general meeting today

miquelduranfrigola commented 1 year ago

For the meeting discussion, let's monitor the actions: https://github.com/ersilia-os/eos9sa2/actions/

ZakiaYahya commented 1 year ago

Hello @miquelduranfrigola It's working 👍

miquelduranfrigola commented 1 year ago

@GemmaTuron this model seems to be resolved now - please check

GemmaTuron commented 1 year ago

Hi @miquelduranfrigola and @ZakiaYahya The output to csv seems to be failing, please check the model testing issue and ammend. Thanks

ZakiaYahya commented 1 year ago

@GemmaTuron and @miquelduranfrigola I've checked it again in my system and it is working perfectly, and it's working for @emmakodes as well. Let me check why it's failing for some users. Thanks.

miquelduranfrigola commented 1 year ago

Thanks @ZakiaYahya - let us know if you need anything from us at this stage

ZakiaYahya commented 1 year ago

Thanks @miquelduranfrigola I'll surely let you know :)