Closed GemmaTuron closed 1 year ago
Hello @GemmaTuron,
I have forked and cloned the repo, did necessary changes and when i tried to run run.sh
i got an error. In order to check whether the model is working before the changes, i cloned it again and run run_predict.sh
and got the same error i.e.
Using TensorFlow backend. /home/zakia/miniconda3/envs/model2/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint8 = np.dtype([("qint8", np.int8, 1)]) /home/zakia/miniconda3/envs/model2/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint8 = np.dtype([("quint8", np.uint8, 1)]) /home/zakia/miniconda3/envs/model2/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint16 = np.dtype([("qint16", np.int16, 1)]) /home/zakia/miniconda3/envs/model2/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) /home/zakia/miniconda3/envs/model2/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint32 = np.dtype([("qint32", np.int32, 1)]) /home/zakia/miniconda3/envs/model2/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. np_resource = np.dtype([("resource", np.ubyte, 1)]) processed array of 5 compounds in 0.083587 seconds 208 Traceback (most recent call last): File "/home/zakia/ersilia/eos9sa2/model/framework//code/main.py", line 40, in <module> x = _normalize(data, idx, True) File "/home/zakia/ersilia/eos9sa2/model/framework/model/data_preprocess.py", line 110, in _normalize x=np.nan_to_num((x-mu)/std) ValueError: operands could not be broadcast together with shapes (5,123) (208,)
I passes the input_file of 5 smiles string but the code also take some data from framework/model/data/ folder i.e. zinc15_nondrugs_sample_rdkit_mu.npz and zinc15_nondrugs_sample_rdkit_std. The smiles string after vectorization(converted into it's molecular representation using rdkit) have len(5,123) but the zinc data mean and std values have len(208,). So, due to conflict in lens the following operation gave me above error
x=np.nan_to_num((x-mu)/std)
Hi @ZakiaYahya
In order to solve this, you need to understand why the functions are giving different outputs. the best is for you to print the intermediate steps and try to understand the _normalize function for example, and identify where is the bug coming from. The shape of 208,0 a priori does not make sense if it is a vectorization of several molecules. Have you checked the datasets manually and tried to reproduce the steps? I suggest to try to get as much information as possible to show it to your mentor @DhanshreeA in tomorrow's meeting
Hello @GemmaTuron,
Yes, i use print commands a lot to check how data molecules are converted into descriptors. the descriptor len is (5,123) because i'm passing five smiles as an input. The (208,) is not the len of descriptors, It's the mean and standard deviation vector length of already trained Zinc drugs+nondrugs dataset. They are using means and standard deviation of those already trained dataset to normalize the input data that we are passing to the model for prediction. I'm further digging into the code by seeking help from it's original github code and paper. I'll discuss it with @DhanshreeA in tomorrow's 1:1 meeting if the issue still resides. Thanks.
I'm also trying to fetch model eos9sa2 from Ersilia Model Hub but the fetch fails there as well. I'm working on it. Thanks.
Hey @ZakiaYahya can you post the error log here for the failed fetching from ersilia?
Hello @miquelduranfrigola @GemmaTuron @DhanshreeA I'm running into a problem of mismatched length while performing normalization of descriptors using mean and standard deviation values coming from already trained ZINC dataset. The length of mean and standard deviation is (208, ) while the length of smiles converted to descriptors using rdkit is (5,123) because i'm passing 5 smiles strings as an input. Now the question is that different rdkit has different descriptor lengths which I and @DhanshreeA finds out during 1:1 meeting today
(1) Original repo uses Rdkit 2017 which is now absolete and no longer available.
(2) Ersilia forked repo uses Rdkit 2021. Now in dockerfile it uses conda install -c conda-forge rdkit=2021.03.4
command to install rdkit but i've tried it a lot and it hangs at "collecting metadata". So, i installed the same version using Rdkit pypi. And that version has descriptor length are 123.
(3) Tried Edkit latest 2023 version and the length of descriptors are 209.
Problem: But i want descriptor len of 208 so it aligns with mean and standard deviation vector length for performing normalization. Should i tried varios versions and take one which has 208 len or we can do some changes in code that reduces or resize the len of mean and standard deviation according to smiles descriptor length. What is more appropriate to do?
Hey @ZakiaYahya like we discussed today, let's go through the RDKit documentation/release notes to figure out which version might have 208 descriptors. Also, can you confirm that when you fetch the model from ersilia it's failing with the same error?
Right @DhanshreeA, i'll look into it. No, it's not failing at the same error , it fails due to some dependancy clashes. I'm reproducing the error as i didn't save it in a log file. Once it done, i'll share the log file with you.
@DhanshreeA
I tried fetching it from ersilia again, but it got hangs at somewhere in between, here's the output log
If i reproduced the error, i'll share it with you.
Hi @ZakiaYahya,
Please can you explain what you mean by it got hangs at somewhere in between? Identify the step that is failing and try to trace back the origin of the error.
To help more, the model fetch in my MacBook failed, here I am attaching the error log, please check if you are getting the same error. I'll try to reproduce it in my Ubuntu machine as well out.log
@GemmaTuron
I try to fetch it multiple times and it didn't proceed after Running bash /tmp/ersilia-3i_ms40b/script.sh > /tmp/ersilia-ovftkhab/command_outputs.log 2>&1
although i leave it overnight as well. It seems like it stucks while installing dependancies. Now, i'm trying to fetch it on Colab, let's see what error it produces on Colab, will let you know.
Hi @ZakiaYahya
I suggest opening the temporal file to see what is going on. The temporal files are deleted after usage, so the best is while it is running, to open a second terminal and:
vim /tmp/ersilia-ovftkhab/command_outputs.log
#the tmp file name that is being created
this will allow you to see in real time the error log as it is being created
Something to try as well is to manually reproduce the steps, completely outside Ersilia:
Hi @ZakiaYahya
A few pointers. I have manually cloned the model repository, and created a conda environment with python 3.7.
I have manually installed the requirements (changing the rdkit version because that old one was giving me problems). Before making any changes anywhere I try to run the model from the framework folder
bash run_predict.sh . ~/test.csv ~/out.csv
Please reproduce this and take it from there
Hello @GemmaTuron, I did the following steps to get the results (1) Yes, the old RDkit version causes probem so i installed it using pip pypi. (2) I have created conda environment and installed required packages as well. (3) But the code has some flaws, like in main.py it took means and standard deviation of zinc15_nondrugs_sample_rdkit but it should take the means and standard deviation vectors of smiles that we passed. I go through the original repo and they clearly said that we used zinc15_nondrugs_sample_rdkit means and standard deviations as an example. And these files are hardcoded inside the code not passed an argument. For your info i'm attaching that piece of screenshot
(4) After passing the generated means and standard deviation vectors of the smiles that i'm passing as an input
both using the original repo bash run_predict.sh . ~/test.csv ~/out.csv
as well as
using inside ersilia ersilia -v fetch eos9sa2 --repo_path /home/zakia/ersilia/eos9sa2
, the model is working. Here's
the input and output results
@GemmaTuron I have a meeting with @DhanshreeA in a half an hour, once i discuss with her all the things, after that i'll do any changes. Thanks
Hi @GemmaTuron How can we check or evalutae that the output producing from the model is correct? is there anyway? I want to check whether the output i'm getting is correct or not
Hi @ZakiaYahya
Please, before merging these changes let's clarify about the normalization step. The normalization should always be done against the same parameters, so that the results of the different runs of the model are comparable. If I understand it correctly, you changed the normalization to the values of the smiles you are passing. I think we should use the normalization for the sample molecules that is provided in the original repository.
Hello @GemmaTuron, @DhanshreeA
(1) Yeh right, I had made changes just to check whether the model is working or not. In normalization step, they're
using the mean and standard deviation of another dataset, i just use the mean and std of the samples i'm passing
in normalization step.
(2) I'm now working on RDkit sample size issue as well to find out which RDkit version actually returns (208, ) size
descriptor vector meanwhile waiting for @miquelduranfrigola because he said he knows how to tackle that issue.
(3) Once I make sure whether is it RDkit issue or actually wrong passing of mean and std vector, i'll then make the pull
request but before that i'll let you know Gemma.
Thanks.
Hi @ZakiaYahya About the rdkit version, I am confused since above you show that the model works, so is this solved? which rdkit version are you using?
Hello @GemmaTuron Yes, i have found the version of RDkit which actually returns the descriptor length of (208, ). Without doing any changes in the code, with version of RDkit==2022.3.1b1 the code is working perfectly fine both locally and with ersilia --repo_path as well. Do i open the PR now?? I have a short meeting with @miquelduranfrigola today on the RDkit different versions issue. I'll discuss with him too.
Sure, it would be great if you can prepare a short explanation about the RDkit issue for your presentation on Thursday. Will be of great help for the other interns.
Regarding the normalization step, as I said earlier the normalisation MUST be done against the same parameters always, so the original code where they use the zinc_nondrugs
dataset is correct, do not change it. Open the PR after meeting with Miquel and making sure all is working as expected.
Yeh sure, i'll surely add some slides on that issue for sure. Yes Gemma, with this version of RDkit, i didn't made any change in the code not even a single line, the code is intact as i forked from ersilia. I just changes the version of RDkit in the dockerfile because the previous mentioned version is not working. Yes, once i discussed it with @miquelduranfrigola i'll open the PR for this model too because it's working on my side. Let's see.
Hello @GemmaTuron @DhanshreeA I had a meeting with @miquelduranfrigola and he suggested me to do following things, (1) Try to install RDkit 2017 in python 3.6 because it could install in python 3.6 version and then compare the descriptors from both versions (the one my code is working with that gave me length of 208). (2) Then take descriptor keys from RDkit 2017 and stores it in a text file. (3) Pass that saved descriptors keys to the model instead of taking it from newer version of RDkit.
I've tried a lot but didn't able to install it even in python 3.6, i have tried installing it with conda-forge and even with tar.gz but it is not installing. I'll try it again to install RDkit 2017 version.
Specifically with RDkit 2022 version my model is working perfectly fine.
Hi @ZakiaYahya
We need to compare that the RDKit descriptors are the same between the 2017 and 2022 versions before modifying the model otherwise we might be running into major issues. Can you maybe check if another version (2018 for example) is working and use that for comparison?
Hi @GemmaTuron Yeh i'm working on it. Trying to install RDkit 2017 on my system, not successfull yet , i'll try 2018 as well. Once the package is install i'll do the next steps suggested by @miquelduranfrigola. Thanks.
Hi @ZakiaYahya I'll share my investigation of this issue:
EmptyOutputError
from the model (because rdkit didn't get installed as was evident from an import error for it.)Five different molecular representations were examined. The first four assigned a vector of descriptors to each molecule: (1) a set of 200 descriptors from the RDKit library[19], (2) a set of 777 Mold2 descriptors20; (3) a 2,048-bit ECFP4 fingerprint and (4) a binary vector, in which each bit denotes the presence or absence of one of 3,000 maximum common substructures (MCS) appearing most frequently in the drugs and non-drugs datasets.
These 200 descriptors might have come from RDKit 2017, however at the time this model was incorporated, it was retrained with RDKit 2021 version, which seems to have increased the number of descriptors. Interestingly, the 2023 version now has 123 descriptors. Now why this inconsistency exists between descriptors, I'm not quite sure, since the number of interesting physical properties of a small molecule shouldn't really vary. However it seems we are not the only ones facing this issue: https://github.com/rdkit/rdkit/discussions/6151 I think another thing to try out are the steps in this discussion @ZakiaYahya . I would recommend trying out the steps in this discussion before trying to install 2017 version because it's already causing problems.
Finally, some notes on comparing descriptors if you manage to get these versions installed. Check the kind of descriptors available across versions, and the actual values returned.
Finally, I have a concern with the description of this model, and this is addressed to both @GemmaTuron and @miquelduranfrigola. In the paper it seems the authors used five different featurizers: Mold2, RDKit, MCS, EXFP4, Mol2Vec and different models: variational autoencoders, graph conv nets, and simple logistic regression. We seem to have only retrained the VAE with rdkit descriptors, however this is not mentioned in the description explicitly.
To define drug-likeness, a set of 2136 approved drugs from DrugBank was taken as drug-like, and three negative datasets were selected from ZINC15 (19M), the Network of Organic Chemistry (6M) and ligands from the Protein Data Bank (13k), respectively. The drug dataset was combined with an equal subsampling of the negative dataset for each experiment, using five different molecular representations (Mold2, RDKit, MCS, EXFP4, Mol2Vec). We have re-trained it following the author’s specifications.
Should we change this too?
Hello @DhanshreeA
Thanks for your detailed explanation. Okay, i'm working on it and following steps i'll take in this regard; (1) I'll surely check the link you shared and try it out as well. Maybe it works. (2) Trying to install version 2017 using conda-forge and somehow managed to install rdkit in python 3.6 environment. But when i tries to import the package, the python is not able to recognize the installed rdkit as it throughs an error "No module named rdkit". Although, i verify installation of RDkit 2017 by listing packages using "conda list". I'm trying to figure it out whether some path causes the problem or what. (3) If the above two solutions didn't workout, i'll try to install the required RDkit 2017 package in Colab and get the descriptor keys list. (4) In case, if all the above solutions fails then i'll go and seek help from some fellow intern to install the RDkit 2017 and provide me descriptor keys in .txt format, But i hope, i'll manage to make it work by myself.
Thanks.
Hello @GemmaTuron @miquelduranfrigola @DhanshreeA Here's are few findings on RDkit issue
(1) In original Repo/Paper: they were using RDkit==2017.09.1
using conda-forge
. Firstly it is absolete now, not offered by pip anymore but one can install it with conda-forge in python 3.6. We have checked that version and it's giving descriptor length = 200
(2) In Ersilia Forked Version: the contributor amna were somehow used RDkit==2021.03.4
instead of version 2017 using conda-forge
, what i understand that she did training on Zinc data from scratch and get mean and std vectors of length 208
instead of 200
as mentioned in original paper/repo.
(3) What i did: Initially i tried to install the same version that Ersilia contributor used i.e. RDkit==2021.03.4
using conda-forge
but i didn't able to install it with python 3.7. So, i installed a closer version of RDkit==2021.03.1
using pip-pypi
. And it gave me descriptor length = 123
which varies from Ersilia forked repo as well as original repo. That means different RDkit versions have different length of descriptors.
After testing Rdkit==2017 version and it came out 200. So, i again tried to install the version which is mentioned in Ersilia Repo but this time i tried it installing in environment of python 3.6 instead of python 3.7 and i successfully installed the RDkit==2021.03.4
using conda-forge
.
Now, i have installed the same version of RDkit from which the ersilia contributor actually got Zinc dataset means and std. So, the question of comparison of RDkit descriptors resolved. I'm installing other dependancies in the same environment that required for model eos9sa2 and once it done i'll run the model both locally and with --repo-path and if it is successfully done i'll open the PR.
@GemmaTuron , @DhanshreeA @miquelduranfrigola if anyone of you have still some suggestions regarding this RDkit matter, kindly let me know. I'll try to work on it. Many Thanks.
Hello @GemmaTuron and @miquelduranfrigola The model is working with the RDkit version used by ersilia contributor, should i open the PR?
(3) What i did: Initially i tried to install the same version that Ersilia contributor used i.e.
RDkit==2021.03.4
usingconda-forge
but i didn't able to install it with python 3.7. So, i installed a closer version ofRDkit==2021.03.1
usingpip-pypi
. And it gave medescriptor length = 123
which varies from Ersilia forked repo as well as original repo. That means different RDkit versions have different length of descriptors.
Hey @ZakiaYahya regarding point 3, could you check out this issue? https://github.com/rdkit/rdkit/discussions/6151 It seems that getting only 123 descriptors is an installation issue. However I am also okay with not spending more time on it if it's working right now.
Have you tested it with ersilia CLI and repo_path flag? If yes and everything looks great, please go ahead and create a PR. :hugs:
Hi @DhanshreeA
Yes i have checked that issue as well and it suggest to check th RDkit with this command python -c 'from rdkit import RDConfig;print(RDConfig.RDDataDir);from rdkit.Chem import Descriptors;print(len(Descriptors._descList))'
and it exactly shows me the right path and same descriptors while in that forum the persons rdkit path is different.
Yes, it is working locally and inside Ersilia with --repo-path. Okay i'll commit the changes and made PR on it . Thanks
Hi @GemmaTuron @miquelduranfrigola @DhanshreeA
Previously i ran eos9sa2 inside ersilia but i forgot to edit dockerfile which i edited before for another version of rdkit and it havepip install rdkit-pypi==2022.3.1b1
in it, that's why its running within ersilia. Now while making final commits i realized that i didn't changepip install rdkit-pypi==2022.3.1b1
toconda install -c conda-forge rdkit=2021.03.4
and when i changed this , it is running successfully locally in python 3.6 environment but when i tries to run it within Ersilia it fails No module named "rdkit"
because ersilia environment is running on python 3.7 and conda install -c conda-forge rdkit=2021.03.4
is not working in python 3.7
Any Suggestions on how to define it in dockerfile because the whole model is working in python 3.6 but not in python 3.7, hence failing within ersilia.
One solution would be the one that @miquelduranfrigola suggested, i'll store the descriptor keys from python 3.6 conda install -c conda-forge rdkit=2021.03.4
and used it in python 3.7 and check whether is it working or not but the question is then which version i will mention in dockerfile?? Because although we pass the keys but we still need RDkit dependancy.
Second solution would be that we compare the descriptor keys from pip install rdkit-pypi==2022.3.1b1
and conda install -c conda-forge rdkit=2021.03.4
if both are same, then i'll mention pip install rdkit-pypi==2022.3.1b1
in dockerfile as it is working with python 3.7.
Let me work on these solutions, i'll update on this soon
I've compared the descriptor keys of both pip install rdkit-pypi==2022.3.1b1
and conda install -c conda-forge rdkit=2021.03.4
and it comes out same with keys as well as descriptor length i.e. 208. So, i'm adding pip install rdkit-pypi==2022.3.1b1
in dockerfile as it is compatible with python 3.7 and successfully installing within ersilia as well. Doing final commits now and then will made PR on it.
For reference i'm attaching the descriptor keys from both versions.
Fantastic @ZakiaYahya
Great work on getting this right. Let me know when the PR is ready for merging!
Hi @GemmaTuron I have pushed the changes and made the PR on it. Kindly check it.
Hello @GemmaTuron @DhanshreeA The ARM64 build is failing at "ERROR: No matching distribution found for tensorflow==1.13.2" in github Actions while AMD64 build is successful. I'm working on it why it is failing with tensorflow.
Hello @GemmaTuron
As upload to docker.yml
is updated and working as expected, should i change the workflow file and made PR on this model again?? So, that its image is uploaded to DockerHub?? What you say?
Hi @ZakiaYahya
We will discuss that in the general meeting today
For the meeting discussion, let's monitor the actions: https://github.com/ersilia-os/eos9sa2/actions/
Hello @miquelduranfrigola It's working 👍
@GemmaTuron this model seems to be resolved now - please check
Hi @miquelduranfrigola and @ZakiaYahya The output to csv seems to be failing, please check the model testing issue and ammend. Thanks
@GemmaTuron and @miquelduranfrigola I've checked it again in my system and it is working perfectly, and it's working for @emmakodes as well. Let me check why it's failing for some users. Thanks.
Thanks @ZakiaYahya - let us know if you need anything from us at this stage
Thanks @miquelduranfrigola I'll surely let you know :)
Hi @ZakiaYahya
Please check that the model is up to date with the new version of the template and all the workflows are in place.