Closed GemmaTuron closed 1 year ago
Hello @GemmaTuron, the Colab log file and the Model Test on push (which failed) both show that rdkit didn't install because of conflicts. So when I cloned the repo, I changed it to install rdkit using pip.
However, on fetching the model locally, the log didn't show the exact error, only Status code: 500 (eos2thm_fetch_repo.log):
17:49:49 | DEBUG | Meta: None
17:49:49 | DEBUG | Posting to predict
17:49:49 | DEBUG | Batch size 100
17:49:49 | DEBUG | Schema not yet available
17:50:09 | DEBUG | Status code: 500
17:50:09 | ERROR | Status Code: 500
17:50:09 | WARNING | Batch prediction didn't seem to work. Doing predictions one by one...
17:50:15 | DEBUG | Status code: 500
17:50:15 | ERROR | Status Code: 500
17:50:21 | DEBUG | Status code: 500
17:50:21 | ERROR | Status Code: 500
When I tested with predict.py directly, I got a size mismatch error:
(eos2thm) hellenah@hellenah-elitebook:~/Outreachy/eos2thm/model/framework$ python predict.py ~/test.csv out.csv
INFO: Setting num_physchem_properties to 123.
INFO: Setting num_physchem_properties to 123.
Traceback (most recent call last):
File "predict.py", line 14, in <module>
mdl = MolBertFeaturizer(path_to_checkpoint)
File "/home/hellenah/Outreachy/eos2thm/model/framework/molbert/utils/featurizer/molbert_featurizer.py", line 66, in __init__
self.model.load_from_checkpoint(self.checkpoint_path, hparam_overrides=self.model.__dict__)
File "/home/hellenah/anaconda3/envs/eos2thm/lib/python3.7/site-packages/pytorch_lightning/core/saving.py", line 169, in load_from_checkpoint
model = cls._load_model_state(checkpoint, *args, **kwargs)
File "/home/hellenah/anaconda3/envs/eos2thm/lib/python3.7/site-packages/pytorch_lightning/core/saving.py", line 207, in _load_model_state
model.load_state_dict(checkpoint['state_dict'])
File "/home/hellenah/anaconda3/envs/eos2thm/lib/python3.7/site-packages/torch/nn/modules/module.py", line 830, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for SmilesMolbertModel:
size mismatch for model.tasks.1.physchem_head.physchem_clf.3.weight: copying a param with shape torch.Size([200, 768]) from checkpoint, the shape in current model is torch.Size([123, 768]).
size mismatch for model.tasks.1.physchem_head.physchem_clf.3.bias: copying a param with shape torch.Size([200]) from checkpoint, the shape in current model is torch.Size([123]).
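The mismatch in the traceback above can be inspected without building the model. The following is only a sketch: the helper names `shape_mismatches` and `load_checkpoint_shapes` are hypothetical, and it assumes the checkpoint is a regular PyTorch Lightning file with a `state_dict` entry, as the traceback suggests.

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {param_name: (checkpoint_shape, model_shape)} for parameters whose shapes differ."""
    mismatches = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        # Parameters absent from the model are skipped; only shape conflicts are reported.
        if model_shape is not None and tuple(model_shape) != tuple(ckpt_shape):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shape))
    return mismatches


def load_checkpoint_shapes(checkpoint_path):
    """Read parameter shapes from a checkpoint file (hypothetical helper, needs torch)."""
    import torch  # imported lazily so shape_mismatches works without torch installed

    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    return {name: tuple(t.shape) for name, t in checkpoint["state_dict"].items()}
```

With the shapes from the traceback, `shape_mismatches` would report the physchem head as `((200, 768), (123, 768))`, i.e. the checkpoint was trained with 200 physchem properties while the current environment sets 123.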
I suspected the cause was the number of descriptors in rdkit 2021, because the original authors used rdkit 2019.
Type "help", "copyright", "credits" or "license" for more information.
>>> import rdkit
>>> rdkit.__version__
'2021.03.1'
>>> from rdkit.Chem import Descriptors
>>> desc = dict(Descriptors.descList)
>>> len(desc)
123
>>>
Any rdkit version with fewer than 200 descriptors won't work. For example, 2021.03.2:
>>> import rdkit
>>> rdkit.__version__
'2021.03.2'
>>> from rdkit.Chem import Descriptors
>>> desc = dict(Descriptors.descList)
>>> len(desc)
115
>>>
rdkit 2020.09.5 didn't work either:
>>> import rdkit
>>> rdkit.__version__
'2020.09.5'
>>> from rdkit.Chem import Descriptors
>>> desc = dict(Descriptors.descList)
>>> len(desc)
123
>>>
So I tried to find the best version of rdkit. Versions 2021.03.4, 2021.09.1, 2022.09.1 and 2022.09.5 all have 208 descriptors, and they all worked.
>>> import rdkit
>>> rdkit.__version__
'2022.09.5'
>>> from rdkit.Chem import Descriptors
>>> desc = dict(Descriptors.descList)
>>> len(desc)
208
>>>
The new output when using 208 descriptors is the same as the original output (data.csv and pred.csv are in the repository). Test data: data.csv. Original output: pred.csv. New output: test_output.csv.
I understand this may be truncating the descriptors to 200 instead of 208, but comparing the outputs shows it really works. However, if you know a specific rdkit version above 2020 whose descriptors have a length of 200 (it's no longer possible to install those below 2020.09.5), that would be best.
For now I am taking 2022.09.5.
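The claim that the new output equals the original one can be checked cell by cell. Here is a minimal sketch, assuming both files are plain CSVs with the same layout; the function name `csv_outputs_match` and the tolerances are my own choices, not something from the repository.

```python
import csv
import math


def csv_outputs_match(path_a, path_b, rel_tol=1e-6, abs_tol=1e-8):
    """Return True if two CSV files have the same layout and numerically close cells."""
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        rows_a, rows_b = list(csv.reader(fa)), list(csv.reader(fb))
    if len(rows_a) != len(rows_b):
        return False
    for row_a, row_b in zip(rows_a, rows_b):
        if len(row_a) != len(row_b):
            return False
        for cell_a, cell_b in zip(row_a, row_b):
            try:
                a, b = float(cell_a), float(cell_b)
            except ValueError:
                # Non-numeric cells (headers, SMILES, keys) must match exactly.
                if cell_a != cell_b:
                    return False
                continue
            if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True
```

For example, `csv_outputs_match("pred.csv", "test_output.csv")` would compare the two files mentioned above. A small tolerance is used because embeddings computed in different environments may differ in the last decimal places.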
Thanks Hellen for this detailed summary. There is a major problem in doing that: if the set of descriptors does not match the original set used, we might get incorrect predictions (i.e., if position 0 in the descriptor vector was Molecular Weight, it should still be that).
@miquelduranfrigola and @ZakiaYahya, can you have a look? How did we resolve this issue previously? Do we have the set of descriptors for previous versions so that we can compare?
Hi @GemmaTuron, I set up an environment with the packages specified in the original repo. I had to downgrade conda in order to install rdkit 2019 with 200 descriptors (which messes up other conda packages). Compared with rdkit 2022, which has 208 descriptors, the 8 that are missing in 2019 are:
Missing descriptors in rdkit_2019_03_1_descriptors:
Line 18: BCUT2D_MWHI
Line 19: BCUT2D_MWLOW
Line 20: BCUT2D_CHGHI
Line 21: BCUT2D_CHGLO
Line 22: BCUT2D_LOGPHI
Line 23: BCUT2D_LOGPLOW
Line 24: BCUT2D_MRHI
Line 25: BCUT2D_MRLOW
Regarding the model output: while the 768 embeddings returned by the model don't have feature labels, they are the same for rdkit versions with 200+ descriptors (I compared these with the output from the original repo).
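A comparison like the one above can be reproduced by dumping the descriptor names in each environment and diffing the two lists. This is a sketch under assumptions: the helper names are hypothetical, and `rdkit_descriptor_names` has to be run once per environment (for example, saving the names to a text file) because two rdkit versions cannot be imported side by side.

```python
def rdkit_descriptor_names():
    """Descriptor names in the order RDKit lists them (requires RDKit to be installed)."""
    from rdkit.Chem import Descriptors  # lazy import so the helpers below work without RDKit

    return [name for name, _ in Descriptors.descList]


def missing_with_positions(reference_names, other_names):
    """List (1-based position, name) for names in reference_names absent from other_names."""
    other = set(other_names)
    return [(i, name) for i, name in enumerate(reference_names, start=1) if name not in other]


def same_relative_order(shorter_names, longer_names):
    """True if longer_names, with the extra names dropped, equals shorter_names."""
    keep = set(shorter_names)
    return [n for n in longer_names if n in keep] == list(shorter_names)
```

`missing_with_positions(names_2022, names_2019)` would list the descriptors that only exist in the newer version together with their positions, and `same_relative_order(names_2019, names_2022)` checks that the shared descriptors still appear in the same order.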
Hello @GemmaTuron @HellenNamulinda
In my case the original source repo used rdkit 2017 with conda-forge, and for that I needed the 208 descriptors. The 2017 version is obsolete now; it can be installed on python-3.6 but not on python-3.7. But I found that rdkit version 2022.3.1b1 has the same number of descriptors, in the same order, so I used the newer 2022 version instead of the 2017 version.
Thanks @ZakiaYahya, that is very helpful. @HellenNamulinda, could we try that version and check again that the descriptors given are the same and that the outputs of the original and new models coincide? Thanks both!
Hi @GemmaTuron,
As commented above, of all the versions checked so far, only rdkit 2019 has 200 descriptors. Most have 208, others 123 or 115. The latest rdkit 2023 has 209.
I attached the output of the original model (rdkit 2019.03.1) and this model's output (rdkit 2022.09.5). While the output is the same, I highlighted the 8 descriptors missing in 2019.
I'm not sure how rdkit 2021.03.1, which is in the Dockerfile, worked before. It has 123 descriptors and the model only accepts versions with 200+.
Also, was the pred.csv in the framework folder provided by the model authors? @miquelduranfrigola
Hi @HellenNamulinda
I don't know how the model can give the exact same output when passing different descriptors. What is the position of the 8 missing descriptors in rdkit 2019? Are they at the end of the list or mixed in between? @miquelduranfrigola, we need to be sure about these changes; this is potentially very dangerous.
- Test file: data.csv
- Original code(rdkit 2019)_200 descriptors: molbert_output.csv
- Model output(rdkit 2022)_208 descriptors: test_output.csv
The missing descriptors are at lines 18 to 25 (the eight BCUT2D_* descriptors listed above), so they are mixed in between rather than at the end of the list.
@GemmaTuron and @miquelduranfrigola, here is something interesting I just found out, and the reason why all rdkit versions with 200+ descriptors work. The 200 descriptors in rdkit 2019 were divided into subsets, that is, simple_descriptors, refractivity_descriptors and others. These can be seen in the file model/framework/molbert/utils/featurizer/molfeaturizer.py
So rdkit versions with 200+ descriptors all include the descriptors that were explicitly specified, which would explain why the extra ones do not affect the output.
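To illustrate why explicitly naming descriptors makes the result independent of how many descriptors rdkit ships, here is a small sketch. It is not the molbert code: `select_descriptors`, `featurize` and the example name list are hypothetical, and `available` stands in for rdkit's `Descriptors.descList`, which is a list of `(name, function)` pairs.

```python
def select_descriptors(available, wanted_names):
    """Pick descriptor functions by name, in the order given by wanted_names.

    available: iterable of (name, function) pairs, e.g. rdkit's Descriptors.descList.
    Raises KeyError if a requested descriptor is not available.
    """
    by_name = dict(available)
    missing = [name for name in wanted_names if name not in by_name]
    if missing:
        raise KeyError(f"descriptors not available: {missing}")
    return [(name, by_name[name]) for name in wanted_names]


def featurize(mol, selected):
    """Compute the selected descriptors for one molecule, preserving their order."""
    return [fn(mol) for _, fn in selected]
```

With a fixed list such as `["MolWt", "TPSA"]` (chosen here only as an example), the feature vector has the same length and order whether the installed rdkit offers 200 or 208 descriptors, and a version lacking a requested name fails loudly instead of silently shifting positions.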
I see, this is quite interesting, thanks @HellenNamulinda
I agree with @GemmaTuron that this is potentially very dangerous, so thanks for taking the time to look into this with so much detail, @HellenNamulinda
Since in molbert the fingerprints were explicitly specified (molfeaturizer), I see no danger in upgrading rdkit.
Hello @GemmaTuron and @miquelduranfrigola, I understand the danger, and I am glad we now know why the output for the different versions is the same.
So I think we can open a PR and merge the changes? Thanks for the work Hellen!
Hi @GemmaTuron, I have created a PR
Hi @Gemma, this model happens not to be working. I tried it using Colab: eos2thm_colab_fetch.log
And git clone has been taking a long time.
I just checked and realized the checkpoint is almost 1GB: https://github.com/ersilia-os/eos2thm/blob/main/model/checkpoints/molbert_100epochs/checkpoints/last.ckpt So I will give it time to download.