Clean UP & Dockerization eos2thm

HellenNamulinda commented 1 year ago

Hi @Gemma, this model is not working. I tried it in Colab: eos2thm_colab_fetch.log

Model API eos2thm:predict did not produce an output
/root/eos/repository/eos2thm/20230704093754_649836/eos2thm/artifacts/framework/predict.py: line 1: import: command not found
/root/eos/repository/eos2thm/20230704093754_649836/eos2thm/artifacts/framework/predict.py: line 2: import: command not found
/root/eos/repository/eos2thm/20230704093754_649836/eos2thm/artifacts/framework/predict.py: line 3: import: command not found
/root/eos/repository/eos2thm/20230704093754_649836/eos2thm/artifacts/framework/predict.py: line 5: syntax error near unexpected token `"."'
/root/eos/repository/eos2thm/20230704093754_649836/eos2thm/artifacts/framework/predict.py: line 5: `sys.path.append(".")'
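
The `import: command not found` lines above are what a POSIX shell prints when it tries to execute a Python file as a shell script, so predict.py was apparently run by the shell rather than by a Python interpreter. Why that happened is not established in this thread. As a minimal sketch (the helper name `run_with_python` is hypothetical and not part of Ersilia), invoking the script through an explicit interpreter avoids this class of failure:

```python
import subprocess
import sys


def run_with_python(script_path, *args):
    """Run a Python script with the current interpreter, never via the shell."""
    # sys.executable is the absolute path of the running interpreter, so the
    # script cannot be picked up and interpreted by bash/sh by mistake.
    cmd = [sys.executable, str(script_path), *map(str, args)]
    return subprocess.run(cmd, capture_output=True, text=True)
```

For example, `run_with_python("predict.py", "input.csv", "output.csv")` returns a result whose `returncode` and `stderr` can be inspected for the real Python traceback.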

Also, git clone has been taking a long time.

hellenah@hellenah-elitebook:~/Outreachy$ git clone https://github.com/HellenNamulinda/eos2thm.git
Cloning into 'eos2thm'...
remote: Enumerating objects: 133, done.
remote: Counting objects: 100% (133/133), done.
remote: Compressing objects: 100% (104/104), done.
remote: Total 133 (delta 27), reused 104 (delta 14), pack-reused 0
Receiving objects: 100% (133/133), 97.08 KiB | 345.00 KiB/s, done.
Resolving deltas: 100% (27/27), done.

I just checked and realized the checkpoint is almost 1 GB: https://github.com/ersilia-os/eos2thm/blob/main/model/checkpoints/molbert_100epochs/checkpoints/last.ckpt So I will give it time to download.

HellenNamulinda commented 1 year ago

Hello @GemmaTuron, the Colab log file and the Model Test on push (which failed) show that rdkit didn't install because of conflicts. So when I cloned the repo, I changed it to install rdkit using pip.

However, on fetching the model locally, the log didn't show the exact error, only Status code: 500. eos2thm_fetch_repo.log

17:49:49 | DEBUG    | Meta: None
17:49:49 | DEBUG    | Posting to predict
17:49:49 | DEBUG    | Batch size 100
17:49:49 | DEBUG    | Schema not yet available
17:50:09 | DEBUG    | Status code: 500
17:50:09 | ERROR    | Status Code: 500
17:50:09 | WARNING  | Batch prediction didn't seem to work. Doing predictions one by one...
17:50:15 | DEBUG    | Status code: 500
17:50:15 | ERROR    | Status Code: 500
17:50:21 | DEBUG    | Status code: 500
17:50:21 | ERROR    | Status Code: 500

When I tested with predict.py directly, I got a size mismatch error:

(eos2thm) hellenah@hellenah-elitebook:~/Outreachy/eos2thm/model/framework$ python predict.py ~/test.csv out.csv
INFO: Setting num_physchem_properties to 123.
INFO: Setting num_physchem_properties to 123.
Traceback (most recent call last):
  File "predict.py", line 14, in <module>
    mdl = MolBertFeaturizer(path_to_checkpoint)
  File "/home/hellenah/Outreachy/eos2thm/model/framework/molbert/utils/featurizer/molbert_featurizer.py", line 66, in __init__
    self.model.load_from_checkpoint(self.checkpoint_path, hparam_overrides=self.model.__dict__)
  File "/home/hellenah/anaconda3/envs/eos2thm/lib/python3.7/site-packages/pytorch_lightning/core/saving.py", line 169, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, *args, **kwargs)
  File "/home/hellenah/anaconda3/envs/eos2thm/lib/python3.7/site-packages/pytorch_lightning/core/saving.py", line 207, in _load_model_state
    model.load_state_dict(checkpoint['state_dict'])
  File "/home/hellenah/anaconda3/envs/eos2thm/lib/python3.7/site-packages/torch/nn/modules/module.py", line 830, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for SmilesMolbertModel:
        size mismatch for model.tasks.1.physchem_head.physchem_clf.3.weight: copying a param with shape torch.Size([200, 768]) from checkpoint, the shape in current model is torch.Size([123, 768]).
        size mismatch for model.tasks.1.physchem_head.physchem_clf.3.bias: copying a param with shape torch.Size([200]) from checkpoint, the shape in current model is torch.Size([123]).
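
The traceback reports that the checkpoint stores a 200-wide physchem head while the freshly built model expects 123. A minimal sketch of that comparison, using plain dictionaries of parameter name to shape (the function name `shape_mismatches` is hypothetical; with PyTorch the shapes could be collected from a state dict via `tuple(tensor.shape)`):

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters that differ."""
    mismatches = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        # Parameters absent from the model are skipped; only size conflicts are reported.
        if model_shape is not None and tuple(model_shape) != tuple(ckpt_shape):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shape))
    return mismatches


# Shapes copied from the error message above.
checkpoint = {
    "model.tasks.1.physchem_head.physchem_clf.3.weight": (200, 768),
    "model.tasks.1.physchem_head.physchem_clf.3.bias": (200,),
}
current = {
    "model.tasks.1.physchem_head.physchem_clf.3.weight": (123, 768),
    "model.tasks.1.physchem_head.physchem_clf.3.bias": (123,),
}
for name, (saved, built) in shape_mismatches(checkpoint, current).items():
    print(f"{name}: checkpoint {saved} vs model {built}")
```

The first dimension of both parameters is the number of physchem outputs, which matches the `Setting num_physchem_properties to 123` line in the log above.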

I suspected this to be an issue with the number of descriptors in rdkit 2021, because the original authors used rdkit 2019.

Type "help", "copyright", "credits" or "license" for more information.
>>> import rdkit
>>> rdkit.__version__
'2021.03.1'
>>> from rdkit.Chem import Descriptors
>>> desc = dict(Descriptors.descList)
>>> len(desc)
123
>>>

Any rdkit version with fewer than 200 descriptors won't work, for example 2021.03.2:

>>> import rdkit
>>> rdkit.__version__
'2021.03.2'
>>> from rdkit.Chem import Descriptors
>>> desc = dict(Descriptors.descList)
>>> len(desc)
115
>>>

rdkit 2020.09.5 didn't work either:

>>> import rdkit
>>> rdkit.__version__
'2020.09.5'
>>> from rdkit.Chem import Descriptors
>>> desc = dict(Descriptors.descList)
>>> len(desc)
123
>>>

So I tried to find the best version of rdkit. Versions 2021.03.4, 2021.09.1, 2022.09.1 and 2022.09.5 have 208 descriptors rather than 200, yet they all worked.

>>> import rdkit
>>> rdkit.__version__
'2022.09.5'
>>> from rdkit.Chem import Descriptors
>>> desc = dict(Descriptors.descList)
>>> len(desc)
208
>>>

The new output when using 208 descriptors is the same as the original output (data.csv and pred.csv are in the repository).
Test data: data.csv
Original output: pred.csv
New output: test_output.csv

I understand this may be truncating the length to 200 instead of 208, but it really works when we compare the outputs. However, if you know a specific rdkit version above 2020 whose descriptors have a length of 200, that would be best (it's no longer possible to install versions below 2020.09.5).

For now I am using 2022.09.5.

GemmaTuron commented 1 year ago

Thanks Hellen for this detailed summary. There is a major problem in doing that: if the set of descriptors does not match the original set, we might get incorrect predictions (for example, if position 0 in the descriptor list was Molecular Weight, it should still be that).

@miquelduranfrigola and @ZakiaYahya can you have a look? How did we resolve this issue previously? Do we have the set of descriptors for previous versions so that we can compare?

HellenNamulinda commented 1 year ago

Hi @GemmaTuron, I set up an environment with the packages specified in the original repo. I had to downgrade conda in order to install rdkit 2019, which has 200 descriptors (this messes up other conda packages). Compared with rdkit 2022, which has 208 descriptors, the 8 that are missing in 2019 are:

Missing descriptors in rdkit_2019_03_1_descriptors:
Line 18: BCUT2D_MWHI
Line 19: BCUT2D_MWLOW
Line 20: BCUT2D_CHGHI
Line 21: BCUT2D_CHGLO
Line 22: BCUT2D_LOGPHI
Line 23: BCUT2D_LOGPLOW
Line 24: BCUT2D_MRHI
Line 25: BCUT2D_MRLOW
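
One way to make this comparison reproducible is to diff the two descriptor name lists and also check that the names shared by both versions appear in the same relative order. The sketch below is illustrative: `compare_descriptor_lists` is a hypothetical helper, and the short lists are toy stand-ins for the real ones, which could be obtained in each environment with `[name for name, _ in Descriptors.descList]`.

```python
def compare_descriptor_lists(old_names, new_names):
    """Compare two ordered lists of descriptor names.

    Returns (missing_in_old, missing_in_new, same_order), where same_order is
    True when the names common to both lists appear in the same relative order.
    """
    old_set, new_set = set(old_names), set(new_names)
    missing_in_old = [n for n in new_names if n not in old_set]
    missing_in_new = [n for n in old_names if n not in new_set]
    shared_old = [n for n in old_names if n in new_set]
    shared_new = [n for n in new_names if n in old_set]
    return missing_in_old, missing_in_new, shared_old == shared_new


# Toy stand-ins for the 200- and 208-name lists (not the real lists).
old = ["MolWt", "HeavyAtomMolWt", "NumValenceElectrons"]
new = ["MolWt", "BCUT2D_MWHI", "BCUT2D_MWLOW", "HeavyAtomMolWt", "NumValenceElectrons"]

added, removed, same_order = compare_descriptor_lists(old, new)
print(added)       # ['BCUT2D_MWHI', 'BCUT2D_MWLOW']
print(removed)     # []
print(same_order)  # True
```

Even when `same_order` is True, names inserted in the middle shift the index of every later descriptor, so any code that selects descriptors by position rather than by name would still be affected.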

Regarding the model output: while the 768 embeddings returned by the model don't have feature labels, they are the same for rdkit versions with 200+ descriptors (I compared these with the output from the original repo).

Test file: data.csv
Original code (rdkit 2019, 200 descriptors): molbert_output.csv
Model output (rdkit 2022, 208 descriptors): test_output.csv

ZakiaYahya commented 1 year ago

Hello @GemmaTuron @HellenNamulinda, in my case the original source repo used rdkit 2017 from conda-forge, and for that I needed the 208 descriptors. The 2017 version is obsolete now; it can be installed on Python 3.6 but not on Python 3.7. But I found that rdkit version 2022.3.1b1 has the same number of descriptors in the same order, so I used the newer 2022 version instead of the 2017 version.

GemmaTuron commented 1 year ago

Thanks @ZakiaYahya, that is very helpful. @HellenNamulinda, could we try that version and check again that the descriptors given are the same and that the outputs of the original and new models coincide? Thanks both!
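
A minimal sketch of the output check, comparing two prediction files cell by cell with a numeric tolerance. The file layout is an assumption (both files have the same rows and columns); `csv_outputs_match` and the tolerance values are hypothetical and should be adapted to the actual files.

```python
import csv
import math


def csv_outputs_match(path_a, path_b, rel_tol=1e-6, abs_tol=1e-8):
    """Return True when two CSV files agree cell by cell.

    Cells that both parse as floats are compared with a tolerance; all other
    cells (headers, identifiers) must be identical strings.
    """
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        rows_a, rows_b = list(csv.reader(fa)), list(csv.reader(fb))
    if len(rows_a) != len(rows_b):
        return False
    for row_a, row_b in zip(rows_a, rows_b):
        if len(row_a) != len(row_b):
            return False
        for cell_a, cell_b in zip(row_a, row_b):
            try:
                if not math.isclose(float(cell_a), float(cell_b),
                                    rel_tol=rel_tol, abs_tol=abs_tol):
                    return False
            except ValueError:
                # At least one cell is not numeric: require an exact match.
                if cell_a != cell_b:
                    return False
    return True
```

With the files named in this thread, the call would be `csv_outputs_match("molbert_output.csv", "test_output.csv")`. A tolerance is used instead of exact equality because floating-point results can differ in the last digits across library versions.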

HellenNamulinda commented 1 year ago

Hi @GemmaTuron, as commented above, of all the versions checked so far, only rdkit 2019 has 200 descriptors. Most have 208, others 123 or 115, and the latest rdkit 2023 has 209. I attached the output of the original model (rdkit 2019.03.1) and this model's output (rdkit 2022.09.5). While the output is the same, I highlighted the 8 descriptors missing in 2019.

I'm not sure how rdkit 2021.03.1, which is in the Dockerfile, worked before. It has 123 descriptors, and the model only accepts versions with 200+.

Also, was the pred.csv in the frameworks folder provided by the model authors? @miquelduranfrigola

GemmaTuron commented 1 year ago

Hi @HellenNamulinda

I don't know how the model can give the exact same output when passing different descriptors. What is the position of the 8 missing descriptors in rdkit 2019? Are they at the end of the list or mixed in between? @miquelduranfrigola we need to be sure about these changes, as they are potentially very dangerous.

HellenNamulinda commented 1 year ago

Hi @GemmaTuron, I set up an environment with the packages specified in the original repo. I had to downgrade conda in order to install rdkit 2019, which has 200 descriptors (this messes up other conda packages). Compared with rdkit 2022, which has 208 descriptors, the 8 that are missing in 2019 are:
Missing descriptors in rdkit_2019_03_1_descriptors:
Line 18: BCUT2D_MWHI
Line 19: BCUT2D_MWLOW
Line 20: BCUT2D_CHGHI
Line 21: BCUT2D_CHGLO
Line 22: BCUT2D_LOGPHI
Line 23: BCUT2D_LOGPLOW
Line 24: BCUT2D_MRHI
Line 25: BCUT2D_MRLOW
rdkit_2022_09_5_descriptors.csv

rdkit_2019_03_1_descriptors.csv

comparison_output.txt

Regarding the model output: while the 768 embeddings returned by the model don't have feature labels, they are the same for rdkit versions with 200+ descriptors (I compared these with the output from the original repo).

Test file: data.csv

Original code (rdkit 2019, 200 descriptors): molbert_output.csv

Model output (rdkit 2022, 208 descriptors): test_output.csv

The missing descriptors are from 18 to 25

@GemmaTuron and @miquelduranfrigola, here is something interesting I just found out, and the reason why all rdkit versions with 200+ descriptors work. The 200 descriptors in rdkit 2019 were divided into subsets, that is, simple_descriptors, refractivity_descriptors and others. These can be seen in the file model/framework/molbert/utils/featurizer/molfeaturizer.py

So rdkit versions with 200+ descriptors still include the descriptors that were explicitly specified there.
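
The mechanism described above can be sketched as follows: when features are computed from an explicit list of names, any extra descriptors that a newer rdkit adds are never looked up, so the feature vector keeps the same length and order. This is an illustrative sketch, not the code in molfeaturizer.py; `featurize`, `REQUIRED_NAMES`, the dummy values, and the stand-in registries are hypothetical, and with rdkit the registry would be `dict(Descriptors.descList)`.

```python
def featurize(mol, registry, required_names):
    """Compute features in the fixed order given by required_names.

    registry maps descriptor name -> function; names not in required_names
    are ignored, and a missing required name fails loudly.
    """
    missing = [n for n in required_names if n not in registry]
    if missing:
        raise KeyError(f"descriptors not available: {missing}")
    return [registry[name](mol) for name in required_names]


# Stand-in registries with dummy values: the "new" one has an extra
# descriptor in the middle, as rdkit 2022 does relative to rdkit 2019.
REQUIRED_NAMES = ["MolWt", "NumHDonors"]
old_registry = {"MolWt": lambda m: 1.0, "NumHDonors": lambda m: 2.0}
new_registry = {"MolWt": lambda m: 1.0,
                "BCUT2D_MWHI": lambda m: 3.0,
                "NumHDonors": lambda m: 2.0}

# Both registries give the same vector because selection is by name.
print(featurize("CCO", old_registry, REQUIRED_NAMES))
print(featurize("CCO", new_registry, REQUIRED_NAMES))
```

Raising on a missing name is a design choice in this sketch: it turns a silent change in the feature vector into an immediate error.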

miquelduranfrigola commented 1 year ago

I see, this is quite interesting, thanks @HellenNamulinda

I agree with @GemmaTuron that this is potentially very dangerous, so thanks for taking the time to look into this with so much detail, @HellenNamulinda

Since in molbert the fingerprints were explicitly specified (molfeaturizer), I see no danger in upgrading rdkit.

HellenNamulinda commented 1 year ago

Hello @GemmaTuron and @miquelduranfrigola, I understand the danger, and I am glad we now know why the output for the different versions is the same.

GemmaTuron commented 1 year ago

So I think we can open a PR and merge the changes? Thanks for the work Hellen!

HellenNamulinda commented 1 year ago

Hi @GemmaTuron, I have created a PR

ersilia-os / eos2thm

Clean UP & Dockerization eos2thm #6