🦠 Model Request: Prediction of Aqueous Kinetic Solubility

paulinebanye commented 1 year ago

Model Name

Aqueous Kinetic Solubility

Model Description

Aqueous solubility is one of the most important properties in drug discovery, as it has a profound impact on various drug properties, including biological activity, pharmacokinetics (PK), toxicity, and in vivo efficacy. This model predicts the aqueous kinetic solubility of a compound.

Slug

aqueous-kinetic-solubility

Tag

solubility, ADME

Publication

https://pubmed.ncbi.nlm.nih.gov/31176566/

Source Code

https://github.com/ncats/ncats-adme

License

None

GemmaTuron commented 1 year ago

/approve

github-actions[bot] commented 1 year ago

New Model Repository Created! 🎉

@pauline-banye, the Ersilia model repository has been successfully created and is available at:

🔗 ersilia-os/eos74bo

Next Steps ⭐

Now that your new model repository has been created, you are ready to start contributing to it!

Here are some brief starter steps for contributing to your new model repository:

Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository

🍴 Get started by creating a fork of your new model repository - docs
👯 Clone your forked repository - docs
✏️ Make edits to your new forked model repository - docs - Edits might include:
- Updating the README.md file to accurately describe your model
- Adding source code for your model
- Adding documentation for your model
🚀 Open a Pull Request from your forked repository to the original repository. This will allow you to bring your local changes into the new ersilia model repository that was just created! - docs

Additional Resources 📚

If you have any questions, please feel free to open an issue and get support from the community!

GemmaTuron commented 1 year ago

Hi @pauline-banye, can you update me? Are we finally moving the solubility model to this new repo to bypass the git-lfs installs? Let me know when the old repo is ready to be deleted, so we avoid duplication!

paulinebanye commented 1 year ago

Hi @GemmaTuron, yes, you can delete the old repo. I have all the code, and I have forked the new repo and transferred the code to the clone.

However, I seem to be having an issue with FPSim2, which I believe is caused by my conda environment. When I encountered it, I tested the model in a virtual environment in a different terminal (Git Bash) and it worked flawlessly.

I am currently trying to resolve the issues with the FPSim2 dependency in my conda environment.

GemmaTuron commented 1 year ago

@pauline-banye, following our discussion:

- FPSim2 only supports Python 3.8 or above, so the conda env should use Python 3.8.
- The output of the model (stable/unstable) does not match the expected soluble/insoluble output, which is confusing. Can you clarify whether we are using the right model and, if so, what stable/unstable stands for?
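
A quick way to confirm the first point from inside the conda environment is to check the interpreter version before installing FPSim2. This is a minimal sketch; the helper name `python_is_supported` is hypothetical and the 3.8 threshold is the one stated in the comment above.

```python
import sys

# Minimum version taken from the comment above (FPSim2 needs Python 3.8 or newer).
MIN_PYTHON = (3, 8)


def python_is_supported(version=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the (major, minor) part of `version` meets `minimum`."""
    return tuple(version[:2]) >= minimum


current = sys.version.split()[0]
if python_is_supported():
    print(f"Python {current} is new enough for FPSim2")
else:
    print(f"Python {current} is too old; recreate the environment with Python 3.8+")
```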

paulinebanye commented 1 year ago

My apologies @GemmaTuron , it is supposed to be high solubility and low solubility. I was remiss in editing that string in the repo but I have corrected it.

paulinebanye commented 1 year ago

Thank you @GemmaTuron. I deleted the previous repo and all conda environments. The new fork and environment have been upgraded to Python 3.8.

GemmaTuron commented 1 year ago

Perfect, thanks @pauline-banye ! Let me know how it goes with the changes!

paulinebanye commented 1 year ago

Hi @GemmaTuron, I'm still getting failures in the checks. After some time spent debugging, I believe it could be an issue with the Python path: the dependencies are installed, but the checks still return a module not found error.

GemmaTuron commented 1 year ago

Hi @pauline-banye, let's try to debug this today. I was able to run it successfully on my system. Can you have the following ready:

- List of dependencies you are installing, either manually or through .yml
- List of dependencies you see in your conda env (with versions)
- Python path of the conda environment
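
The last two items can be collected from within the environment itself, which also helps with a "module not found" error when the package appears to be installed. The following is a minimal sketch using only the standard library; the helper name `describe_module` is hypothetical, and it assumes the import name and the installed distribution name are the same (true for some packages, not all).

```python
import importlib.util
import sys
from importlib import metadata


def describe_module(name):
    """Report whether top-level module `name` is importable here, and its version."""
    spec = importlib.util.find_spec(name)
    try:
        # Assumption: the distribution is registered under the same name as the module.
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = None
    return {"module": name, "importable": spec is not None, "version": version}


# Which interpreter is actually running, and where it looks for packages.
print("executable:", sys.executable)
print("prefix:    ", sys.prefix)
for entry in sys.path:
    print("sys.path:  ", entry)

# FPSim2 is the dependency named in this thread; add others as needed.
print(describe_module("FPSim2"))
```

If `executable` does not point inside the conda environment, the checks are running with a different interpreter than the one the dependencies were installed into.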

paulinebanye commented 1 year ago

Hi @GemmaTuron, thank you so much! I would really appreciate it 🙏. I have pushed the current dependencies to the forked repository, but I still need to export the dependencies from the current environment. I will send an update once I have updated the repo.

paulinebanye commented 1 year ago

Hi @GemmaTuron,

As requested, I returned the exact values from the solubility model. These tests were carried out using two different lists of SMILES.

[x] Full output without rounding: eml_sol_full.csv, input_sol_full.csv
[x] I repeated the model test within the Ersilia CLI using the repo_path command: eos74bo_list_run.csv, eos74bo_run.csv
[x] I also compared the output received from NCATS with the output returned by the eos74bo solubility model, rounded to two decimal places.

Output from the original NCATS code: eml_sol_ADME_Predictions_2023-02-14-115722.csv, input.sol_ADME_Predictions_2023-02-14-121509.csv

Output from eos74bo (with the results rounded to two decimal places)

output_df from input.csv

Solubility: 0.08900904655456543 seconds to predict 11 molecules
                                       smiles Predicted Class (Probability)       Prediction
0                          CC(=O)Nc1nnc(S(N)(=O)=O)s1                       0 (1.0)  high solubility
1                                         CCCOCCCCCCC                       1 (1.0)   low solubility
2                                           CCCCOCCCC                      1 (0.74)   low solubility
3                             CC(=O)N[C@@H](CS)C(=O)O                       0 (1.0)  high solubility
4                               CC(=O)Oc1ccccc1C(=O)O                      0 (0.98)  high solubility
5                                             CC(=O)O                      1 (0.79)   low solubility
6                              O=c1ncnc2[nH][nH]cc1-2                      1 (0.95)   low solubility
7                                          CCCCNCCCCC                       0 (1.0)  high solubility
8       Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1                       0 (1.0)  high solubility
9                      CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1                      1 (0.99)   low solubility
10  C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43...                      0 (0.86)  high solubility
[0.0, 1.0, 0.74, 0.0, 0.020000000000000018, 0.79, 0.95, 0.0, 0.0, 0.99, 0.14]

input_sol.csv

output_df from eml.csv

Solubility: 0.5354588031768799 seconds to predict 10 molecules
                                      smiles Predicted Class (Probability)       Prediction
0      Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1                       0 (1.0)  high solubility
1                         CC(=O)Nc1nnc(S(N)(=O)=O)s1                       0 (1.0)  high solubility
2                                            CC(=O)O                      1 (0.79)   low solubility
3                            CC(=O)N[C@@H](CS)C(=O)O                       0 (1.0)  high solubility
4                              CC(=O)Oc1ccccc1C(=O)O                      0 (0.98)  high solubility
5                       Nc1nc(=O)c2ncn(COCCO)c2[nH]1                       0 (1.0)  high solubility
6  O=C(O[C@H]1C[N+]2(CCCOc3ccccc3)CCC1CC2)C(O)(c1...                      0 (0.99)  high solubility
7  CN(C)C/C=C/C(=O)Nc1cc2c(Nc3ccc(F)c(Cl)c3)ncnc2...                      0 (0.74)  high solubility
8                     CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1                      1 (0.99)   low solubility
9                             O=c1ncnc2[nH][nH]cc1-2                      1 (0.95)   low solubility
[0.0, 0.0, 0.79, 0.0, 0.020000000000000018, 0.0, 0.010000000000000009, 0.26, 0.99, 0.95]

eml_sol.csv
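
The list printed under each table is the probability of class 1 (low solubility), while the "Predicted Class (Probability)" column shows the probability of whichever class was predicted. The sketch below, with a hypothetical helper `proba_of_class_1`, shows how one maps to the other and why values such as 0.020000000000000018 appear: they are the floating-point result of `1.0 - 0.98`.

```python
import re

# Matches cells such as "0 (0.98)" or "1 (0.79)" from the tables above.
_CELL = re.compile(r"^\s*([01])\s*\(([0-9.]+)\)\s*$")


def proba_of_class_1(cell):
    """Convert a 'class (probability)' cell into the probability of class 1."""
    match = _CELL.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    predicted_class = int(match.group(1))
    probability = float(match.group(2))
    # The printed probability belongs to the predicted class,
    # so rows predicted as class 0 are complemented.
    return probability if predicted_class == 1 else 1.0 - probability


cells = ["0 (1.0)", "1 (0.79)", "0 (0.98)", "0 (0.86)"]
raw = [proba_of_class_1(cell) for cell in cells]
rounded = [round(value, 2) for value in raw]
print(raw)
print(rounded)
```

Rounding after the subtraction removes the floating-point tail without changing the reported value.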

GemmaTuron commented 1 year ago

Hi @pauline-banye! Thanks for this. I am a bit confused because each file has a different name, so I don't know which result corresponds to what. I want to make sure that we are always returning the probability of class 1. Can you let me know what you get when predicting the following molecules: CC(=O)Oc1ccccc1C(=O)O and CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1 with the Ersilia repo and with the original code?

Thanks!

GemmaTuron commented 1 year ago

Looking a bit more into the results, I now understand that the output we return is the last line printed under each table, for example [0.0, 0.0, 0.79, 0.0, 0.020000000000000018, 0.0, 0.010000000000000009, 0.26, 0.99, 0.95] in the last one, right?

So that should be fine.

GemmaTuron commented 1 year ago

Two comments just to close this: can we name the column "proba1" instead of "value" in the Ersilia Model Hub output? And could the output have a SMILES column plus a proba1 column, for easier interpretation of the results?

paulinebanye commented 1 year ago

Hi @GemmaTuron, I noticed that the results in the printed output differed from what was returned as the probability. I felt this was misleading, so I decided to return the exact figures in the output. To do this, I had to extract the data, and these were the steps I performed:

[x] I assigned the output values to a new column in the dataframe named proba1.

[x] I extracted the proba1 column into a CSV as the end result.


Solubility: 0.11705160140991211 seconds to predict 10 molecules
                                          smiles Predicted Class (Probability)       Prediction    proba1
0      Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1        0 (0.9992302207974717)  high solubility  0.000770
1                         CC(=O)Nc1nnc(S(N)(=O)=O)s1           0 (0.9997545259248)  high solubility  0.000245
2                                            CC(=O)O        1 (0.7889353036880493)   low solubility  0.788935
3                            CC(=O)N[C@@H](CS)C(=O)O        0 (0.9989036553306505)  high solubility  0.001096
4                              CC(=O)Oc1ccccc1C(=O)O        0 (0.9783817026764154)  high solubility  0.021618
5                       Nc1nc(=O)c2ncn(COCCO)c2[nH]1        0 (0.9998792609330849)  high solubility  0.000121
6  O=C(O[C@H]1C[N+]2(CCCOc3ccccc3)CCC1CC2)C(O)(c1...        0 (0.9916039435192943)  high solubility  0.008396
7  CN(C)C/C=C/C(=O)Nc1cc2c(Nc3ccc(F)c(Cl)c3)ncnc2...        0 (0.7405502796173096)  high solubility  0.259450
8                     CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1        1 (0.9934565424919128)   low solubility  0.993457
9                             O=c1ncnc2[nH][nH]cc1-2        1 (0.9541448950767517)   low solubility  0.954145


[eml_output.csv](https://github.com/ersilia-os/ersilia/files/10743474/eml_output.csv)
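
The two steps described in this comment can be sketched with the standard library as follows. This is an illustration only, not the code in the eos74bo repository: the function names are hypothetical, the three rows are copied from the table above, and the CSV is written to an in-memory buffer instead of a file.

```python
import csv
import io

# (SMILES, predicted class, probability of the predicted class), from the table above.
rows = [
    ("CC(=O)O", 1, 0.7889353036880493),
    ("CC(=O)Oc1ccccc1C(=O)O", 0, 0.9783817026764154),
    ("CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1", 1, 0.9934565424919128),
]


def to_proba1(predicted_class, probability):
    """Probability of class 1 (low solubility) for one prediction."""
    return probability if predicted_class == 1 else 1.0 - probability


def write_proba1_csv(records, handle):
    """Write a single-column CSV with a 'proba1' header, one row per molecule."""
    writer = csv.writer(handle)
    writer.writerow(["proba1"])
    for _smiles, predicted_class, probability in records:
        writer.writerow([to_proba1(predicted_class, probability)])


buffer = io.StringIO()
write_proba1_csv(rows, buffer)
print(buffer.getvalue())
```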

paulinebanye commented 1 year ago

> Hi @pauline-banye! Thanks for this. I am a bit confused because each file has a different name, so I don't know which result corresponds to what. I want to make sure that we are always returning the probability of class 1. Can you let me know what you get when predicting the following molecules: CC(=O)Oc1ccccc1C(=O)O and CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1 with the Ersilia repo and with the original code?
>
> Thanks!

@GemmaTuron, I'm so sorry I made it confusing. I was a bit overzealous and tried to account for all the possible test cases. Let me attempt to clarify.

- I ran the tests with two different lists of SMILES.
- I repeated the tests with the original NCATS code and within Ersilia.
- I also ran them both with the exact output values and with values rounded to 2 decimal places.

This is the output from Ersilia when tested with CC(=O)Oc1ccccc1C(=O)O and CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1. eos74bo_run.csv
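
One way to make this kind of comparison unambiguous is to key every result by SMILES and check the values against each other with a tolerance. The sketch below compares the rounded and exact eos74bo values reported in this thread for the two requested molecules; the NCATS values are only in the attached CSV files, so they are not reproduced here, but the same hypothetical `mismatches` helper would apply to them.

```python
import math

# Probability of class 1, rounded to two decimals (from the rounded output in this thread).
rounded_run = {
    "CC(=O)Oc1ccccc1C(=O)O": 0.02,
    "CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1": 0.99,
}
# Probability of class 1 as shown in the proba1 column in this thread.
exact_run = {
    "CC(=O)Oc1ccccc1C(=O)O": 0.021618,
    "CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1": 0.993457,
}


def mismatches(first, second, abs_tol=0.005):
    """Return SMILES missing from one run or differing by more than `abs_tol`."""
    bad = []
    for smiles in sorted(set(first) | set(second)):
        if smiles not in first or smiles not in second:
            bad.append(smiles)
        elif not math.isclose(first[smiles], second[smiles], rel_tol=0.0, abs_tol=abs_tol):
            bad.append(smiles)
    return bad


print(mismatches(rounded_run, exact_run))
```

The default tolerance of 0.005 is half a unit in the second decimal place, which is the largest difference that rounding to two decimals alone can introduce.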

paulinebanye commented 1 year ago

> Two comments just to close this: can we name the column "proba1" instead of "value" in the Ersilia Model Hub output? And could the output have a SMILES column plus a proba1 column, for easier interpretation of the results?

@GemmaTuron, Ersilia actually already includes the SMILES in the output it returns. I did try it out as you suggested, but it returned errors that would possibly require editing the Ersilia codebase. Below is the error that I encountered when I added the SMILES as part of the result.

File "/home/pauline/ersilia/ersilia/cli/commands/api.py", line 37, in api
    api_name=api_name, input=input, output=output, batch_size=batch_size
  File "/home/pauline/ersilia/ersilia/core/model.py", line 343, in api
    api_name=api_name, input=input, output=output, batch_size=batch_size
  File "/home/pauline/ersilia/ersilia/core/model.py", line 357, in api_task
    for r in result:
  File "/home/pauline/ersilia/ersilia/core/model.py", line 184, in _api_runner_iter
    for result in api.post(input=input, output=output, batch_size=batch_size):
  File "/home/pauline/ersilia/ersilia/serve/api.py", line 330, in post
    results, output, model_id=self.model_id, api_name=self.api_name
  File "/home/pauline/ersilia/ersilia/io/output.py", line 283, in adapt
    df = self._to_dataframe(result)
  File "/home/pauline/ersilia/ersilia/io/output.py", line 229, in _to_dataframe
    output_keys_expanded = self.__expand_output_keys(vals, output_keys)
  File "/home/pauline/ersilia/ersilia/io/output.py", line 197, in __expand_output_keys
    t = self._guess_pure_dtype_if_absent(v)
  File "/home/pauline/ersilia/ersilia/io/output.py", line 181, in _guess_pure_dtype_if_absent
    return dtype["type"]
TypeError: 'NoneType' object is not subscriptable
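
The last frame of the traceback shows `dtype["type"]` being evaluated while `dtype` is `None`. The sketch below reproduces that class of error with a hypothetical stand-in, `guess_dtype`; it is not Ersilia's implementation, and the idea that a string value is what leaves the dtype unset is an assumption based on the description above.

```python
def guess_dtype(value):
    """Hypothetical dtype guesser: returns None for values it cannot classify."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return {"type": "numeric"}
    # Assumption: a SMILES string mixed into a numeric output ends up here.
    return None


def dtype_name(value):
    """Same lookup, but with an explicit error instead of subscripting None."""
    dtype = guess_dtype(value)
    if dtype is None:
        raise TypeError(f"cannot infer an output dtype for {value!r}")
    return dtype["type"]


print(dtype_name(0.993457))

try:
    # Subscripting the None result reproduces the message seen in the traceback.
    guess_dtype("CC(=O)O")["type"]
except TypeError as error:
    print("TypeError:", error)
```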

GemmaTuron commented 1 year ago

Thanks for testing! @miquelduranfrigola, what do you think? Should we just give the number as output?

GemmaTuron commented 1 year ago

This model is completed.
