✍️ Contribution period: Adrian Orioki

whoisorioki commented 1 year ago

Week 1 - Get to know the community

[x] Join the communication channels
[x] Open a GitHub issue (this one!)
[x] Install the Ersilia Model Hub and test the simplest model
[x] Write a motivation statement to work at Ersilia
[x] Submit your first contribution to the Outreachy site

Week 2 - Install and run an ML model

[x] Select a model from the suggested list
[x] Install the model in your system
[x] Run predictions for the EML
[x] Compare results with the Ersilia Model Hub implementation!

Week 3 - Propose new models

[x] Suggest a new model and document it (1)
[x] Suggest a new model and document it (2)
[x] Suggest a new model and document it (3)

Week 4 - Prepare your final application

[x] Submit the final application in the Outreachy website

whoisorioki commented 1 year ago

WEEK 1

I am using WSL (Ubuntu 22.04.1) and I managed to install the Ersilia Model Hub and test the eos3b5e model and it worked perfectly : )

{
    "input": {
        "key": "IJDNQMDRQITEOD-UHFFFAOYSA-N",
        "input": "CCCC",
        "text": "CCCC"
    },
    "output": {
        "mw": 58.123999999999995
    }
}

whoisorioki commented 1 year ago

Motivation statement to work at Ersilia

My name is Adrian Orioki from Kenya. I am currently an undergraduate (3rd year) student pursuing BSc. Mathematics and Computer Science. I am privileged to be one of the Outreachy's applicant who was accepted to proceed to the contribution period. As a student, I am enthusiastic about Data Science, especially the field of AI/ML. I have been able to learn about data analysis, programming & statistics over the years partly through my bachelor's degree and partly through self-learning. I also have skills of teamwork, problem-solving and communication.

I have always loved science and my ambition has always to be able to use science to solve problems that affect humanity. I came across Ersilia while browsing through the Outreachy project list and one of the things that made me interested in it were the skill descriptions due to my interest in AI/ML. I was intrigued by the project after visiting its page. Looking through the project's descriptions, the documentation and the official website compelled me to join Ersilia as a contributor. As a Kenyan, I have succumbed to various diseases such as Malaria and witnessed a large number of people infected with diseases for which research has been limited. This is due to the limited resources for research in the country. I want to be able to uphold Ersilia initiatives/goals and become one of the scientists who are able to not only contribute but also use the ML models to solve these very named issues and other related problems.

I also want to learn new things in the field of AI/ML so as to help in my career advancement. I want to achieve all the learning goals listed in the Outreachy's website.

I believe contributing at Ersilia will advance my career in the following ways:

Establishing a platform for peer and mentor learning
Providing resources for learning AI/ML
Aid in the deployment of ML models in practice
Help me to learn new technologies and skills i.e., contributing to a FOSS community

My plan during the internship is follow the structured curriculum (way of learning and contributing) hoping to be able to do everything with help of peers and mentors where needed. I will also be following the community and contributing guidelines. After the internship I plan to continue learning and also continue to be in touch with the peers and mentors that would have been acquainted with in the internship period. Not to forget, finish my undergraduate studies : )

GemmaTuron commented 1 year ago

Hi @whoisorioki

Welcome to Ersilia and looking forward to your contributions

whoisorioki commented 1 year ago

Thank you @GemmaTuron : )

whoisorioki commented 1 year ago

WEEK 2

Here are the updates:

I chose to work with the NCATS Rat Liver Microsomal Stability. When I was going through the steps in the documentation, I encountered an error while trying to create an environment to work with the model.

At first, I tried with my WSL(Ubuntu) it gave this error which I thought was due to space:

(base) whoisorioki@Orioki:~/ncats-adme/server$ conda env create --prefix ./env -f environment.yml
Collecting package metadata (repodata.json): done
Solving environment: \ Killed

Then I went ahead and tried on my Windows OS, here is the error that I got:

(base) C:\Users\Orioki\Desktop\Ersilia\Models\ncats-adme\server>conda env create --prefix ./env -f environment.yml
Collecting package metadata (repodata.json): done
Solving environment: failed
ResolvePackageNotFound:
   gunicorn=20.0.4

I went ahead and looked for a solution at Stack Overflow here and here, none of them worked.
I am now trying to clone the repository branch for development. Reference from #73

whoisorioki commented 1 year ago

Update:

Managed to create the environment through the development branch clone : )

Encountered an error while running the app. Here is the error:

Traceback (most recent call last):
File "app.py", line 1, in <module>
import flask
File "C:\Users\Orioki\desktop\ersilia\models\ncats-adme\server\env\lib\site-packages\flask\__init__.py", line 14, in <module>
from jinja2 import escape
ImportError: cannot import name 'escape' from 'jinja2' (C:\Users\Orioki\desktop\ersilia\models\ncats-adme\server\env\lib\site-packages\jinja2\__init__.py)

This was the first error after running the app.py. I managed to fix it through uninstalling the flask then reinstalling it.
I did the same to the subsequent errors (according to the module causing the error).

whoisorioki commented 1 year ago

Update:

After long hours of waiting, I managed to install the model in my system (Windows 11). All the models (Rat Liver Microsomal Stability, PAMPA 7.4 Permeability, PAMPA 5.0 Permeability, Solubility, Human Liver Cytosol Stability , CYP450 isozymes - CYP2C9, CYP2D6, CYP3A4) were installed.

Finished loading CYP450 model files
* Serving Flask app 'app'
* Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:5000
* Running on http://192.168.0.103:5000
Press CTRL+C to quit

I am currently working on running the predictions, I hope I am able to complete the two remaining tasks before the third week : )

whoisorioki commented 1 year ago

Hey @GemmaTuron here are the updates:

Predictions for the EML

I ran the app.py and used the pretrained RLM graph convolutional neural network model to do predictions for the EML. It contained 442 columns of drugs with their smiles and can smiles. The model predicts their metabolic stability which can either be stable (0) or unstable (1). The in vitro half-life (t1/2) approach was the one used to determine the metabolic stability. It assesses the rate of substrate-depletion while measuring the time required for half of the substrate to deplete under controlled laboratory conditions. The stable compounds had t1/2 > 30 min while the unstable had t1/2 <= 30 min. In the EML, the number of stable compounds were 258/442 and the number of unstable compounds were the remaining 184/442. The model also shows the probability score (between 1 and 0) that represents the estimated likelihood. Here are the predictions in CSV format: ADME_Predictions_2023-03-17-204320.csv ADME_Predictions_2023-03-18-144728.csv ADME_Predictions_2023-03-18-144814.csv

Loading RLM graph convolutional neural network model
Loading pretrained parameter "encoder.encoder.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.W_i.weight".
Loading pretrained parameter "encoder.encoder.W_h.weight".
Loading pretrained parameter "encoder.encoder.W_o.weight".
Loading pretrained parameter "encoder.encoder.W_o.bias".
Loading pretrained parameter "ffn.1.weight".
Loading pretrained parameter "ffn.1.bias".
Loading pretrained parameter "ffn.4.weight".
Loading pretrained parameter "ffn.4.bias".
Finished loading RLM model files
Loading PAMPA graph convolutional neural network model
Model File Exists Locally
Loading pretrained parameter "encoder.encoder.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.W_i.weight".
Loading pretrained parameter "encoder.encoder.W_h.weight".
Loading pretrained parameter "encoder.encoder.W_o.weight".
Loading pretrained parameter "encoder.encoder.W_o.bias".
Loading pretrained parameter "ffn.1.weight".
Loading pretrained parameter "ffn.1.bias".
Loading pretrained parameter "ffn.4.weight".
Loading pretrained parameter "ffn.4.bias".
Finished loading PAMPA 7.4 models
Loading PAMPA graph convolutional neural network model
Loading pretrained parameter "encoder.encoder.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.W_i.weight".
Loading pretrained parameter "encoder.encoder.W_h.weight".
Loading pretrained parameter "encoder.encoder.W_o.weight".
Loading pretrained parameter "encoder.encoder.W_o.bias".
Loading pretrained parameter "ffn.1.weight".
Loading pretrained parameter "ffn.1.bias".
Loading pretrained parameter "ffn.4.weight".
Loading pretrained parameter "ffn.4.bias".
Finished loading PAMPA 5.0 models
Loading Solubility graph convolutional neural network model
Model File Exists Locally
Loading pretrained parameter "encoder.encoder.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.W_i.weight".
Loading pretrained parameter "encoder.encoder.W_h.weight".
Loading pretrained parameter "encoder.encoder.W_o.weight".
Loading pretrained parameter "encoder.encoder.W_o.bias".
Loading pretrained parameter "ffn.1.weight".
Loading pretrained parameter "ffn.1.bias".
Loading pretrained parameter "ffn.4.weight".
Loading pretrained parameter "ffn.4.bias".
Finished loading Solubility models
Loading human liver cytosol stability random forest models
100%|█████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 38.96it/s] 
Finished loading human liver cytosol stability models
Loading CYP450 random forest models
100%|███████████████████████████████████████████████████████████████| 64/64 [00:03<00:00, 16.39it/s] 
100%|███████████████████████████████████████████████████████████████| 64/64 [00:05<00:00, 10.80it/s] 
100%|███████████████████████████████████████████████████████████████| 64/64 [00:04<00:00, 13.06it/s] 
100%|███████████████████████████████████████████████████████████████| 64/64 [00:09<00:00,  6.86it/s] 
100%|███████████████████████████████████████████████████████████████| 64/64 [00:04<00:00, 13.71it/s] 
100%|███████████████████████████████████████████████████████████████| 64/64 [00:10<00:00,  5.98it/s] 
100%|█████████████████████████████████████████████████████████████████| 6/6 [00:39<00:00,  6.58s/it] 
Finished loading CYP450 model files
 * Serving Flask app 'app'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://192.168.0.103:5000
Press CTRL+C to quit

Comparison with the Ersilia models

I cloned the Ersilia eos5505 model for it to be available on my machine. I located the main file in eos5505\model\framework\code and ran it. I had problems reading the input file from the terminal as an argument, so I went ahead and commented on the input_file and output_file then replaced the input file with the location to the EML file in the system and also the output file with the file that I wanted the predictions to be written to.

# input_file = sys.argv[1]
# output_file = sys.argv[2]

# read SMILES from .csv file, assuming one column with header
with open('C:\\Users\\Orioki\\Desktop\\Ersilia\\Models\\eos5505\\model\\framework\\code\\eml_canonical.csv', "r") as f:
    reader = csv.reader(f)
    next(reader) # skip header
    smiles_list = [r[1] for r in reader]

# write output in a .csv file
with open('C:\\Users\\Orioki\\Desktop\\Ersilia\\Models\\eos5505\\model\\framework\\code\\prediction.csv', "w") as f:
    writer = csv.writer(f)
    writer.writerow(["value"]) # header
    for o in outputs:
        writer.writerow([o])

I also had to change smiles_list = [r[0] for r in reader] to smiles_list = [r[1] for r in reader] since the smiles are in column 1 in the EML. Without that you get this error: ValueError: Please provide a list of kekule smiles

After the successful running, you get this output:

<predictors.rlm.rlm_predictor.RLMPredictior object at 0x00000248BEC56550>
100%|███████████████████████████████████████████| 442/442 [00:03<00:00, 144.91it/s]
RLM: 3.3348450660705566 seconds to predict 442 molecules
    Predicted Class (Probability) Prediction
0                        0 (0.95)     stable
1                        1 (0.71)   unstable
2                         0 (1.0)     stable
3                         0 (1.0)     stable
4                         0 (1.0)     stable
..                            ...        ...
437                      1 (0.53)   unstable
438                       1 (1.0)   unstable
439                       0 (1.0)     stable
440                       0 (1.0)     stable
441                       0 (1.0)     stable

[442 rows x 2 columns]
                                                smiles  ... Prediction
0        Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1  ...     stable
1    C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(...  ...   unstable
2                         CC(=O)Nc1sc(nn1)[S](N)(=O)=O  ...     stable
3                                              CC(O)=O  ...     stable
4                              CC(=O)N[C@@H](CS)C(O)=O  ...     stable
..                                                 ...  ...        ...
437             CC(=O)CC(c1ccccc1)C2=C(O)Oc3ccccc3C2=O  ...   unstable
438                    Cc1cc(cc(C)c1CC2=NCCN2)C(C)(C)C  ...   unstable
439  CC1=CN([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)C...  ...     stable
440                         [Zn++].[O-][S]([O-])(=O)=O  ...     stable
441             O.OC(Cn1ccnc1)([P](O)(O)=O)[P](O)(O)=O  ...     stable

[442 rows x 5 columns]
    Predicted Class (Probability) Prediction
0                        0 (0.95)     stable
1                        1 (0.71)   unstable
2                         0 (1.0)     stable
3                         0 (1.0)     stable
4                         0 (1.0)     stable
..                            ...        ...
437                      1 (0.53)   unstable
438                       1 (1.0)   unstable
439                       0 (1.0)     stable
440                       0 (1.0)     stable
441                       0 (1.0)     stable

[442 rows x 2 columns]

Compared to head (5) and the tail (5) of the RLM graph convolutional neural network model by NCATS, the predictions are exactly the same. The problem encountered was that the file that the prediction was written to is not exactly readable as a csv file. Here is the file: prediction.csv.

Later on, I transformed the output to a pandas data frame and modified the code to write the predictions to a csv file as required:

# transform output to pandas dataframe
df = pd.DataFrame(output_df)
print(df.columns)

# write output in a .csv file
with open('C:\\Users\\Orioki\\Desktop\\Ersilia\\Models\\eos5505\\model\\framework\\code\\prediction.csv', "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['smiles', 'Predicted Class (Probability)', 'Prediction']) # columns
    for row in df.itertuples(index=False):
        writer.writerow(row)

Here is the csv file: prediction.csv

It has the same predictions as the predictions of RLM graph convolutional neural network model by NCATS with 258 stable compounds and 184 unstable compounds.

Reason for choosing NCATS Rat Liver Microsomal Stability

Initially, I wasn't sure why I chose to work with NCATS Rat Liver Microsomal Stability, but now since I have worked with it, I have learned so many things including how GCNN model compared to RNN, DNN and RF models is more convenient when training data, testing data and validating prediction results of compounds with molecular structures and chemical properties : )

GemmaTuron commented 1 year ago

Hi @whoisorioki

Good work thanks for this! @pauline-banye I think Zakia also reported issues when trying to run .csv files in the models, can you have a look at the above and let me know your thoughts?

@whoisorioki let's start tackling week 3 tasks meanwhile :)

whoisorioki commented 1 year ago

Thank you @GemmaTuron. Week 3 here we go!

whoisorioki commented 1 year ago

WEEK 3

MODEL ONE:

Model Name Relational Deep Learning for Drug Pair Scoring

Model Description Drug pair scoring is a machine learning task that involves a set of drugs and the task of predicting the behavior of drug pairs. It is used for predicting the effectiveness and safety of drug combinations. The drug pairs can be between two entities that may share similar functions or metabolic pathways. Various combinations exist, such as protein-protein, compound-protein, miRNA-mRNA, and chemical-chemical interactions (CCI). ChemicalX is a deep learning library for drug-drug interaction, polypharmacy side effect, and synergy prediction. It uses traditional SMILES as input and predicts the output as a probability score. It uses PyKEEN which is a Python package designed to train and evaluate knowledge graph embedding models. Some of the model architectures used in ChemicalX include DeepSynergy, DeepDDI, CASTER, MatchMaker, GCN-BMP among others.

Relevance to Ersilia ChemicalX can be used in Ersilia to monitor chemical interaction between drug molecules, therefore enabling one to predict potential side effects as well as identify any unintended chemical reactions that might take place. Drug Pair Scoring is used for applications such as Antibiotic Evolutionary Pressure, Reducing Toxicity, Encoding Molecular Geometry among others which are essential in Drug Discovery.

Implementation Code for ChemicalX is readily made for implementation. It has implementation examples of the various model architectures. It also has underlying data for training and testing the models.

Slug ChemicalX

Tag chemistry

GitHub Repository chemicalx

Dependencies PyTorch 1.10.0

Publication A Unified View of Relational Deep Learning for Drug Pair Scoring

License Apache License 2.0

GemmaTuron commented 1 year ago

Hi @whoisorioki ,

We are looking forward to your contributions for week 3 ;)

whoisorioki commented 1 year ago

Hi @whoisorioki ,

We are looking forward to your contributions for week 3 ;)

Hey @GemmaTuron, I am currently researching and reviewing some of the models I picked up in my literature search. Hope to drop them soon!

paulinebanye commented 1 year ago

Hey @GemmaTuron here are the updates:

Predictions for the EML

I ran the app.py and used the pretrained RLM graph convolutional neural network model to do predictions for the EML. It contained 442 columns of drugs with their smiles and can smiles. The model predicts their metabolic stability which can either be stable (0) or unstable (1). The in vitro half-life (t1/2) approach was the one used to determine the metabolic stability. It assesses the rate of substrate-depletion while measuring the time required for half of the substrate to deplete under controlled laboratory conditions. The stable compounds had t1/2 > 30 min while the unstable had t1/2 <= 30 min. In the EML, the number of stable compounds were 258/442 and the number of unstable compounds were the remaining 184/442. The model also shows the probability score (between 1 and 0) that represents the estimated likelihood. Here are the predictions in CSV format: ADME_Predictions_2023-03-17-204320.csv ADME_Predictions_2023-03-18-144728.csv ADME_Predictions_2023-03-18-144814.csv

Loading RLM graph convolutional neural network model
Loading pretrained parameter "encoder.encoder.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.W_i.weight".
Loading pretrained parameter "encoder.encoder.W_h.weight".
Loading pretrained parameter "encoder.encoder.W_o.weight".
Loading pretrained parameter "encoder.encoder.W_o.bias".
Loading pretrained parameter "ffn.1.weight".
Loading pretrained parameter "ffn.1.bias".
Loading pretrained parameter "ffn.4.weight".
Loading pretrained parameter "ffn.4.bias".
Finished loading RLM model files
Loading PAMPA graph convolutional neural network model
Model File Exists Locally
Loading pretrained parameter "encoder.encoder.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.W_i.weight".
Loading pretrained parameter "encoder.encoder.W_h.weight".
Loading pretrained parameter "encoder.encoder.W_o.weight".
Loading pretrained parameter "encoder.encoder.W_o.bias".
Loading pretrained parameter "ffn.1.weight".
Loading pretrained parameter "ffn.1.bias".
Loading pretrained parameter "ffn.4.weight".
Loading pretrained parameter "ffn.4.bias".
Finished loading PAMPA 7.4 models
Loading PAMPA graph convolutional neural network model
Loading pretrained parameter "encoder.encoder.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.W_i.weight".
Loading pretrained parameter "encoder.encoder.W_h.weight".
Loading pretrained parameter "encoder.encoder.W_o.weight".
Loading pretrained parameter "encoder.encoder.W_o.bias".
Loading pretrained parameter "ffn.1.weight".
Loading pretrained parameter "ffn.1.bias".
Loading pretrained parameter "ffn.4.weight".
Loading pretrained parameter "ffn.4.bias".
Finished loading PAMPA 5.0 models
Loading Solubility graph convolutional neural network model
Model File Exists Locally
Loading pretrained parameter "encoder.encoder.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.W_i.weight".
Loading pretrained parameter "encoder.encoder.W_h.weight".
Loading pretrained parameter "encoder.encoder.W_o.weight".
Loading pretrained parameter "encoder.encoder.W_o.bias".
Loading pretrained parameter "ffn.1.weight".
Loading pretrained parameter "ffn.1.bias".
Loading pretrained parameter "ffn.4.weight".
Loading pretrained parameter "ffn.4.bias".
Finished loading Solubility models
Loading human liver cytosol stability random forest models
100%|█████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 38.96it/s] 
Finished loading human liver cytosol stability models
Loading CYP450 random forest models
100%|███████████████████████████████████████████████████████████████| 64/64 [00:03<00:00, 16.39it/s] 
100%|███████████████████████████████████████████████████████████████| 64/64 [00:05<00:00, 10.80it/s] 
100%|███████████████████████████████████████████████████████████████| 64/64 [00:04<00:00, 13.06it/s] 
100%|███████████████████████████████████████████████████████████████| 64/64 [00:09<00:00,  6.86it/s] 
100%|███████████████████████████████████████████████████████████████| 64/64 [00:04<00:00, 13.71it/s] 
100%|███████████████████████████████████████████████████████████████| 64/64 [00:10<00:00,  5.98it/s] 
100%|█████████████████████████████████████████████████████████████████| 6/6 [00:39<00:00,  6.58s/it] 
Finished loading CYP450 model files
 * Serving Flask app 'app'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://192.168.0.103:5000
Press CTRL+C to quit

Comparison with the Ersilia models

I cloned the Ersilia eos5505 model for it to be available on my machine. I located the main file in eos5505\model\framework\code and ran it. I had problems reading the input file from the terminal as an argument, so I went ahead and commented on the input_file and output_file then replaced the input file with the location to the EML file in the system and also the output file with the file that I wanted the predictions to be written to.

# input_file = sys.argv[1]
# output_file = sys.argv[2]

# read SMILES from .csv file, assuming one column with header
with open('C:\\Users\\Orioki\\Desktop\\Ersilia\\Models\\eos5505\\model\\framework\\code\\eml_canonical.csv', "r") as f:
    reader = csv.reader(f)
    next(reader) # skip header
    smiles_list = [r[1] for r in reader]

# write output in a .csv file
with open('C:\\Users\\Orioki\\Desktop\\Ersilia\\Models\\eos5505\\model\\framework\\code\\prediction.csv', "w") as f:
    writer = csv.writer(f)
    writer.writerow(["value"]) # header
    for o in outputs:
        writer.writerow([o])

I also had to change smiles_list = [r[0] for r in reader] to smiles_list = [r[1] for r in reader] since the smiles are in column 1 in the EML. Without that you get this error: ValueError: Please provide a list of kekule smiles

After the successful running, you get this output:

<predictors.rlm.rlm_predictor.RLMPredictior object at 0x00000248BEC56550>
100%|███████████████████████████████████████████| 442/442 [00:03<00:00, 144.91it/s]
RLM: 3.3348450660705566 seconds to predict 442 molecules
    Predicted Class (Probability) Prediction
0                        0 (0.95)     stable
1                        1 (0.71)   unstable
2                         0 (1.0)     stable
3                         0 (1.0)     stable
4                         0 (1.0)     stable
..                            ...        ...
437                      1 (0.53)   unstable
438                       1 (1.0)   unstable
439                       0 (1.0)     stable
440                       0 (1.0)     stable
441                       0 (1.0)     stable

[442 rows x 2 columns]
                                                smiles  ... Prediction
0        Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1  ...     stable
1    C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(...  ...   unstable
2                         CC(=O)Nc1sc(nn1)[S](N)(=O)=O  ...     stable
3                                              CC(O)=O  ...     stable
4                              CC(=O)N[C@@H](CS)C(O)=O  ...     stable
..                                                 ...  ...        ...
437             CC(=O)CC(c1ccccc1)C2=C(O)Oc3ccccc3C2=O  ...   unstable
438                    Cc1cc(cc(C)c1CC2=NCCN2)C(C)(C)C  ...   unstable
439  CC1=CN([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)C...  ...     stable
440                         [Zn++].[O-][S]([O-])(=O)=O  ...     stable
441             O.OC(Cn1ccnc1)([P](O)(O)=O)[P](O)(O)=O  ...     stable

[442 rows x 5 columns]
    Predicted Class (Probability) Prediction
0                        0 (0.95)     stable
1                        1 (0.71)   unstable
2                         0 (1.0)     stable
3                         0 (1.0)     stable
4                         0 (1.0)     stable
..                            ...        ...
437                      1 (0.53)   unstable
438                       1 (1.0)   unstable
439                       0 (1.0)     stable
440                       0 (1.0)     stable
441                       0 (1.0)     stable

[442 rows x 2 columns]

Compared to head (5) and the tail (5) of the RLM graph convolutional neural network model by NCATS, the predictions are exactly the same. The problem encountered was that the file that the prediction was written to is not exactly readable as a csv file. Here is the file: prediction.csv.

Later on, I transformed the output to a pandas data frame and modified the code to write the predictions to a csv file as required:

# transform output to pandas dataframe
df = pd.DataFrame(output_df)
print(df.columns)

# write output in a .csv file
with open('C:\\Users\\Orioki\\Desktop\\Ersilia\\Models\\eos5505\\model\\framework\\code\\prediction.csv', "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['smiles', 'Predicted Class (Probability)', 'Prediction']) # columns
    for row in df.itertuples(index=False):
        writer.writerow(row)

Here is the csv file: prediction.csv

It has the same predictions as the predictions of RLM graph convolutional neural network model by NCATS with 258 stable compounds and 184 unstable compounds.

Reason for choosing NCATS Rat Liver Microsomal Stability

Initially, I wasn't sure why I chose to work with NCATS Rat Liver Microsomal Stability, but now since I have worked with it, I have learned so many things including how GCNN model compared to RNN, DNN and RF models is more convenient when training data, testing data and validating prediction results of compounds with molecular structures and chemical properties : )

Great job debugging @whoisorioki, I am not quite sure why there are issues with csv file but I would take a closer look at this model codebase.

whoisorioki commented 1 year ago

Hey @GemmaTuron here are the updates:

Predictions for the EML

I ran the app.py and used the pretrained RLM graph convolutional neural network model to do predictions for the EML. It contained 442 columns of drugs with their smiles and can smiles. The model predicts their metabolic stability which can either be stable (0) or unstable (1). The in vitro half-life (t1/2) approach was the one used to determine the metabolic stability. It assesses the rate of substrate-depletion while measuring the time required for half of the substrate to deplete under controlled laboratory conditions. The stable compounds had t1/2 > 30 min while the unstable had t1/2 <= 30 min. In the EML, the number of stable compounds were 258/442 and the number of unstable compounds were the remaining 184/442. The model also shows the probability score (between 1 and 0) that represents the estimated likelihood. Here are the predictions in CSV format: ADME_Predictions_2023-03-17-204320.csv ADME_Predictions_2023-03-18-144728.csv ADME_Predictions_2023-03-18-144814.csv

Loading RLM graph convolutional neural network model
Loading pretrained parameter "encoder.encoder.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.W_i.weight".
Loading pretrained parameter "encoder.encoder.W_h.weight".
Loading pretrained parameter "encoder.encoder.W_o.weight".
Loading pretrained parameter "encoder.encoder.W_o.bias".
Loading pretrained parameter "ffn.1.weight".
Loading pretrained parameter "ffn.1.bias".
Loading pretrained parameter "ffn.4.weight".
Loading pretrained parameter "ffn.4.bias".
Finished loading RLM model files
Loading PAMPA graph convolutional neural network model
Model File Exists Locally
Loading pretrained parameter "encoder.encoder.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.W_i.weight".
Loading pretrained parameter "encoder.encoder.W_h.weight".
Loading pretrained parameter "encoder.encoder.W_o.weight".
Loading pretrained parameter "encoder.encoder.W_o.bias".
Loading pretrained parameter "ffn.1.weight".
Loading pretrained parameter "ffn.1.bias".
Loading pretrained parameter "ffn.4.weight".
Loading pretrained parameter "ffn.4.bias".
Finished loading PAMPA 7.4 models
Loading PAMPA graph convolutional neural network model
Loading pretrained parameter "encoder.encoder.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.W_i.weight".
Loading pretrained parameter "encoder.encoder.W_h.weight".
Loading pretrained parameter "encoder.encoder.W_o.weight".
Loading pretrained parameter "encoder.encoder.W_o.bias".
Loading pretrained parameter "ffn.1.weight".
Loading pretrained parameter "ffn.1.bias".
Loading pretrained parameter "ffn.4.weight".
Loading pretrained parameter "ffn.4.bias".
Finished loading PAMPA 5.0 models
Loading Solubility graph convolutional neural network model
Model File Exists Locally
Loading pretrained parameter "encoder.encoder.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.W_i.weight".
Loading pretrained parameter "encoder.encoder.W_h.weight".
Loading pretrained parameter "encoder.encoder.W_o.weight".
Loading pretrained parameter "encoder.encoder.W_o.bias".
Loading pretrained parameter "ffn.1.weight".
Loading pretrained parameter "ffn.1.bias".
Loading pretrained parameter "ffn.4.weight".
Loading pretrained parameter "ffn.4.bias".
Finished loading Solubility models
Loading human liver cytosol stability random forest models
100%|█████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 38.96it/s] 
Finished loading human liver cytosol stability models
Loading CYP450 random forest models
100%|███████████████████████████████████████████████████████████████| 64/64 [00:03<00:00, 16.39it/s] 
100%|███████████████████████████████████████████████████████████████| 64/64 [00:05<00:00, 10.80it/s] 
100%|███████████████████████████████████████████████████████████████| 64/64 [00:04<00:00, 13.06it/s] 
100%|███████████████████████████████████████████████████████████████| 64/64 [00:09<00:00,  6.86it/s] 
100%|███████████████████████████████████████████████████████████████| 64/64 [00:04<00:00, 13.71it/s] 
100%|███████████████████████████████████████████████████████████████| 64/64 [00:10<00:00,  5.98it/s] 
100%|█████████████████████████████████████████████████████████████████| 6/6 [00:39<00:00,  6.58s/it] 
Finished loading CYP450 model files
 * Serving Flask app 'app'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://192.168.0.103:5000
Press CTRL+C to quit

Comparison with the Ersilia models

I cloned the Ersilia eos5505 model for it to be available on my machine. I located the main file in eos5505\model\framework\code and ran it. I had problems reading the input file from the terminal as an argument, so I went ahead and commented on the input_file and output_file then replaced the input file with the location to the EML file in the system and also the output file with the file that I wanted the predictions to be written to.

# input_file = sys.argv[1]
# output_file = sys.argv[2]

# read SMILES from .csv file, assuming one column with header
with open('C:\\Users\\Orioki\\Desktop\\Ersilia\\Models\\eos5505\\model\\framework\\code\\eml_canonical.csv', "r") as f:
    reader = csv.reader(f)
    next(reader) # skip header
    smiles_list = [r[1] for r in reader]

# write output in a .csv file
with open('C:\\Users\\Orioki\\Desktop\\Ersilia\\Models\\eos5505\\model\\framework\\code\\prediction.csv', "w") as f:
    writer = csv.writer(f)
    writer.writerow(["value"]) # header
    for o in outputs:
        writer.writerow([o])

I also had to change smiles_list = [r[0] for r in reader] to smiles_list = [r[1] for r in reader] since the smiles are in column 1 in the EML. Without that you get this error: ValueError: Please provide a list of kekule smiles After the successful running, you get this output:

<predictors.rlm.rlm_predictor.RLMPredictior object at 0x00000248BEC56550>
100%|███████████████████████████████████████████| 442/442 [00:03<00:00, 144.91it/s]
RLM: 3.3348450660705566 seconds to predict 442 molecules
    Predicted Class (Probability) Prediction
0                        0 (0.95)     stable
1                        1 (0.71)   unstable
2                         0 (1.0)     stable
3                         0 (1.0)     stable
4                         0 (1.0)     stable
..                            ...        ...
437                      1 (0.53)   unstable
438                       1 (1.0)   unstable
439                       0 (1.0)     stable
440                       0 (1.0)     stable
441                       0 (1.0)     stable

[442 rows x 2 columns]
                                                smiles  ... Prediction
0        Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1  ...     stable
1    C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(...  ...   unstable
2                         CC(=O)Nc1sc(nn1)[S](N)(=O)=O  ...     stable
3                                              CC(O)=O  ...     stable
4                              CC(=O)N[C@@H](CS)C(O)=O  ...     stable
..                                                 ...  ...        ...
437             CC(=O)CC(c1ccccc1)C2=C(O)Oc3ccccc3C2=O  ...   unstable
438                    Cc1cc(cc(C)c1CC2=NCCN2)C(C)(C)C  ...   unstable
439  CC1=CN([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)C...  ...     stable
440                         [Zn++].[O-][S]([O-])(=O)=O  ...     stable
441             O.OC(Cn1ccnc1)([P](O)(O)=O)[P](O)(O)=O  ...     stable

[442 rows x 5 columns]
    Predicted Class (Probability) Prediction
0                        0 (0.95)     stable
1                        1 (0.71)   unstable
2                         0 (1.0)     stable
3                         0 (1.0)     stable
4                         0 (1.0)     stable
..                            ...        ...
437                      1 (0.53)   unstable
438                       1 (1.0)   unstable
439                       0 (1.0)     stable
440                       0 (1.0)     stable
441                       0 (1.0)     stable

[442 rows x 2 columns]

Compared to head (5) and the tail (5) of the RLM graph convolutional neural network model by NCATS, the predictions are exactly the same. The problem encountered was that the file that the prediction was written to is not exactly readable as a csv file. Here is the file: prediction.csv. Later on, I transformed the output to a pandas data frame and modified the code to write the predictions to a csv file as required:

# transform output to pandas dataframe
df = pd.DataFrame(output_df)
print(df.columns)

# write output in a .csv file
with open('C:\\Users\\Orioki\\Desktop\\Ersilia\\Models\\eos5505\\model\\framework\\code\\prediction.csv', "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['smiles', 'Predicted Class (Probability)', 'Prediction']) # columns
    for row in df.itertuples(index=False):
        writer.writerow(row)

Here is the csv file: prediction.csv It has the same predictions as the predictions of RLM graph convolutional neural network model by NCATS with 258 stable compounds and 184 unstable compounds.

Reason for choosing NCATS Rat Liver Microsomal Stability

Initially, I wasn't sure why I chose to work with NCATS Rat Liver Microsomal Stability, but now since I have worked with it, I have learned so many things including how GCNN model compared to RNN, DNN and RF models is more convenient when training data, testing data and validating prediction results of compounds with molecular structures and chemical properties : )

Great job debugging @whoisorioki, I am not quite sure why there are issues with csv file but I would take a closer look at this model codebase.

Thank you @pauline-banye. While I was debugging, I really didn't understand this part of the code:

outputs = []
for x in list(output_df[OUTPUT_COLUMN_NAME]):
    c = int(x.split(" ")[0])
    p = float(x.split("(")[1].split(")")[0])
    if c == 1:
        outputs += [p]
    else:
        outputs += [1-p]

It's its output that is not quite understood in the csv file. When you comment that part, the model runs with no errors, and you get the correct output. You will just have to change the code so that output_df is outputted instead of outputs.

whoisorioki commented 1 year ago

MODEL TWO:

Model Name PaccMann: Anticancer Compound Sensitivity Prediction via Multimodal Attention-Based Convolutional Encoders

Model Description PaccMann is a is a package for drug sensitivity prediction that predicts the IC50 sensitivity value of a drug component. The model architecture used take traditional SMILES as input which was the main comparison in the research with a baseline model that takes molecular descriptors (fingerprints) i.e., Morgan fingerprints as input which shown to be a highly informative representation for many chemical prediction tasks in the recent years. Some of the architectures used were Stacked Convolutional Encoder (SCNN), Multiscale Convolutional Attention (MCA), Contextual-Attention (CA).

Relevance to Ersilia

Implementation

Slug PaccMann

Tag drug sensitivity anticancer

GitHub Repository paccmann

Packages

numpy>=1.14.3
tensor2tensor>=1.10.0
tensorflow>=1.15.2,<2.0.0
plac>=0.9.6
setuptools>=41.0.0
pandas>=0.24.1
scikit-learn==0.22.1

Publications

License MIT License

GemmaTuron commented 1 year ago

Hi @whoisorioki

The piece of code you are highlighting converts the output of the model, which, if you recall, is the probability of 0 or the probability of 1, to ALWAYS the probability of 1 (they are complementary, hence

    if c == 1:
        outputs += [p]
    else:
        outputs += [1-p]

We do this systematically so that model outputs are consistent from one model to another, but it shouldn't be crashing. @pauline-banye when you have time can you please check why this is not working?

GemmaTuron commented 1 year ago

For the PaccMann model: It is an interesting model but out of scope for us at this current moment, since it is focused on cancer - as you can see in the description they combine chemistry properties with genomic data on cancer cell lines - this kind of studies are broadly available for cancer and other non communicable diseases but unfortunately not so much data is yet available for infectious diseases

whoisorioki commented 1 year ago

For the PaccMann model: It is an interesting model but out of scope for us at this current moment, since it is focused on cancer - as you can see in the description they combine chemistry properties with genomic data on cancer cell lines - this kind of studies are broadly available for cancer and other non communicable diseases but unfortunately not so much data is yet available for infectious diseases

Hi @GemmaTuron , caught me while I was editing! Thank you for the information. So does that mean availability of data is a key area to look at when searching for models?

GemmaTuron commented 1 year ago

Hi @whoisorioki !

The models should be either pretrained (ideal) or with the data available to train them. In this case, the model is pretrained which is great, I was just pointing out that these kind of studies are not yet widely available for communicable diseases - hopefully now that -omics technologies are cheaper we see more of these studies. I'd suggest to focus on communicable diseases for the rest of the models, or ADMET properties ;)

whoisorioki commented 1 year ago

Hi @whoisorioki !

The models should be either pretrained (ideal) or with the data available to train them. In this case, the model is pretrained which is great, I was just pointing out that these kind of studies are not yet widely available for communicable diseases - hopefully now that -omics technologies are cheaper we see more of these studies. I'd suggest to focus on communicable diseases for the rest of the models, or ADMET properties ;)

Okay thank you @GemmaTuron. I will look into that : )

whoisorioki commented 1 year ago

WEEK 3

MODEL ONE:

Model Name Relational Deep Learning for Drug Pair Scoring

Model Description Drug pair scoring is a machine learning task that involves a set of drugs and the task of predicting the behavior of drug pairs. It is used for predicting the effectiveness and safety of drug combinations. The drug pairs can be between two entities that may share similar functions or metabolic pathways. Various combinations exist, such as protein-protein, compound-protein, miRNA-mRNA, and chemical-chemical interactions (CCI). ChemicalX is a deep learning library for drug-drug interaction, polypharmacy side effect, and synergy prediction. It uses traditional SMILES as input and predicts the output as a probability score. It uses PyKEEN which is a Python package designed to train and evaluate knowledge graph embedding models. Some of the model architectures used in ChemicalX include DeepSynergy, DeepDDI, CASTER, MatchMaker, GCN-BMP among others.

Relevance to Ersilia ChemicalX can be used in Ersilia to monitor chemical interaction between drug molecules, therefore enabling one to predict potential side effects as well as identify any unintended chemical reactions that might take place. Drug Pair Scoring is used for applications such as Antibiotic Evolutionary Pressure, Reducing Toxicity, Encoding Molecular Geometry among others which are essential in Drug Discovery.

Implementation Code for ChemicalX is readily made for implementation. It has implementation examples of the various model architectures. It also has underlying data for training and testing the models.

Slug ChemicalX

Tag chemistry

GitHub Repository chemicalx

Dependencies PyTorch 1.10.0

Publication A Unified View of Relational Deep Learning for Drug Pair Scoring

License Apache License 2.0

Here is my first model @GemmaTuron incase you did not see it!

GemmaTuron commented 1 year ago

Hi @whoisorioki !

I missed that one, great find! Please add it in our model list! And while you look for the third one, can you also try to implement chemicalX see if it is easy and ready?

whoisorioki commented 1 year ago

MODEL THREE:

Model Name Accurate ADMET Prediction with XGBoost

Model Description Extreme gradient boosting (XGBoost) is a is a powerful machine learning model and has been shown to be effective in regression and classification tasks in biology and chemistry. It predicts accurate ADMET properties through machine learning of molecule features ranging from fingerprints to descriptors. XGBoost does prediction for 22 tasks in ADMET properties. All these tasks ranked first in 11 tasks with all tasks ranked in top 5 in the TDC (Therapeutics Data Commons) benchmark. This shows XGBoost to be effective and certainly accurate in ADME properties. XGBoost boosts model performance through ensemble that includes decision tree models trained in sequence.

Relevance to Ersilia XGBoost is relevant to Ersilia since it does prediction on properties such as absorption, distribution, metabolism, excretion, and toxicity are important in small molecule drug discovery and therapeutics. It has been trained a number of and has shown to be effective. It has a web server here which takes SMILES as input and outputs the predictions in various units depending on the ADMET property.

Implementation XGBoost repository contains code that is ready for use and also a README file that has indicated on how it can be implemented.

Slug ADMET_XGBoost

Tag ADMET Properties

GitHub Repository ADMET_XGBoost

Publication Accurate ADMET Prediction with XGBoost

License GNU General Public License v3.0

GemmaTuron commented 1 year ago

Hi @whoisorioki !

Good suggestion, someone else already added this one to the list though! Can you let me know when you test the ChemicalX? thanks

whoisorioki commented 1 year ago

Hi @whoisorioki !

Good suggestion, someone else already added this one to the list though! Can you let me know when you test the ChemicalX? thanks

Hey @GemmaTuron, thanks. I have had issues setting its environment, but I am working my way through it!

whoisorioki commented 1 year ago

Hey @GemmaTuron, I did the tests on ChemicalX. Here are the findings:

I tested all the source code examples provided at the repository. Majority of the examples test different models using dataset from DrugComDB which is a comprehensive database dedicated to integrating drug combinations from various data sources.
The task related to the dataset is to predict the synergistic nature of drug pair combinations which loaded by the DrugComDB DatasetLoader.
ChemicalX uses a pipeline that summarize the results in a roc_auc score which ranges from 0 to 1, with a score of 0.5 indicating a model with a combination of drugs that has a lower synergistic effect, and a score of 1 indicating a perfect model with a combination of drugs that has a greater synergistic effect.

Here are the results of the examples that tested using the dataset:


(chemicalX) whoisorioki@whoisorioki:~/Desktop/Ersilia/Models/chemicalx/examples$ python deepdrug_example.py
22:42:27   No cuda devices were available. CPU will be used.
100%|██████████████████████████████████████████████████████████████████████████| 20/20 [54:32<00:00, 163.64s/it]
Metric       Value
--------  --------
roc_auc   0.741245

(chemicalX) whoisorioki@whoisorioki:~/Desktop/Ersilia/Models/chemicalx/examples$ python deepsynergy_example.py 22:45:13 No cuda devices were available. CPU will be used. 100%	█████████████████████████████████████████████████████████████████████████	100/100 [06:05<00:00, 3.66s/it] Metric Value

roc_auc 0.840477

(chemicalX) whoisorioki@whoisorioki:~/Desktop/Ersilia/Models/chemicalx/examples$ python deepddi_example.py 22:42:19 No cuda devices were available. CPU will be used. 100%	██████████████████████████████████████████████████████████████████████	100/100 [5:24:41<00:00, 194.82s/it] Metric Value

roc_auc 0.918528

(chemicalX) whoisorioki@whoisorioki:~/Desktop/Ersilia/Models/chemicalx/examples$ python epgcnds_example.py 04:16:35 No cuda devices were available. CPU will be used. 100%	███████████████████████████████████████████████████████████████████████████	20/20 [08:58<00:00, 26.92s/it] Metric Value

roc_auc 0.714368

(chemicalX) whoisorioki@whoisorioki:~/Desktop/Ersilia/Models/chemicalx/examples$ python matchmaker_example.py 05:00:18 No cuda devices were available. CPU will be used. 100%	█████████████████████████████████████████████████████████████████████████	100/100 [03:44<00:00, 2.25s/it] Metric Value

roc_auc 0.795124

(chemicalX) whoisorioki@whoisorioki:~/Desktop/Ersilia/Models/chemicalx/examples$ python gcnbmp_example.py 04:28:05 No cuda devices were available. CPU will be used. 100%	███████████████████████████████████████████████████████████████████████	100/100 [1:37:52<00:00, 58.72s/it] Metric Value

roc_auc 0.724965

(chemicalX) whoisorioki@whoisorioki:~/Desktop/Ersilia/Models/chemicalx/examples$ python mrgnn_example.py 08:42:40 No cuda devices were available. CPU will be used. 100%	████████████████████████████████████████████████████████████████████████████	1/1 [06:15<00:00, 375.07s/it] Metric Value

roc_auc 0.660053

(chemicalX) whoisorioki@whoisorioki:~/Desktop/Ersilia/Models/chemicalx/examples$ python ssiddi_example.py 08:45:15 No cuda devices were available. CPU will be used. 100%	██████████████████████████████████████████████████████████████████████████	20/20 [45:42<00:00, 137.12s/it] Metric Value

roc_auc 0.774424

whoisorioki commented 1 year ago

Hey @GemmaTuron, I went ahead and searched for another model : )

Model Name Predicting Antimicrobial Activity of Conjugated Oligoelectrolyte Molecules via Machine Learning

Model Description This model presents a framework for establishing antibiotic property predictions. It consists of four components: (1) molecular representation, (2) feature down-selection, (3) ML algorithm selection, and (4) molecular descriptor importance analysis. This framework is applied to to a set of 136 conjugated oligoelectrolyte molecules (COEs), using an automated molecular descriptor down selection process that is agnostic to the molecule domain. Findings from the research done is that the resulting fingerprint consisted of 21 molecular descriptors, over 40% of which are related to the three-dimensional shape of the molecules.

Relevance to Ersilia It is relevant since it aids in development of antibiotics even in novel domains, namely with families of understudied candidates, where sparse information exists regarding the underlying mechanism and there is a limited availability of experimental data. COEs have properties such as high solubility, tunable electronic properties, and ability to interact with biological systems which makes them promising in Drug discovery.

Implementation MLforCOE is pre-trained, it requires testing from the GitHub source code which seems complete.

Slug Name MLforCOE

Tag antimicrobial activity

GitHub repository MLforCOE

Publication Predicting antimicrobial activity of conjugated oligoelectrolyte molecules via machine learning

License BSD 2-Clause "Simplified" License

GemmaTuron commented 1 year ago

Hi @whoisorioki !

Very good job:

For ChemicalX, can you share one of the inputs as example? I think this model will be fairly easy to incorporate
For MLforCOE: excellent finding, would you add it to the model list and try to run it as well?

Meanwhile, also start working on your final application! Thanks

whoisorioki commented 1 year ago

Hi @whoisorioki !

Very good job:

For ChemicalX, can you share one of the inputs as example? I think this model will be fairly easy to incorporate

For MLforCOE: excellent finding, would you add it to the model list and try to run it as well?

Meanwhile, also start working on your final application! Thanks

Okay thank you!!

whoisorioki commented 1 year ago

Hey @GemmaTuron, ChemicalX in the example takes 3 inputs; two json files and one csv file. The two json files contain the drug and context properties/features. The csv file contains two drugs, the context and a label which is either 1(positive association) or 0 (negative assosiation). The two drugs are represented by their DrugCombDB IDs. labeled_triples.csv Here is the example:

drug_1,drug_2,context,label
3385,11960529,A2058,1
3385,24856436,A2058,1
3385,11977753,A2058,1
3385,387447,A2058,0
3385,3062316,A2058,1
3385,46926350,A2058,1
3385,176870,A2058,1
3385,5288382,A2058,1
3385,216453,A2058,1
3385,208908,A2058,1
3385,24964624,A2058,1
3385,24958200,A2058,1
3385,24748204,A2058,1
3385,11520894,A2058,1
3385,46239015,A2058,1
3385,9826528,A2058,1
3385,216239,A2058,1
3385,5329102,A2058,1
3385,5394,A2058,1
3385,5311,A2058,1

I am currently installing the MLforCOE, when its done, I will inform you!

GemmaTuron commented 1 year ago

Thanks @whoisorioki , indeed the input is quite complex for ChemicalX so we will need some work to be able to incorporate it in the Hub, but it is very interesting so we'll work on it soon

whoisorioki commented 1 year ago

Okay thank you @GemmaTuron!

whoisorioki commented 1 year ago

Hey @GemmaTuron, I managed to implement the MLforCOE in my local machine.

I ran the Main_downselection.py which generates data files in fingeprints, SMILES, SELFIES that will be used in different training models.
I then ran the Main_training_models.py which trains diferent models (RF, XGB, GP) comparing with different inputs (fingerprints, SMILES, SELFIES).
Performance is evaluated by R-squared and RMSE metrics.
The RMSE and R-squared values are reported for two different datasets which had been split in the first process (Main_downselection.py).
Here are the examples of the outputs;
```
XGB and Smiles:
```

R2 and RMSE for dataset 0 : [ 0.3929155 -0.03083802 0.32813973 0.28257756 0.20804465 0.2286269 0.54612148 0.07170333 0.40727222 0.30483451 0.46117561 0.57184541 0.2104038 0.3499984 0.48319175 -0.42951909 0.41850305 0.69192428 0.23175764 0.38234122] [1.71411613 2.14535095 1.90032865 1.91676487 1.80400708 1.8623012 1.522838 1.98261348 1.49088201 1.7439464 1.50700386 1.36802918 1.77631653 1.84110206 1.21146798 2.36187952 1.65922411 1.10138465 1.98349215 1.82206366] Mean: 0.3055509963049215 1.7357556232899953 Std: 0.23575292267469147 0.2965946784122653 Min: -0.42951909064381155 1.1013846497008384 Max: 0.6919242802294252 2.361879524632779 Test set RMSE= 1.8599130352670108 and R2= -0.014450586145497102 Exp. validation set RMSE= 1.9894047332969895 and R2= -0.4691577912905971

XGB and Init:

R2 and RMSE for dataset 0 : [ 0.27183543 0.58154211 0.07897737 0.24880151 0.28826405 -0.01808532 0.37391035 0.27603054 0.00395453 0.40428951 0.29778918 0.56942526 0.10850721 0.13897994 0.38536966 0.18317177 0.3038131 0.31763374 0.18802074 0.39681417] [1.58210799 1.33786705 1.97483042 1.6748901 1.71344997 2.31191434 1.82844767 1.75087355 1.99506669 1.56683718 1.6987194 1.37189015 1.79671338 1.85491694 1.35789949 1.78536957 1.76921496 1.644833 1.79426054 1.59815988] Mean: 0.26995224270610024 1.720413114489863 Std: 0.15775187516926095 0.22403041977204155 Min: -0.018085320298312002 1.3378670478307844 Max: 0.5815421106448464 2.311914343475151 Test set RMSE= 1.4323199414610988 and R2= 0.39837524495392207 Exp. validation set RMSE= 1.2254522706563338 and R2= 0.44253840821832224

XGB and Var:

R2 and RMSE for dataset 0 : [ 0.24134226 0.46044944 0.13411055 0.18066168 0.17860309 0.08738952 0.4792623 0.11754138 -0.33349481 -0.12996165 0.13513386 0.41157575 0.22200392 0.2772839 -0.58866206 0.15696596 0.16325922 0.41574489 0.13988815 0.21637358] [1.61489504 1.51915872 1.91481086 1.74920454 1.8407233 2.18888214 1.6675305 1.93304442 2.30841293 2.15793371 1.88522129 1.60376356 1.6784507 1.69942381 2.18311388 1.81378296 1.93960434 1.52199875 1.84667515 1.82158396] Mean: 0.14827354553883662 1.8444107282013746 Std: 0.24931998620075652 0.22149875838384755 Min: -0.5886620640051075 1.5191587192447882 Max: 0.4792622966143745 2.308412933931831 Test set RMSE= 1.5838679589159732 and R2= 0.2643291169264953 Exp. validation set RMSE= 1.7965911413108724 and R2= -0.19817611153635184

XGB and Cor:

R2 and RMSE for dataset 0 : [ 0.57676706 0.6284889 -0.00596234 0.10623145 0.31853559 0.13817725 0.57894451 0.25870081 0.19840754 -0.36013642 0.53343738 0.4872574 0.63366298 0.34805562 -0.07098882 0.13343198 0.01253893 0.28866441 0.35576831 0.50598183] [1.20617711 1.26058744 2.06388512 1.82692828 1.67661584 2.12710354 1.49945674 1.77170504 1.78975938 2.36754085 1.38465907 1.49707914 1.15175548 1.61407276 1.79247495 1.83892531 2.10706373 1.67938515 1.59821087 1.44632576] Mean: 0.2832982180199428 1.6849855788224466 Std: 0.26068269986967973 0.3146594690344365 Min: -0.3601364200688273 1.1517554846225941 Max: 0.6336629792149882 2.367540854957527 Test set RMSE= 1.6250941260423688 and R2= 0.22553345498609667 Exp. validation set RMSE= 1.7545364219705062 and R2= -0.14273867231085036

XGB and Opt:

R2 and RMSE for dataset 0 : [0.12703836 0.55984146 0.39755716 0.18474977 0.23237657 0.51053359 0.51355391 0.28713182 0.08026015 0.04418864 0.49405038 0.50133697 0.11646214 0.5038602 0.1869661 0.20284183 0.08452607 0.73392218 0.52041627 0.52363049] [1.73228336 1.37211858 1.59717529 1.74483526 1.77945122 1.60302796 1.61169045 1.73739783 1.91712458 1.98468844 1.44192118 1.47638168 1.78867924 1.4080558 1.56176237 1.76374182 2.0288067 1.02711098 1.37893965 1.42025601] Mean: 0.3402622035482893 1.6187724199418656 Std: 0.2006452960751286 0.2363667539273782 Min: 0.04418864002325196 1.0271109838946801 Max: 0.7339221821195103 2.028806702610671 Test set RMSE= 1.4732337090502798 and R2= 0.36351391158884316 Exp. validation set RMSE= 2.067194572202576 and R2= -0.586298307332165 Running GP takes longer because kernel optimization is integrated into the implementation.

GP and Morgan:

R2 and RMSE for dataset 0 : [0.32209632 0.0176418 0.21413084 0.04668786 0.51170018 0.42808418 0.55467937 0.5134696 0.43813664 0.21306176 0.57609016 0.64887179 0.46201346 0.54384108 0.4825351 0.32102978 0.49232703 0.56096549 0.10340975 0.35716274] [1.81133869 2.09429603 2.05524859 2.20952502 1.41654759 1.60355454 1.50841311 1.43532302 1.45154662 1.85549314 1.33668011 1.23887623 1.46623485 1.54233509 1.21223738 1.62775157 1.55032714 1.31479975 2.14278387 1.85883036] Mean: 0.390396746422285 1.6366071351840692 Std: 0.18042038474241184 0.30124428475146625 Min: 0.017641800321648304 1.2122373835464895 Max: 0.6488717948466534 2.209525023506047 Test set RMSE= 2.522576868338572 and R2= -0.86609796383485 Exp. validation set RMSE= 1.4847802072296323 and R2= 0.18163605359649515

GP and Morgan count:

R2 and RMSE for dataset 0 : [0.38024481 0.1216862 0.23860005 0.05526003 0.56388488 0.4254514 0.5057151 0.51617942 0.49451133 0.34054608 0.49688584 0.65232684 0.39812371 0.34192441 0.4805188 0.43528715 0.35871118 0.61843278 0.13712545 0.35654135] [1.73191175 1.98028632 2.02299895 2.19956859 1.33871602 1.60724122 1.58917806 1.43132029 1.37680155 1.69856142 1.45621 1.23276599 1.55085592 1.85250138 1.21459682 1.48448768 1.74244132 1.22573281 2.10210882 1.85972854] Mean: 0.39589784011734025 1.6349006718491677 Std: 0.15678395893398336 0.29120579403559715 Min: 0.055260027926978816 1.2145968234280393 Max: 0.652326836746002 2.1995685864183283 Test set RMSE= 2.230301690037068 and R2= -0.4587230582352493 Exp. validation set RMSE= 1.6012197902688854 and R2= 0.048247454388727795

GP and Rdkit:

R2 and RMSE for dataset 0 : [ 0.05805827 0.14696882 0.06460371 0.06836723 0.29625744 0.01228506 0.19149164 0.16396924 0.25903959 0.01694618 0.27208536 0.11938081 0.23204548 0.06366363 0.05855426 0.34014512 0.00762227 0.06107829 -0.00116787 0.21722707] [2.13514656 1.95157653 2.2422656 2.18425703 1.7005711 2.10733456 2.03247937 1.88150678 1.66691377 2.0738516 1.75158399 1.96195541 1.75180429 2.20971669 1.63510208 1.6046746 2.16755353 1.92275953 2.26430443 2.05119587] Mean: 0.13243108021985747 1.9648276659708976 Std: 0.10445258244332867 0.21046094104497956 Min: -0.0011678716433520808 1.6046745976109709 Max: 0.34014512068089053 2.2643044252426776 Test set RMSE= 1.8459383595403198 and R2= 0.0007365316063323579 Exp. validation set RMSE= 1.6495592095329805 and R2= -0.010085103803018969

GP and Selfies:

R2 and RMSE for dataset 0 : [-0.09483574 -0.00462827 -0.02177605 -0.10347526 -0.03709864 -0.04815327 -0.04393798 -0.09124932 -0.02694282 -0.02623985 -0.00213878 -0.02463137 -0.00815833 -0.01305073 -0.02083333 -0.00506757 -0.02394691 -0.02332189 -0.04895208 -0.03897601] [1.93997367 2.07295452 2.08004398 2.02997301 2.06834037 2.34580576 2.36103065 2.14959505 2.02577387 2.05650933 2.02932601 2.11630805 1.91066372 2.01202527 1.75 1.98043408 2.14563774 2.01427689 2.03934576 2.09747802] Mean: -0.03537071030747089 2.061274787114383 Std: 0.029053369560597755 0.13035156304800072 Min: -0.10347526265463536 1.7500000000000007 Max: -0.0021387832699619747 2.361030646800647 Test set RMSE= 1.8466185312615795 and R2= 3.892441924335799e-13 Exp. validation set RMSE= 1.6488095805389835 and R2= -0.009167262204392568

GP and Smiles:

R2 and RMSE for dataset 0 : [-1.80613663e-01 -1.14110822e-01 2.00905441e-02 1.17746298e-01 -1.74485545e-03 -5.76237517e-02 -2.38570370e-02 -1.99400402e-02 -1.56087409e-02 -1.90623604e-01 -8.67955935e-06 -1.52925756e-02 -1.67276560e-02 5.75788533e-03 -3.06080401e-02 -2.07971671e-01 -2.57467033e-02 -1.33789208e-03 -2.00663401e-02 -4.48903187e-03] [2.39039446 2.23032077 2.29499746 2.12558319 2.02892712 2.1806365 2.28719688 2.07817434 1.95154625 2.28231862 2.05301737 2.10664165 2.01567334 2.27701938 1.7107793 2.17115654 2.20369452 1.98564044 2.2855755 2.32360249] Mean: -0.03913881881712518 2.1491448061594793 Std: 0.07614727173779065 0.15804418245632998 Min: -0.20797167133390704 1.710779296511066 Max: 0.11774629848780349 2.39039445679728 Test set RMSE= 1.846634552800752 and R2= -1.7352374672263693e-05 Exp. validation set RMSE= 1.6495599966617294 and R2= -0.010086067778375618

GP and Init:

R2 and RMSE for dataset 0 : [0.14925405 0.66245184 0.3375792 0.15912747 0.54367339 0.14117016 0.59310866 0.05717029 0.55028602 0.52129818 0.57624708 0.53702942 0.26552682 0.43235519 0.31586686 0.38421405 0.39747306 0.74599692 0.47868053 0.3920857 ] [1.71009918 1.20158625 1.67479493 1.77204214 1.37198662 2.12340685 1.47402041 1.99807271 1.34055929 1.40455512 1.31960587 1.42256382 1.63082686 1.50610819 1.43261978 1.55016401 1.64590936 1.00353502 1.4376892 1.60441176] Mean: 0.4120297440205344 1.5312278687154781 Std: 0.18341271145908594 0.2501853672245012 Min: 0.05717028590443829 1.0035350175835136 Max: 0.7459969181594248 2.1234068483667383 Test set RMSE= 1.4269997431041446 and R2= 0.4028362853902363 Exp. validation set RMSE= 1.5888215436505226 and R2= 0.06292923499368619



- I am yet to run the other files, but an overview is the model is easy to implement but it requires a lot of time to implement and understand it.

GemmaTuron commented 1 year ago

Thanks @whoisorioki !

The steps you have taken are to reproduce the model training, right? Which is great to know we can re run their code. I see they state in the README: Alternatively, the fully trained model is available by request from the authors (not included into this repository due to its large size). I will contact the authors to ask for the model checkpoints, so we don't have to retrain it.

Thanks for all the work! Now focus the last couple of days in preparing a good final application :)

whoisorioki commented 1 year ago

Oh Okay, I get it! Thank you @GemmaTuron, let me focus on the final application :)

whoisorioki commented 1 year ago

Hey @GemmaTuron, I have submitted my final application:) Thank you for being helpful. I enjoyed working at Ersilia. I was asking if it is okay for you to assign me a task that I can do while waiting on 4th May. Thank you!

GemmaTuron commented 1 year ago

Hi @whoisorioki !

thanks for your work and your interest, we prefer to close the contribution period and continue the work with the selected applicants on May.

whoisorioki commented 1 year ago

Okay thank you @GemmaTuron.

ersilia-os / ersilia