Hello everyone. Excited to start contributing! I am starting with the installation of the Ersilia Model Hub. I will also work on a motivation letter alongside it and will update the progress of both shortly. I am working on a Windows machine and following the instructions mentioned here!
Task 1 - I joined Ersilia's Slack channel from their Outreachy landing page and was welcomed by a community of warm peers and team leads! It was reassuring and exciting to be part of something like this.
Task 2 - Opened this issue with success! :)
Task 3 - Since I am using a Windows platform, I installed WSL and an Ubuntu terminal environment as mentioned here. I faced a small issue where my Ubuntu terminal did not recognise WSL, so I had to enable it manually from Windows Features. It worked fine afterwards, and I continued through the mentioned steps.
I installed all the prerequisites:
- Conda 23.9.0
- Python 3.10.12
- 3.7
If it works, I will update it here, as instructed on Slack by @DhanshreeA. After the prerequisites, I installed the Ersilia tool!
Here I ran into an issue: even after I was done with the installation and had activated the conda environment, I could not run `ersilia --help` or `ersilia catalog`.
I determined this was because my WSL version was outdated, so I updated it to WSL 2 and made sure Ubuntu was using WSL 2. I also configured Docker Desktop to use WSL 2 after the update.
This fixed my problem and I was finally able to install Ersilia!!
The installation guide as well as my peers like @leilayesufu who shared a detailed documentation of their journey were of immense help wherever I got stuck and I am grateful to them.
Finally, onto testing!
The first few testing steps were calling a catalog function and running a simple model.
Alas, I am facing issues calling `ersilia catalog`, which fails with an Errno 101 Network Unreachable error. I have attached a log file below.
myfile.log
`ersilia --help` works fine.
So I moved on to `ersilia -v fetch eos3b5e`, which also fails, with a different error (log file below).
fetchLog.log
I have found a solution to this issue. Certain service providers in India block raw.githubusercontent.com, and that was causing the Errno 101. People may try switching to a different service provider or using a VPN, but a more feasible solution is changing your DNS to Cloudflare DNS:
- IPv4: 1.1.1.1 and 1.0.0.1
- IPv6: 2606:4700:4700::1111 and 2606:4700:4700::1001

Changing to Google DNS also works.
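For anyone wanting to confirm the diagnosis before touching DNS settings, a quick Python reachability check against the blocked host can help (a small sketch of my own, using only the standard library):

```python
# Quick reachability check for the host that was causing Errno 101.
import socket

try:
    # Resolve and connect to raw.githubusercontent.com on HTTPS port 443
    with socket.create_connection(("raw.githubusercontent.com", 443), timeout=5):
        print("raw.githubusercontent.com is reachable")
except OSError as e:
    # Errno 101 (Network unreachable) and timeouts land here
    print(f"Not reachable: {e}")
```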
This has worked well for me, and I have successfully tested and used a simple model deployment of Ersilia, getting the desired result specified in the instructions.
As instructed on Slack, I used Python version 3.10.12 and did not install isaura (thanks to @carcablop), hence I did not face this issue.
This was a great learning experience for me. I faced errors where I thought everything would go smoothly, while the places where I expected trouble flowed freely. A big bag of gratitude to the supportive peers who took the time to document their experiences comprehensively, and to the community leaders for taking the time to go through everyone's issues and give prompt solutions here and on Slack.
I'm excited for what is to come next! I will add shortly my motivations for applying to Outreachy and what I aim to achieve.
My name is Joyesh Banerjee. I am an engineering graduate from India, with a degree in electronics and communication engineering. I started out with C and Java but shifted to Python after a while because I was developing an interest in data analysis, and later data science in general. I did a few college projects in which we trained a prediction model on open-source datasets, and I found the work lively and rewarding. I come from a lower-middle-class family and we have always lived by the skin of our teeth, but thanks to God, we have come far. Every parent works themselves to the bone to give their children a better platform to grab opportunities than they had, and I believe this is such a platform that they envisioned for us.
I came across Outreachy as a recommendation from a friend who had applied previously, and I was warmed by how much soul they had as an organization. They were incredibly inclusive and gave me a chance to really tell my story in my application, which I appreciated, as did many of my peers, I am sure. So I was very happy to be given a chance as a contributor here, because when I was going through the projects, I felt a similar wave of warmth from Ersilia. This was corroborated by the cordial and supportive team I found myself in when I joined their channels. They were prompt and provided quick resolutions for issues. The best part was that even when many of us were facing issues with some initial tasks, and the solution had not yet been found, the team stayed interactive and didn't keep us in the dark. I already have some experience with training models, so I firmly believe my time in the internship will be the perfect incubator for my skills to grow rapidly and show themselves in the best of ways: helping people in need. For that I am primed and ready to learn and apply myself to the fullest. After the internship period has ended, I intend to keep applying my competence here by working with the community to plan meaningful contributions. I also want to challenge myself by learning new tech stacks like cloud (AWS) to make deployment of these models more efficient for users and further support Ersilia's growth. I want to eventually learn new skills to find ways to enhance current processes.
I have also seen my fair share of medical issues and what hurts most is medical incompetence. I lost my grandmother to undiscovered side effects during her treatment which blindsided the family and the doctors on her case. When I went through Ersilia's objectives and intentions with this technology, I couldn't help but think what could have happened if someone had thought of this a decade earlier.
As the saying goes - "The best time to plant a tree was 10 years ago. The second best time is now."
This community is working on a novel goal that will help countless people and I wish to become part of that effort to my utmost. It will swell my heart with joy if I can help further this idea to reality - so that a decade from now maybe one less child will wish someone had thought of something today.
After providing a detailed motivational letter, I have submitted a contribution report through Outreachy and linked this issue as instructed! The contribution has been recorded successfully.
Hello @joiboi08 thank you for the detailed updates. If you'd like, you can get started with the tasks from week 2 now. :)
Thank you! I'll update my progress in week 2 shortly.
After completing the Week 1 tasks, attending a wonderful and informative session, and taking some personal time, I have begun work on Week 2!
This week's tasks bring us the closest to real internship work so far. My understanding of the tasks:
1. Install one of the suggested models and run it locally, referencing the author's repo and source code as needed. Mention why this model was chosen.
2. Once it is installed and successfully tested, use the Essential Medicines List CSV as an input for the chosen model. Use the third-party implementation of the model to extract the output.
3. Compare this output to the Ersilia Model Hub implementation of the same model. Note any differences or points of interest.

Since I have no experience with Docker, I will test myself and use a Docker container to implement the Ersilia model.
This is my interpretation of the Week - 2 tasks. If I misinterpreted something I am happy to be corrected.
For this task, I was torn between the PPBopt model and the STOUT model. My understanding of the former is that it is a prediction engine that predicts how well a compound binds to proteins in the blood for transportation to target sites, finding important use cases in optimizing drug development costs and time. This model was interesting and is also the closest to my experience, as I have worked on prediction models before.
However, I wanted to try something different, and the STOUT model piqued my interest. My understanding is that it converts a compound's SMILES string (essentially the ASCII representation of its structure, so it is machine readable) to its IUPAC-defined name and vice versa. This model was also much better documented, and I felt I saw a more concrete roadmap here.
Hence, I opted to work with the STOUT model.
To start with, I am following these steps for the STOUT model installation.
First, I open the local Linux environment that I set up in the Week 1 tasks. I run some `--version` checks to ensure I am using appropriate versions:
- Python 3.11.4
- Conda 23.9.0
- WSL 2
All good! Onto the installation!
I first create a conda environment for the model:
`conda create --name STOUT`
Activating STOUT:
`conda activate STOUT`
Installing dependencies:
`conda install -c decimer stout-pypi`
Two attempts were made to fetch the repo data, but both failed.
Log file for the failure - fail.log
It could not find the `pystow` package. I tried `pip install pystow`, but it did not work; it gave the same error again.
So I tried the alternate method mentioned in the repo:
`pip install STOUT-pypi`
It successfully installed all the required packages (worth ~624 MB!).
Since the model was now installed, I was ready to test!
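Before the bigger test, a minimal smoke test of the installed library, based on the usage example in the STOUT repo (the caffeine SMILES is just an arbitrary choice of mine):

```python
# Minimal STOUT smoke test: translate one SMILES to an IUPAC name and back.
from STOUT import translate_forward, translate_reverse

smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"   # caffeine
iupac = translate_forward(smiles)          # SMILES -> IUPAC
print(f"IUPAC name: {iupac}")

back = translate_reverse(iupac)            # IUPAC -> SMILES round trip
print(f"Round-trip SMILES: {back}")
```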
While testing, I am running into some issues.
I needed a dataset for testing; I found a demo dataset in the STOUT repo and tried to use it. However, it keeps giving me an ImportError - log attached below. logs.log
EDIT - This is a post-Task-2-completion edit. Since I got the model working and the test file ran fine, I wanted to test it with a more comprehensive file like the demo file I mentioned here. However, when I ran it, just like with VSCode, the Ubuntu terminal also went into an unresponsive, suspended state and my CPU usage hit a constant 100% again. I'll try waiting an hour like this and will update progress here.
I was primarily running this on Ubuntu, but I have since switched to VSCode (and its CLI) for better maneuverability. I got the same ImportError referenced above. I found a solution by changing the relative import in the stout.py file to an absolute import:
`from .repack import helper`
to
`from repack import helper`
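For context on why this fix works (my own note, not from the STOUT docs): a relative import needs a parent package, which does not exist when the file is run directly as a script:

```python
# When stout.py is run directly (python stout.py), __package__ is empty,
# so "from .repack import helper" raises
# "ImportError: attempted relative import with no known parent package".
# The absolute form "from repack import helper" instead searches sys.path,
# which for a direct run starts with the script's own directory.
import sys
print(sys.path[0])   # the script's directory: why the absolute import resolves
```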
This allowed me to move forward. After I ran the code again, it finally downloaded the model and gave me a success message (in the VSCode CLI) that the model was loaded. But when running the code, it flagged the IUPAC_names_test.txt file with a FileNotFoundError.
So I switched the working directory with `cd STOUT`, which solved it.
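An alternative to changing the working directory is to resolve the data file relative to the script itself, so it runs from any directory (a small sketch of my own around the IUPAC_names_test.txt file mentioned above):

```python
# Resolve the test file relative to this script's location instead of
# relying on the current working directory.
from pathlib import Path

script_dir = Path(__file__).resolve().parent
test_file = script_dir / "IUPAC_names_test.txt"

with open(test_file) as f:
    names = [line.strip() for line in f if line.strip()]
print(f"Loaded {len(names)} IUPAC names")
```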
However, after this I ran the code hoping no issues would persist. But it is in a suspended state and has not given an output in the CLI. I checked Task Manager for clues, and it showed a constant 100% CPU usage the whole time I was watching.
I am going to try again.
- I retried `conda install -c decimer stout-pypi`, but it kept giving missing-package errors, so I skipped it and used `pip install STOUT-pypi` in `miniconda3\envs\STOUT`.
- I saw `Requirement already satisfied` a lot, because I kept retrying and rerunning the code, so many packages were already installed.
- I ran the test script with `python3 test.py` and got:
  `OSError: [Errno 0] JVM DLL not found: Define/path/or/set/JAVA_HOME/variable/properly`
- I fixed this by installing a JRE with `sudo apt install default-jre` and setting the Java path using `export JAVA_HOME=/usr/bin`, as this is where the previous step put the Java root.
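To avoid rediscovering this the hard way, a small guard at the top of the test script could fail fast with a clearer message (a sketch of my own, not part of STOUT; it only checks the variable set above):

```python
# Fail fast with a clear message if JAVA_HOME is missing, instead of the
# opaque "JVM DLL not found" OSError from the Java bridge.
import os
import sys

java_home = os.environ.get("JAVA_HOME")
if not java_home or not os.path.exists(java_home):
    sys.exit("JAVA_HOME is not set correctly; try: export JAVA_HOME=/usr/bin")
print(f"Using JAVA_HOME={java_home}")
```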
This period is shaping up to be a concrete learning experience for me. It gets a little mental sometimes, but it is always rewarding when I manage to pull through!
Now that we have set up our model and tested it once, we use it in a pseudo-real-world scenario.
I create a file `eml_result.py` in VSCode.
First, we go through the EML dataset provided here to determine what kind of conversion is being made.
I can see that three columns are given:
- `drugs` (the common name)
- `smiles`
- `can_smiles`
The dataset itself is very large, and it would have taken my machine a long time to process the entire set, so I decided to sample 20 data points as my input.
For that I needed the `csv` Python module.
I made my necessary imports:
```python
import csv
from STOUT import translate_forward
```
Then I converted the CSV into a `list` for easier handling, using this code:
```python
# intention is to convert the EML csv into a list version of EML
with open("eml_canonical.csv", newline='') as eml_csv:
    reader = csv.reader(eml_csv)   # returns each row of EML as a list
    eml_list = list(reader)        # list of each eml row as a list
```
This gave me `eml_list`, a list of each row of the EML CSV file. Next, I extracted the canonical SMILES into `can_smiles_list`:
```python
can_smiles_list = []                  # empty list that will hold canonical smiles
for name in eml_list[1:21]:           # first 20 SMILES rows, excluding the header
    can_smiles_list.append(name[2])   # index 2 is the can_smiles column
```
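As a side note, `csv.DictReader` could pull the column by its header name instead of the positional index 2, which would protect the script if the column order in eml_canonical.csv ever changed (an alternative sketch, not what I ran):

```python
# Same extraction, but keyed on the column header instead of index 2.
import csv
from itertools import islice

with open("eml_canonical.csv", newline='') as eml_csv:
    reader = csv.DictReader(eml_csv)                      # header row -> dict keys
    can_smiles_list = [row["can_smiles"] for row in islice(reader, 20)]
```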
Then I created a list `iupac_` to hold the translated names of all 20 canonical SMILES:
```python
iupac_ = []                            # empty list that will hold translated iupac names
for name in can_smiles_list:
    result = translate_forward(name)   # SMILES -> IUPAC translation
    iupac_.append(result)
```
🔴 This is where I am facing an issue. I am running this in VSCode and it does NOT recognise `STOUT` as a module on import. I have made sure my working directory is in the conda environment and I run my code from there, but it is not recognising it. Any help is appreciated!
After a discussion with my peer @PromiseFru, I was able to conclude that the problem was that I had to separately activate the STOUT environment in the VSCode CLI again. Since my working directory was in the conda env made during the installation of the model, I thought this wouldn't be an issue. To fix it, I could do two things: set up conda integration in the VSCode terminal, or manually activate the STOUT environment in the VSCode CLI each session.
I chose the latter as it saved time, but I will install conda on VSCode for future work.
Finally, I added a `for` loop to print the translated list in a readable manner:
```python
for i in iupac_:
    print(i)
```
Then I ran the script: `python3 eml_result.py`
Later, I removed the `print()` statement and instead wrote the translated IUPAC names to a separate `.csv` file, `predicted_iupac.csv`:
with open("predicted_iupac.csv", "w") as trans_iupac :
writer = csv.writer(trans_iupac)
for i in iupac_ :
writer.writerow(i)
Working data and Result data CSV files :
```python
# Since the EML file has canonical SMILES names,
# we import only translate_forward to translate from SMILES to IUPAC
import csv
from STOUT import translate_forward

#! CONVERTING EML CSV TO EML LIST OF LISTS
# intention is to convert the EML csv into a list version of EML
with open("eml_canonical.csv", newline='') as eml_csv:
    reader = csv.reader(eml_csv)   # returns each row of EML as a list
    eml_list = list(reader)        # list of each eml row as a list

#! EXTRACTING to-be-translated CANONICAL FORMS FROM SOURCE EML LIST
can_smiles_list = []                  # empty list that will hold canonical smiles
for name in eml_list[1:21]:           # first 20 SMILES rows, excluding the header
    can_smiles_list.append(name[2])   # index 2 is the can_smiles column

#! TRANSLATING CANONICAL SMILES TO IUPAC NAMES
iupac_ = []                            # empty list that will hold translated iupac names
for name in can_smiles_list:
    result = translate_forward(name)
    iupac_.append(result)

#! WRITING TRANSLATIONS TO FILE
# writes the list of translated iupac names to the file 'predicted_iupac.csv'
with open("predicted_iupac.csv", "w", newline='') as trans_iupac:
    writer = csv.writer(trans_iupac)
    for i in iupac_:
        writer.writerow([i])   # wrap in a list so the name stays in one column
```
I want to try completing this using docker!
First I will try and run the model locally using the instructions provided here.
- Changed my working dir to `miniconda3/envs/ersilia`
- Activated the Ersilia conda env (this step is easy to miss!): `conda activate ersilia`
- Checked that Ersilia was working by running `ersilia --help` and `ersilia catalog`: both ran well 💯
Now I was ready to fetch the model. The SMILES-to-IUPAC model is named `eos4se9`, with slug `smiles2iupac`.
- `ersilia fetch eos4se9` initially did not run. So I tried `docker`, which returned a command-not-found error. I launched Docker Desktop and ran `docker` again: success!
- `ersilia fetch eos4se9`: success!
- Served the model: `ersilia serve eos4se9`
Now the model is ready to use!
I will feed the model a test dataset (.csv) of two SMILES: task3.csv
The way to run the model, as mentioned here, is:
`ersilia api run -i <<input_file.csv>> -o <<desired_output_file_name.csv>>`
So I ran: `ersilia -v api run -i task3.csv -o result3.csv`
This gave the `TypeError: object of type 'NoneType' has no len()` error that was also faced in the initial Week 1 tasks. The best working solutions involved `read_input_columns`, which I have not tried. Please advise @DhanshreeA @carcablop @HellenNamulinda
Have you tried giving it a single input as opposed to the entire EML file to test it?
Hi @leilayesufu I haven't processed the entire file yet. I was feeding it a modified dataset task3.csv of 2 inputs as a test before giving it the entire EML set.
Okay, try testing it with a single input directly though, not through the file: `ersilia -v api run -i "Nc1nc(NC2CC2)c2ncn([C@H]3C=CC@@HC3)c2n1"`
I've run into a worse problem. I am unable to fetch or serve models; I keep getting the `connection reset by peer` error without fail. I have reinstalled the environment multiple times without this changing.
ConnectionResetLog.log
I'm going to try to do it, and I'll get back to you.
Thank you so much. I look forward to hearing your experience. I am using Ubuntu 22.04
Hi, so I fetched the model and served it, as seen. Then I ran `ersilia card eos4se9`; the output showed as seen here:
`"Code": "$ ersilia serve smiles2iupac\n$ ersilia api -i 'CCCOCCC'\n$ ersilia close",`
So to run predictions, I just did `ersilia api -i "CCCC"` and I got the output below, here.
This was just simple testing, although @PromiseFru ran it with some inputs from the EML file and it gave him a null output.
Hi @leilayesufu
Thank you for trying this out yourself.
I reinstalled my environment and tried your steps to the letter, but I kept getting one of three errors when I tried to use `ersilia api -i "CCCC"`:
- Errno 104 Connection reset by peer: wood.log
- Errno 111 Max retries / Connection refused (this happened only once)
Hi, I'm thinking it could be your network then.
Hi @leilayesufu, I have a strong connection, and I have made sure I am not running into this error again, as I can view raw.githubusercontent.com files. I was previously able to fetch and run models, but have only recently been unable to do so.
Hi, I would suggest removing the entire environment and starting afresh, or you could wait for a mentor's opinion. @DhanshreeA
Hi @joiboi08 as discussed over Slack, let me look into this more. I will get back to you by tomorrow.
Thank you @DhanshreeA and @leilayesufu. I am looking forward to the updates. Meanwhile, is it okay if I move on to the Week 3 tasks for now?
Hi, since the problem is a geographical one, I'll suggest using a VPN and changing your location to complete your Week 2 tasks. Of course, you'll need the go-ahead from @DhanshreeA
Hi @leilayesufu, thank you for your suggestion. I ran a VPN and did a fresh install of ersilia and the conda environment as well as the git packages. It is now successfully able to fetch and serve models so I am a little relieved. I believe my peer @Ajoke23 also mentioned this on Slack, thank you as well. @DhanshreeA VPN is working as an interim solution for the regional service outages.
Currently, I am again facing this problem: `TypeError: object of type 'NoneType' has no len()`
I am trying some solutions and will update here.
📢 📢 SOLUTION - On the advice of my peer @AlphonseBrandon, I added headers to my input file and it worked. Thank you so much.
- Input file [eml.csv](https://github.com/ersilia-os/ersilia/files/12895576/eml.csv): this file has the first 20 rows of canonical SMILES names (excluding the header row), to match the 20 rows of input used in the third-party STOUT implementation.
- Now I feed this file into my fetched model `eos4se9` using the command `$ ersilia -v api run -i 'eml.csv' -o 'result.csv'`
- I get the result output file [result.csv](https://github.com/ersilia-os/ersilia/files/12895771/result.csv). BUT the first 9 rows do not have a translation, and when I ran it again, this time NO rows were translated. During both runs, two things happened consistently:
  1. Batch prediction failed and it switched to individual prediction.
  2. I got a 504 error from every single row that failed to translate.
- I always get these DEBUG logs:
```
11:24:14 | DEBUG | Starting Docker Daemon service
11:24:14 | DEBUG | Creating temporary folder /tmp/ersilia-nb24uaof and mounting as volume in container
11:24:14 | DEBUG | Image ersiliaos/eos4se9:latest is available locally
11:24:14 | DEBUG | Using port 46089
```
- I didn't really have experience with Docker, but from what I could tell, the inputs were given to a Docker container running the `eos4se9` model, which in turn returned the predictions. Googling `error 504` informed me it is a timeout error. So basically, I was not getting translated outputs because the container kept timing out.
- I searched around and found two solutions:
  1. Instead of communicating with the container remotely, run the predictions directly from within the container.
  2. Increase the nginx request timeout.
- I chose option 1, as it will allow me to gain more experience with Docker.
- So, as I fetch and serve the models on Ubuntu, I see a corresponding container being created in Docker Desktop. From there I can get my container ID to call it in Ubuntu. From Docker Desktop, my current running container has the ID `eos4se9_7a24`.
Since processing the entire EML dataset will take an impractical amount of time (mainly due to hardware limitations), I have taken the first 20 rows of the dataset as input for both the STOUT 3rd party model and the Ersilia Hub Model.
I already have a modified dataset, [er_task3.csv](https://github.com/ersilia-os/ersilia/files/12907831/er_task3.csv), so the step to take here is to copy this file into the working dir of my container.
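For reference, a subset file like this can be produced with a few lines of Python (a sketch of my own; I am assuming a simple one-column layout with a `smiles` header, since adding headers was what fixed the earlier NoneType error):

```python
# Build a 20-row input file for the Ersilia model from the EML dataset:
# a header line plus the first 20 canonical SMILES.
import csv
from itertools import islice

with open("eml_canonical.csv", newline='') as src, \
     open("er_task3.csv", "w", newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["smiles"])            # header row (name is an assumption)
    for row in islice(reader, 20):         # first 20 data rows
        writer.writerow([row["can_smiles"]])
```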
I was able to complete this step using the `docker cp` command:
`$ docker cp er_task3.csv eos4se9_7a24:/root`
Thank you @leilayesufu @PromiseFru for helping me figure out the container dir!
To access this container through Ubuntu, I use the command:
`$ docker exec -it eos4se9_7a24 sh`
Check to see if the dataset is present using `# ls`:
![image](https://user-images.githubusercontent.com/94055810/275242982-3aad8167-456f-4d01-b69e-a434d08b3e22.png)
It is! Great!
We input the dataset, run the model, and output the result into a file:
`# ersilia -v api run -i er_task3.csv -o er_result.csv`
The generated result is still in the container; to access it, we need to copy it to our local system:
`$ docker cp eos4se9_7a24:/root \\wsl.localhost\Ubuntu\home\joyesh\miniconda3\envs\ersilia`
Here, the container files are copied over to the mentioned destination, and we can easily find our result file.
Successfully predicted all SMILES names to IUPAC!
Comparison between the STOUT implementation and the Ersilia implementation
After getting both results, I wanted to combine both result files into a single csv or excel file. For that, I wrote some Python code to:
- turn both .csv files into respective lists
- combine those lists into one list of format `[<STOUT iupac name>, <Ersilia iupac name>]`
- turn that list into a single .csv file

I used the csv module again. First, turning both files into lists:
```python
import csv

# list of lists: each Ersilia translation row
with open("er_result.csv", newline='') as ers:
    ers_base_list = list(csv.reader(ers))

ers_list = []
for name in ers_base_list[1:]:        # skip the header row
    ers_list.append(name[2])          # keep only the iupacs_names column

# list of lists: each STOUT translation row
with open("111predicted_iupac111.csv", newline='') as stout:
    stout_base_list = list(csv.reader(stout))

stout_list = []
for name in stout_base_list:          # no header in this file
    stout_list.append(name[0])        # list of stout translations
```
[<STOUT iupac name>, <Ersilia iupac name>]
result_list = []
for i in range(0,20) : # because 20 SMILES names were translated
result_list.append(stout_list[i]) # adding STOUT IUPAC
result_list.append(ers_list[i]) # adding Ersilia IUPAC
with open("comparison_result.csv", "w") as comp :
writer1 = csv.writer(comp)
for i in result_list :
writer1.writerow([i])
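A small design note: writing the two names as alternating rows works, but pairing them as columns would make the comparison easier to scan in a spreadsheet. A variant of the same loop (my own sketch, continuing from the lists built above):

```python
# Write STOUT and Ersilia translations side by side, one SMILES per row.
import csv

with open("comparison_result.csv", "w", newline='') as comp:
    writer = csv.writer(comp)
    writer.writerow(["stout_iupac", "ersilia_iupac"])   # header row
    for stout_name, ers_name in zip(stout_list, ers_list):
        writer.writerow([stout_name, ers_name])
```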
My Interpretations
- The STOUT third-party model was, for me, more modular in terms of determining an output format. The output file generated ([as seen here](https://github.com/ersilia-os/ersilia/files/12896825/111predicted_iupac111.csv)) does NOT have excess columns, only the translated IUPAC names. This made it easier to work with, as it required less cleaning/prepping.
- The Ersilia model, however, gives a verbose multi-column output without an opportunity to change it, since the output file is generated directly by the model, whereas with the STOUT model we imported the STOUT module and used the `translate_forward` function in code we wrote from scratch.
- On the flip side, this makes the Ersilia implementation more time-efficient, as no Python is needed to generate an output file. This carries greater weight, as deployment time and efficiency matter more here.
- As for the result contents themselves, there is mostly no difference save for a few situations. The models perform similarly or identically for smaller, less complex inputs like `CCCC` or `CC(=O)O`.
- But minor differences start to show for larger, more complex inputs, as seen in this pair of translations of the same input (one name from each implementation; see the similarity sketch below):
  - (3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol
  - (1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol
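To put a rough number on how close two long IUPAC names are, the standard library's difflib gives a quick similarity ratio (a sketch of my own over the two names above):

```python
# Quantify how similar two IUPAC name strings are (1.0 = identical).
from difflib import SequenceMatcher

name_a = "(3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol"
name_b = "(1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol"

ratio = SequenceMatcher(None, name_a, name_b).ratio()
print(f"Similarity: {ratio:.2f}")
```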
This week was the greatest challenge yet, as I made myself familiar with new technology and got stuck A LOT!! However, it was joyous to see myself progress. Excited for the next tasks!
Marking Week 2 complete!
This marks the start of some real field work!
PIGNet2 - A Versatile Deep Learning-based Protein-Ligand Interaction Prediction Model for Binding Affinity Scoring and Virtual Screening
Publication - Papers With Code
Source Code - Github
Authors : Seokhyun Moon, Sang-Yeon Hwang, Jaechang Lim, Woo Youn Kim
Date Published - July 3, 2023
Their objective is to predict PLI (Protein-Ligand Interaction) in the form of screening (identifying compounds that possibly have binding affinity OR do not have it) and scoring (predicting the binding affinity of the protein-ligand complex in a way that is comparable to experimental values) as well as improving the binding process.
Why? It is not the first model to try to predict PLI. However, the important differentiator is that this model achieves high-accuracy results in two different tasks simultaneously using the same dataset. Most other models are trained in a task-specific way, i.e., they perform well in one task but cannot do well in a different one. This is due to the lack of experimental structure-affinity data, which limits the generalization ability of existing models. This makes PIGNet2 a novel ML model that is dexterous and gives high-accuracy results for different tasks in a relatively efficient manner, vis-à-vis having separate models for each of those tasks that still provide lower accuracy.
Their solution to creating a generalized model despite the lack of available structure-affinity experimental data was to use an inductive bias and augment the existing data, creating similar, near-native structures that are energetically and geometrically similar to the crystal structures. The model was then trained to predict the binding affinity of these structures to be the same as the experimental value. This gave PIGNet2 significantly enhanced scoring and screening performance. A toy sketch of this idea follows below.
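This is purely my own illustrative reading of the augmentation idea, not the authors' code (the pose features, perturbation, and affinity value are all made up): near-native variants generated from one crystal structure all inherit that structure's experimental label.

```python
# Toy illustration of the PIGNet2 data-augmentation idea (my reading of the
# paper, not the authors' code): near-native poses generated from a crystal
# structure are trained toward the SAME experimental affinity label.
import random

def perturb(pose, scale=0.1):
    # Hypothetical augmentation: jitter pose features slightly, standing in
    # for the energetically/geometrically similar near-native structures.
    return [x + random.uniform(-scale, scale) for x in pose]

crystal_pose = [0.5, 1.2, -0.3]     # stand-in for real structural features
experimental_affinity = -7.4         # stand-in experimental label

training_set = [(crystal_pose, experimental_affinity)]
for _ in range(10):                  # augment with near-native poses
    training_set.append((perturb(crystal_pose), experimental_affinity))

print(f"{len(training_set)} training pairs share one experimental label")
```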
Hi @joiboi08 many congratulations on making it this far. Good job on learning more about working with docker, thank you @leilayesufu and @PromiseFru for all the help here.
As for the network connection issue we faced earlier, it seems to have resolved itself after a couple of days, and I can work with Ersilia normally again without a VPN. (As guessed, it was probably a regional outage.)
I'm having fun learning new things, @DhanshreeA! And thank you for the update; it is a relief that it is not something permanent ☺️
To run this model, I followed the instructions mentioned in their repository.
pip3 install torch torchvision torchaudio
This command was generated by the PyTorch website according to the options you need.
gh repo clone ACE-KAIST/PIGNet2
conda create -n pignet2 python=3.9
conda activate pignet2
pip install -r requirements.txt
cd PIGNet2/dataset
bash download.sh
bash untar.sh
I have been facing some hardware problems with its implementation, in that it eats up all available space on my C drive and sometimes uses up all the RAM, causing other applications to fail. I am looking into an alternative implementation that could work better than running it locally.
ChemProp - A Message Passing Neural Network for Molecular Property Prediction and its Application in A Deep Learning Approach to Antibiotic Discovery
Publication - CELL
Source Code - Github
Authors : Jonathan M. Stokes, Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos-Ruiz, Nina M. Donghia, Craig R. MacNair, Shawn French, Lindsey A. Carfrae, Zohar Bloom-Ackermann, Victoria M. Tran, Anush Chiappino-Pepe, Ahmed H. Badran, Ian W. Andrews, Emma J. Chory, George M. Church, Eric D. Brown, Tommi S. Jaakkola, Regina Barzilay, James J. Collins
Date Published - February 20, 2020
Their objective is to use a molecular property prediction model to screen for possible new antibiotic compounds by predicting the likelihood that a molecule would inhibit the growth of E. coli.
Why? This is an important objective because antibiotic effectiveness is falling globally, and this is a major health concern. According to the WHO, a growing number of infections, such as pneumonia, tuberculosis, gonorrhoea, and salmonellosis, are becoming harder to treat as the antibiotics used to treat them become less effective. While antibiotic resistance occurs naturally, the misuse of antibiotics in humans and animals is accelerating the process. This misuse refers to people often stopping the drug when they start to feel better, as opposed to completing the prescribed course. This wipes out just enough of the bacteria to heal the person, but leaves enough behind that the next generation will be better adapted to resisting the antibiotic.
AMPlify: attentive deep learning model for discovery of novel antimicrobial peptides effective against WHO priority pathogens
Publication - PubMed
Source Code - Github
Authors : Chenkai Li, Darcy Sutherland, S Austin Hammond, Chen Yang, Figali Taho, Lauren Bergman, Simon Houston, RenΓ© L Warren, Titus Wong, Linda M N Hoang, Caroline E Cameron, Caren C Helbing, Inanc Birol
Date Published - 25 January, 2022
Their objective is to use a deep learning model (AMPlify) to predict effective peptides against a panel of WHO priority pathogens.
Why? The concern it addresses is the same, i.e., the growing resistance to antibiotics and their globally degrading effectiveness. However, the proposed solution here is different. Unlike the second model, where the goal was to find compounds similar to antibiotics, this model sets out to find ALTERNATIVES to antibiotics in novel antimicrobial peptides (AMPs), which are general-purpose drugs acting against bacteria, viruses, fungi, and parasites. They tested the predicted peptides against a list of WHO priority pathogens, and 4 of the novel AMPs proved effective against multiple species of bacteria, including a multi-drug-resistant isolate of E. coli.
Just updating that I have submitted the final application through Outreachy along with a timeline on 27 October, 2023!
Hello,
Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!
Week 1 - Get to know the community
Week 2 - Install and run an ML model
Week 3 - Propose new models
Week 4 - Prepare your final application