ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.

https://ersilia.io

GNU General Public License v3.0

189 stars 123 forks source link

✍️ Contribution period: Bronch Mukami #843

Closed MadeaRiggs closed 8 months ago

MadeaRiggs commented 9 months ago

Week 1 - Get to know the community

[X] Join the communication channels
[X] Open a GitHub issue (this one!)
[X] Install the Ersilia Model Hub and test the simplest model
[x] Write a motivation statement to work at Ersilia
[x] Submit your first contribution to the Outreachy site

Week 2 - Install and run an ML model

[x] Select a model from the suggested list
[x] Install the model in your system
[x] Run predictions for the EML
[x] Compare results with the Ersilia Model Hub implementation!
[x] Install and run Docker!

Week 3 - Propose new models

[x] Suggest a new model and document it (1)
[x] Suggest a new model and document it (2)
[x] Suggest a new model and document it (3)

Week 4 - Prepare your final application

[x] Submit the final application in the Outreachy website

MadeaRiggs commented 9 months ago

Introduction

This is a tutorial to for those who are using Windows OS(WSL)

Week 1

1) First start by installing WSL(if you don't have it) and install other prerequisites such as docker, git, etc as per this documentation: https://learn.microsoft.com/en-us/windows/wsl/install

Task 3 Install the Ersilia Model Hub and test the simplest model

2) Proceed with the prerequisite installation of Ersilia with this as the guide: https://ersilia.gitbook.io/ersilia-book/ersilia-model-hub/installation

3) You can choose to install based on the previous guide or the Ersilia GitHub README file. There are some slight changes like change in the Python version used but the rest are the same: https://github.com/ersilia-os/ersilia

4) Serve the model, predict values and you can close the model: https://ersilia.gitbook.io/ersilia-book/ersilia-model-hub/antibiotic-activity-prediction

MadeaRiggs commented 9 months ago

You may encounter an error when running :

ersilia run -i my_molecules.csv -o my_predictions.csv

The solution has been provided by LeilaYesufu here: https://github.com/ersilia-os/ersilia/issues/821#issuecomment-1744581901

MadeaRiggs commented 9 months ago

Motivation statement to work at Ersilia

Ersilia is an organization with a goal that is both audacious and exciting. My own journey and objectives in the field of data are in line with Ersilia's unwavering dedication to using data to improve society.

What draws me to Ersilia is not merely its mission but the spirit of adventure that permeates every facet of its work. It's like setting off on an exciting data trip where every dataset carries the possibility of revealing hidden insights. On this journey, data turns from being merely information into a catalyst for change. My goals for development are well matched with Outreachy's commitment to diversity and inclusivity and Ersilia's passion to mentoring. I'm ready to learn everything I can, get some real-world experience, and bring new ideas to your dynamic team.

Growing up I was interested in medicine especially forensics. I loved watching crime movies (I still do) to a point I decided I would one day work in a federal institution like the (CIA) but my love for computers was greater than forensics. It is thrilling to investigate and know more about the human body which is among the reasons why crime is among my favorite movie genres. I'm hoping this chance at Ersilia will give me an opportunity to merge my interest in medicine and technology in a way that can help people especially here in the Global south to aid in drug discovery among other causes.

Beyond formulas and code, Ersilia presents itself as a close-knit group of passionate people bound together by the desire to have a significant effect. It's reassuring to see how supportive and knowledgeable fellow members and mentors take time form their busy schedule to help those who are when faced with difficulties. This atmosphere of support for one another demonstrates the team's commitment to the mission. I am prepared to contribute my expertise and unflinching resolve to write new chapters in the amazing narrative of advancement that Ersilia creates every day.

Stepping outside of my comfort zone is scary but I'm enthusiastic about it. Although I've felt at ease as a data analyst, this chance at Ersilia catapults me into the fascinating field of data science, which covers everything from modeling to implementation. This year, I set a clear goal for myself: to develop thoroughly in terms of both skills and knowledge. I'm convinced that Ersilia will serve as my compass as I strive for excellence in the domains of data science and biomedicine.

I'm hoping that by the end of the internship, I will have sharpened my skills in Python, ML and Git and be able to make changes especially in my country to aid the biomedicine sector. The partnership between Ersilia and Outreachy offers a special chance to flourish in a setting that prioritizes diversity, inclusivity, and social impact. This make me yearn the opportunity to add to Ersilia's outstanding work and gain knowledge from the industry's top experts.

You have my most profound gratitude for giving me this priceless opportunity, and I look forward to the rewarding road of growth, discovery, and meaningful data science that lies ahead.

leilayesufu commented 9 months ago

Hi, please don't change the codebase. Try reinstalling the environment as suggested here. https://ersilia-outreachy-w23.slack.com/archives/C05V51PS6FJ/p1696451488224489

in response to https://github.com/ersilia-os/ersilia/issues/843#issuecomment-1747517316

carcablop commented 9 months ago

Hi @leilayesufu @MadeaRiggs. Thank you!. Uninstall isaura if you have installed it. Please don't change the code this error occurs when we install isaura

MadeaRiggs commented 9 months ago

Hello @carcablop and @leilayesufu , I appreciate your feedback I ended up uninstalling then reintstalling miniconda which worked but when running these commands from the README file : 4) Generate a few (5) example molecules, to be used as input. The example command will generate the adequate input for the model in use

ersilia example retrosynthetic-accessibility -n 5 -f my_molecules.csv

5) Then, serve your model:

ersilia serve retrosynthetic-accessibility

6) And run the model:

ersilia run -i my_molecules.csv -o my_predictions.csv

I get a TypeError:

File "/mnt/c/Users/Kami/ersilia/ersilia/io/readers/file.py", line 321, in read_input_columns
    if len(h) == 1:
TypeError: object of type 'NoneType' has no len()

But other codes in the Ersilia book work well and predict values for step 6

 ersilia api run -i "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"
{
    "input": {
        "key": "NQQBNZBOOHHVQP-UHFFFAOYSA-N",
        "input": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]",
        "text": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"
    },
    "output": {
        "outcome": [
            0.9924924
        ]
    }
}

So I don't understand what the problem could be

DhanshreeA commented 9 months ago

Hi @MadeaRiggs thank you for the updates. As for your question, it could be perhaps that this output you have shared for a single input is pre-calculated and coming from the isaura lake. That is, the model wasn't actually queried to generate this output. Thanks to @carcablop we have come to identify that isaura is causing some issues with ersilia currently, and while we try to figure out why that is happening, we ask you to uninstall isaura if you have it installed and then run the prediction commands. Hope this helps.

MadeaRiggs commented 9 months ago

Hello @DhanshreeA, thank you for your feedback. I followed what @carcablop said and uninstalled and reinstalled conda environment and this time during the reinstalling I did not use Isaura lake but I got the above error. So, I think something else could be the problem which I'm not sure if it is on my end or somethiing.

carcablop commented 9 months ago

Hi @MadeaRiggs. Have you modified the ersilia code base?,I'm not sure if you cloned the ersilia repository again. Could you list the packages you have installed in your ersilia environment?. Could you share the complete outputs logs when you run the ersilia commands to run the example model? Provide description information about your system (python version, conda version, etc.)

Thanks so much!.

MadeaRiggs commented 9 months ago

Hello @DhanshreeA , my most sincere apologies for the late reply. I had electricity issues and I didn't have enough time to recreate the error till today as it is now stable. I have began the reinstallation process again and give an update when it is done

MadeaRiggs commented 9 months ago

Hello @DhanshreeA , I have recreated the process again and I'm still getting the same error when running:

ersilia run -i my_molecules.csv -o my_predictions.csv

Here's the log file predictions_error.txt

When I verbosed the error, I saw that it required Isaura lake installed: verbosed_prediction_error.txt

But this code ran successfully:

ersilia api run -i "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"

and

ersilia api run -i "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"

Here's the log file: ersilia_output.txt

Installed dependencies : dependencies.txt

conda verion- 23.9.0 python 3.10

MadeaRiggs commented 9 months ago

Week Two

Task One : Select a model from the suggested list

I have began tasks of week two and I decided to go with this model:

Plasma Protein Binding (IDL-PPBopt)

It is found_ here: https://github.com/Louchaofeng/IDL-PPBopt

This is a prediction model PPB of chemical substance binding to human plasma proteins. PPB is a crucial pharmacokinetic factor for medications since it affects how well they are absorbed and distributed in the body. I chose this model because its advantages include:

Drug Candidate Screening: By evaluating each candidate's propensity to bind to plasma proteins, the model can be used to filter out unwanted candidates. This is crucial when assessing new substances or repurposing current medications for the treatment of infectious and underutilized illnesses.
Lowering Drug Development Costs: The process of developing new drugs is expensive and time-consuming. Early in the drug development pipeline, predicting PPB can assist in weeding out compounds that are less likely to be useful, which can save money and hasten the creation of viable medicines.
Drug Development: Increasing access to medical care and medications is Ersilia's goal. The PPB prediction model can help with drug development by offering information about the interactions between various substances and plasma proteins. To optimize medication compositions, this information is essential.

I created a new environment to install the dependencies. The dependencies required are listed in the README file which are:

Python ==3.7
Pytorch 1.5.0 which is installed by
```
pip install torch==1.5.0
```
Openbabel 2.4.1 which is installed by running
```
pip install openbabel
```
By default it will install version 2.4.1 but it is good to confirm The rest of the dependencies were installed using "pip install"
Rdkit - A collection of machine learning and cheminformatics tools.
Scikit learn
Scipy- scientific computations
Cairosvg- visualization of molecules

DhanshreeA commented 8 months ago

Hello @DhanshreeA , I have recreated the process again and I'm still getting the same error when running:
ersilia run -i my_molecules.csv -o my_predictions.csv
Here's the log file predictions_error.txt

When I verbosed the error, I saw that it required Isaura lake installed: verbosed_prediction_error.txt

But this code ran successfully:
ersilia api run -i "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"
and
ersilia api run -i "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"
Here's the log file: ersilia_output.txt

Installed dependencies : dependencies.txt

conda verion- 23.9.0 python 3.10

Thank you for your exhaustive efforts with this @MadeaRiggs. We will look into why this behavior is so flaky across inputs (whether Isaura is installed or not)

DhanshreeA commented 8 months ago

Week Two

Task One : Select a model from the suggested list

I have began tasks of week two and I decided to go with this model:

Plasma Protein Binding (IDL-PPBopt)

It is found_ here: https://github.com/Louchaofeng/IDL-PPBopt

This is a prediction model PPB of chemical substance binding to human plasma proteins. PPB is a crucial pharmacokinetic factor for medications since it affects how well they are absorbed and distributed in the body. I chose this model because of the advantages of this model are:
* It helps to improve pharmacokinetic understanding

* Optimize drug candidates for increased safety and efficacy

* Contribute to the effectiveness and success of drug development
  This may significantly affect the pharmaceutical sector by lowering costs, expediting the drug discovery process, and enhancing patient outcomes especially in Third world countries.
I created a new environment to install the dependencies. The dependencies required are listed in the README file which are:
* Python ==3.7

* Pytorch 1.5.0 which is installed by
pip install torch==1.5.0
* Openbabel 2.4.1 which is installed by running
pip install openbabel
By default it will install version 2.4.1 but it is good to confirm The rest of the dependencies were installed using pip install
* Rdkit

* Scikit learn

* Scipy

* Cairosvg
Task Two : Install the model in your system

I clone the repository :
git clone https://github.com/Louchaofeng/IDL-PPBopt
cd IDL-PPBopt
Here you'll find the saved model, input csv file, python notebook and other folders

Task Three : Run predictions for the EML

I ran the notebook on VS Code. This is after setting up the required extensions which are:

@MadeaRiggs any further updates with this?

MadeaRiggs commented 8 months ago

Hello @DhanshreeA , yes I have some updates and do require some assistance with some information

Task Two : Install the model in your system

I clone the repository :

git clone https://github.com/Louchaofeng/IDL-PPBopt
cd IDL-PPBopt

Here you'll find the saved model, input csv file, python notebook and other folders

I ran the notebook on VS Code. This is after setting up the required extensions which are:

WSL
Remote: Explorer, SSh, Tunnels, Development
GitHub
Dev containers
Jupyter
Python

Running the notebook, I noticed other dependencies were required which were installed using conda in the environment:

Matplotlib
Pandas
Pickle

Continuing to run the notebook in the first cell, I kept getting this error which actually took a lot of time fixing it:

running cells with plasmaprotein requires the ipykernel to be installed or updated

In this case, plasmaprotein is the name of my conda environment. I had tried several solutions but the ones that worked were:

1) VS Code suggestion- executed in the WSL or VS Code terminal after activating the environment

conda install -n plasmaprotein ipykernel --update-deps --force-reinstall

This ran successfully but when running the cell, it still gave the same error. That led me to this:

python -m ipykernel install --user --name plasmaprotein

This resulted in an error:

ImportError: cannot import name 'secure_write' from 'jupyter_core.paths' (/home/ubuntu/miniconda3/envs/plasmaprotein/lib/python3.7/site-packages/jupyter_core/paths.py)

To solve this, I found the solution here: https://github.com/jupyter/notebook/issues/5014#issuecomment-547762322

pip install --upgrade jupyter_client

Then run "python -m ipykernel install" command again and it runs successfully.

Reload the VS Code, reactivate the environment or restart your machine if need be which I did and after all this, finally the first cell ran successfully.

2) CUDA This cell raised an error:

batch_size = 64
p_dropout= 0.1
fingerprint_dim = 200

weight_decay = 5 # also known as l2_regularization_lambda
learning_rate = 2.5
output_units_num = 1 # for regression model
radius = 2
T = 2

x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array([canonical_smiles_list[0]],feature_dicts)
num_atom_features = x_atom.shape[-1]
num_bond_features = x_bonds.shape[-1]
loss_function = nn.MSELoss()
model = Fingerprint(radius, T, num_atom_features, num_bond_features,
            fingerprint_dim, output_units_num, p_dropout)
model.cuda()

best_model = torch.load('saved_models/model_ppb_3922_Tue_Dec_22_22-23-22_2020_'+'54'+'.pt', map_location='cpu')

best_model_dict = best_model.state_dict()
best_model_wts = copy.deepcopy(best_model_dict)

model.load_state_dict(best_model_wts)
(best_model.align[0].weight == model.align[0].weight).all()

model_for_viz = Fingerprint_viz(radius, T, num_atom_features, num_bond_features,
            fingerprint_dim, output_units_num, p_dropout)
model_for_viz.cuda()

model_for_viz.load_state_dict(best_model_wts)
(best_model.align[0].weight == model_for_viz.align[0].weight).all()

ERROR:

AssertionError: 
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...

I fixed it by commenting out the following lines of code:

model.cuda()
model_for_viz.cuda()

Going through the code I noticed that most instances of the "torch." code was followed by "cuda" and since my machine does not have GPU, I had to remove the "cuda" in the notebook and the AttentiveFP folder to remain with for example:

torch.FloatTensor()

And in this line of code, I specified for it to be mapped to cpu:

best_model = torch.load('saved_models/model_ppb_3922_Tue_Dec_22_22-23-22_2020_'+'54'+'.pt', map_location='cpu')

The last cell had a syntax error by having an extra parenthesis in the "f.write function", but it was fixed:

f.write(str(r[i]['SA']) + '\t' + str(r[i]['Non_SAs']) + '\t' + str(r[i]['score']) + '\t' + str(r[i]['RES']) + '\t' + str(r[i]['CES']) + '\t' + str(r[i]['ZES']) + '\t' + str(r[i]['NTS']) + '\n')

After all that, rest of the notebook ran successfully and these were the results:

MadeaRiggs commented 8 months ago

Task Three : Run predictions for the EML

MODEL SUMMARY The model process was as follows:

Importing libraries, data and custom models from AttentitiveFP folder
Data preprocessing to remove invalid SMILES strings remaining with a dataframe that consists of valid compounds and their SMILES strings
Generating molecular fingerprints of the data using the imported custom models
Using the pretrained deep learning model ('saved_models/model_ppb_3922_Tue_Dec_22_22-23-222020') to predict the PPB property values for a subset of the compounds.
Calculate the statistics of the substructures that are significant (SAs) and non significant (non-SAs) for the "ppb" attribute and visualize different compounds.

RESULTS

Step 4: Predict Values

After running:

remain_pred_list = eval(model, remained_df)
remained_df['Predicted_values'] = remain_pred_list
remained_df

OUTPUT

Based on my research to fathom how the model works, I stand to be corrected, but it seems that these values are the predicted PPB fractions of a compound.

Step 6: Identify Privileged Substructure for each molecule

The model iterated through various substructures of the compounds. The processes are:

1) Generate substructure patterns (substructure fragments) using the SMILES notation of the chemical molecule using the RDKit's Chem.MolFromSmarts() function.

2) The chemical compound's SMILES notation is used by the code to generate an RDKit molecule object using Chem.MolFromSmiles(). It identifies the compound's atom numbers that correspond to the substructure pattern.

3) Calculate p-values for each substructure using the Mann-Whitney U test. Two sets of values are subjected to the test:

Fragment_atoms: The values (features) connected to the substructure's atoms.
Rest_atoms: The characteristics (values) of the compound's remaining atoms, excluding those in the substructure. The test is run with the alternative hypothesis set to "greater," which means that it is searching for substructures connected to noticeably higher values.

CRITERIA

Substructures that have p-values less than 0.05 are considered privileged substructures. These are substructures that are associated with significantly higher feature values compared to the rest of the compound. From the four compound inputs, only one did not have privileged substructures while the rest had privileged substructures.

Compound without Privileged Substructures

O=C(O)CC(c1ccccc1)n1ccc2cc(OCCc3ccc4c(n3)NCCC4)ccc21
[]
Predicted PPB fraction: 0.9707257
Dectected Priviledged Substructures: []

Compound with Privileged Substructures

CN(C)Cc1cncc(C(CC(=O)O)n2ccc3cc(OCCc4ccc5c(n4)NCCC5)ccc32)c1
['*Cc1cncc(C(C*)*)c1']
Predicted PPB fraction: 0.8506337
Dectected Priviledged Substructures: ['*Cc1cncc(C(C*)*)c1']

The final output, which was the "Results.smi" had the following values:

SA_Fragment NAS Score   RES CES ZES NTS
CN1CC(*)C1  *c3ccccc3   0.14786313364055315 0   23  41  131
CN1CC(*)C1  *C(=O)NC2C(=O)N3C(=CCSC23)C(=O)O    -0.1598742305685968 57  0   0   0
CN1CC(*)C1  *C3CN(CCO)C3    -0.4474728208647905 106 0   0   0
CN1CC(*)C1  CC=C*   -0.29621884951206984    265 7   4   12
CN1CC(*)C1  *C(=O)NC1C(=O)N2C(=C(C)CSC12)C(=O)O -0.15223166491043205    55  0   0   0
CN1CC(*)C1  *C(=O)C(O)* 0.21906800000000004 16  0   4   34
CN1CC(*)C1  *CSC*   -0.3888930158730158 278 14  5   0
CN1CC(*)C1  *C1=C(C(=O)O)N2C(=O)C(NC(=O)C*)C2SC1    -0.1627462264150944 73  0   0   0

SA_Fragment- The substructure pattern designated for the "ppb" attribute
NAS- Number of Non-Significant Substructures (non-SAs) connected to each SA
Score- The level of each SA's statistical significance
Reliable Substructures (RES)- These SAs are directly connected to the originating SA and have a minimum distance from it of 0. Its presence shows that the SA and the "ppb" characteristic are closely related
Connected Substructures (CES)- These SAs are bound to the parent SA by a single bond and are separated from it by a minimum distance of 1
Zero-distance Substructures (ZES): it reveals the quantity of substructures that are at least 0 distances from the SA. They are associated with the SA directly
Non-Target SAs(NTS)- These are auxiliary constructions that are not a part of the SA

For example this compound:

CN1CC()C1

These are the values:

NAS: c3ccccc3 (19 non-significant substructures)
Score: 0.14786313364055315
RES: 0
CES: 23
ZES: 41
NTS: 131

I didn't fully understand the role of the "ppb_3922.csv" and this code but I'm seeking more information and I'm requesting @DhanshreeA to kindly assist. Here are the results after reading the ppb_3922.csv file:

CN1CC(*)C1 matches 125 compounds
Totally find 5062 fragments
For CN1CC(*)C1 totally find 8 second-level substructures!

MadeaRiggs commented 8 months ago

Task Four : Compare results with the Ersilia Model Hub implementation!

I got the "eml_canonical.csv" dataset from Ersilia's Essential Medical list. I made a copy of the "IDL-PPBopt.ipynb" notebook to run this dataset as the input file.

ERRORS

Step 2: Calculate molecular feature

When running this code,

feature_dicts = save_smiles_dicts(smilesList,filename)
remained_df = smiles_tasks_df[smiles_tasks_df["cano_smiles"].isin(feature_dicts['smiles_to_atom_mask'].keys())]
uncovered_df = smiles_tasks_df.drop(remained_df.index)
print(str(len(uncovered_df.cano_smiles))+' compounds cannot be featured')
remained_df = remained_df.reset_index(drop=True)

I got this error:

TypeError: No registered converter was able to produce a C++ rvalue of type std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > from this Python object of type float

Upon further investigation, I noticed that in the previous cell output in

Step 1: Prepare Input file

When loading the dataset using this code,

task_name = 'ppb'
tasks = ['endpoint']

raw_filename = "eml_canonical.csv"
feature_filename = raw_filename.replace('.csv','.pickle')
filename = raw_filename.replace('.csv','')
prefix_filename = raw_filename.split('/')[-1].replace('.csv','')
smiles_tasks_df = pd.read_csv(raw_filename)
smilesList = smiles_tasks_df.cano_smiles.values
print("number of all smiles: ",len(smilesList))
atom_num_dist = []
remained_smiles = []
canonical_smiles_list = []
for smiles in smilesList:
    try:
        mol = Chem.MolFromSmiles(smiles)
        atom_num_dist.append(len(mol.GetAtoms()))
        remained_smiles.append(smiles)
        canonical_smiles_list.append(Chem.MolToSmiles(Chem.MolFromSmiles(smiles), isomericSmiles=True))
    except:
        print(smiles)
        pass
print("number of successfully processed smiles: ", len(remained_smiles))
smiles_tasks_df = smiles_tasks_df[smiles_tasks_df["cano_smiles"].isin(remained_smiles)]
# print(smiles_tasks_df)
smiles_tasks_df['cano_smiles'] =canonical_smiles_list
assert canonical_smiles_list[0]==Chem.MolToSmiles(Chem.MolFromSmiles(smiles_tasks_df['cano_smiles'][0]), isomericSmiles=True)

The output was

number of all smiles:  443
[18:57:41] WARNING: not removing hydrogen atom without neighbors
[18:57:41] WARNING: not removing hydrogen atom without neighbors
[18:57:41] WARNING: not removing hydrogen atom without neighbors
[18:57:41] WARNING: not removing hydrogen atom without neighbors
[18:57:41] WARNING: not removing hydrogen atom without neighbors
[18:57:41] WARNING: not removing hydrogen atom without neighbors
nan
number of successfully processed smiles:  442
[18:57:41] WARNING: not removing hydrogen atom without neighbors
[18:57:41] WARNING: not removing hydrogen atom without neighbors
[18:57:41] WARNING: not removing hydrogen atom without neighbors
[18:57:41] WARNING: not removing hydrogen atom without neighbors

There was a nan value in the dataset. To solve this issue, I found this solution referenced here: https://github.com/rdkit/rdkit/issues/2994#issuecomment-1026085560

I removed the nan value in Step 1 code cell using:

smiles_tasks_df = smiles_tasks_df[smiles_tasks_df.cano_smiles.notna()]

The output was:

number of all smiles:  442
[23:13:03] WARNING: not removing hydrogen atom without neighbors
[23:13:03] WARNING: not removing hydrogen atom without neighbors
[23:13:03] WARNING: not removing hydrogen atom without neighbors
[23:13:03] WARNING: not removing hydrogen atom without neighbors
[23:13:03] WARNING: not removing hydrogen atom without neighbors
[23:13:03] WARNING: not removing hydrogen atom without neighbors
number of successfully processed smiles:  442
[23:13:03] WARNING: not removing hydrogen atom without neighbors
[23:13:03] WARNING: not removing hydrogen atom without neighbors
[23:13:03] WARNING: not removing hydrogen atom without neighbors
[23:13:03] WARNING: not removing hydrogen atom without neighbors

When Calculating the molecular feature, this was the output:

[CaH2]
[23:13:08] WARNING: not removing hydrogen atom without neighbors
[23:13:08] WARNING: not removing hydrogen atom without neighbors
[F-]
[23:13:12] WARNING: not removing hydrogen atom without neighbors
[23:13:12] WARNING: not removing hydrogen atom without neighbors
[I]
[23:13:16] WARNING: not removing hydrogen atom without neighbors
[23:13:16] WARNING: not removing hydrogen atom without neighbors
O
[Cl-].[K+]
[I-].[K+]
S
[23:13:29] WARNING: not removing hydrogen atom without neighbors
[23:13:29] WARNING: not removing hydrogen atom without neighbors
N.N.[Ag+].[F-]
[Cl-].[Na+]
[23:13:36] WARNING: not removing hydrogen atom without neighbors
[23:13:36] WARNING: not removing hydrogen atom without neighbors
feature dicts file saved as eml_canonical.pickle
9 compounds cannot be featured

RESULTS

Step 4: Predict Values

After running:

remain_pred_list = eval(model, remained_df)
remained_df['Predicted_values'] = remain_pred_list
remained_df

OUTPUT

The predicted PPB fraction values for various drugs are as shown above

Step 6: Identify Privileged Substructure for each molecule

The procedure was the same as the initial dataset. I was only able to load four compounds in this code without killing the kernel :laughing: I had to restart the kernel several times because it was dead and ended up reducing the values to match the initial four input compounds, then it worked. As a caution, I had to run each cell on its own instead of "Run all".

For these drugs, two had Privileged substructures while the other two did not have Privileged substructures:

Drug without Privileged Substructures

CC(=O)Nc1nnc(S(N)(=O)=O)s1
[]
Predicted PPB fraction: 0.5828322
Dectected Priviledged Substructures: []

eml_canonical2

Drug with Privileged Substructures

C[C@]12CC[C@H](O)CC1=CC[C@@H]1[C@@H]2CC[C@]2(C)C(c3cccnc3)=CC[C@@H]12
['*c1cccnc1']
Predicted PPB fraction: 0.97662824
Dectected Priviledged Substructures: ['*c1cccnc1']

eml_canonical1

MadeaRiggs commented 8 months ago

Task Five : Install and run Docker!

The model was downloaded using the command:

docker pull ersiliaos/eos22io

output: docker_output1.txt

To run the model:

docker run ersiliaos/eos22io

Afterwards, Docker crashed due to limited space and I have not been able to continue. I'm still trying to find a way to fix this

MadeaRiggs commented 8 months ago

Week Three

Task One : Suggest a new model and document it (1)

TITLE: Drug Combination Extraction

Model Description

For conditions like cancer, TB, malaria, and HIV, combination treatments have emerged as the gold standard of treatment. Finding efficient combination medicines when they are accessible in a given circumstance is difficult, nevertheless, due to the combinatorial set of multi-drug treatments that are currently available. Drug combination is advantageous because of less drug resistance and when used in low dosages.

Reasons why Ersilia should adopt the model

Effective Drug Combination Discovery: One of Ersilia's main goals is to treat serious illnesses like HIV, TB, cancer, and malaria. Combination therapies are frequently very successful in addressing these conditions. In environments with limited resources, this strategy can help with the effective discovery of medication combinations.
Better Access to Efficient Treatments: The model assists in determining and maximizing the use of combination medications. This may result in more affordable and easily accessible treatment alternatives, particularly in settings with limited resources where access to cutting-edge medical treatments may be restricted.
Overcoming Combinatorial Complexity: Because there are so many possible combinations, it might be difficult to find successful medicine combinations. In order to navigate this complexity and make it simpler to find viable combinations, this model makes use of machine learning and data analysis.
Decreased Drug Resistance: Combination treatments have a reputation for being able to slow down the emergence of drug resistance. The capacity of this model to pinpoint drug combinations that are less likely to result in resistance can be advantageous to Ersilia, increasing the efficacy of therapies.
Optimal Dosage: The model can assist in figuring out how much is best for combination treatments. In situations when resources for treating side effects are scarce, this is especially important for guaranteeing that therapies are effective while limiting side effects and the risk of toxicity.
Scientific Progress: Ersilia advances the field of drug combination research by implementing this methodology. This is in line with Ersilia's mission to use data science and technology to further investigate and solve urgent global health issues.

How the model works

Create a Conda environment, activate it and install the required dependencies
The dataset has already been split into data/final_train_set.jsonl and data/final_test_set.jsonl
You can run the Base line model found at the HuggingFace website which details how to dowload the model, associated code and how to load the model.
Model Training: To train the model, run the Python script train.py which imports the pubmedbert_2021 model(BERT-based model fine-tuned for biomedical text data from PubMed) and trains it through 10 epochs, learning rate of 2e-4 (0.0002) and batch size of 18.
- It uses the data/final_train_set.jsonl dataset to train
- It sets the context-window size to 400 and maximum sequence length to 512 to either split or truncate the tokens whose length is greater than 512
- It then maps the labels to indices for the model to understand the target labels
- It sets the random seed to 2022 to ensure that the training process is consistent and specifies the strategy for unfreezing the model layers to decide which features of the taught model will be improved during training.
Model Testing and Validation: To test and evaluate the model, run the Python script test_only.py which specifies the location of the checkpoint directory that contains the trained model and its related files.
- It uses the data/final_test_set.jsonl dataset for testing, setting the batch size to 100 and random seed to 2022.
- The output is saved in the checkpoints_pubmedbert_2022/outputs/ directory

Characteristics:

Input Data: Sentences, Drug spans, Wider context in a sentence
Task: Drug combination Extraction
Output Data: 3-way Labeled relations (each relation is a group of drugs). Labels used per span group:
- POS_COMB (Positive Combination)- The combined drugs achieved a positive outcome
- OTHER_COMB (Other Combination)- The combined drugs did not achieve anything
- NO_COMB (Not Combination)- The drugs are not a combination.

References

Link to Publication:

Source Code and Dependencies: https://github.com/allenai/drug-combo-extraction

Contributors:

As per the GitHub Repo Reference requirement :

@inproceedings{Tiktinsky2022ADF, title = "A Dataset for N-ary Relation Extraction of Drug Combinations", author = "Tiktinsky, Aryeh and Viswanathan, Vijay and Niezni, Danna and Meron Azagury, Dana and Shamay, Yosi and Taub-Tabib, Hillel and Hope, Tom and Goldberg, Yoav", booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies", month = jul, year = "2022", address = "Seattle, United States", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2022.naacl-main.233", doi = "10.18653/v1/2022.naacl-main.233", pages = "3190--3203", }

MadeaRiggs commented 8 months ago

Task Two : Suggest a new model and document it (2)

TITLE: Chemprop_Abaucin

Model Description

Acinetobacter baumannii is a nosocomial Gram-negative pathogen that often displays multidrug resistance. Through standard screening techniques, it has been difficult to find new antibiotics to treat A. baumannii. Luckily, rapid chemical space research is made possible by machine learning techniques, which raises the likelihood of finding new antibacterial compounds. They screened 7,500 compounds to find those that prevented A. baumannii from growing in vitro. Using this growth inhibition dataset, they trained a neural network and used it to make in silico predictions for structurally new compounds that exhibit anti-A. baumannii action. Using this method, they were able to identify the antibacterial chemical Abaucin, which has a restricted range of action against A. baumannii. Subsequent research demonstrated that Abaucin affects lipoprotein trafficking by means of LolE- a protein related to the lipoprotein trafficking pathway, which is responsible for the sorting and transport of lipoproteins to the outer membrane of Gram-negative bacteria. More information about Gram-Negative Lipoprotein Trafficking can be found here. Additionally, Abaucin could manage an infection caused by A. baumannii in a mouse wound model.

Reasons why Ersilia should adopt the model

Targeted Antibiotics: It was found that Abacin disrupts lipoprotein trafficking by using LolE. By focusing on specific areas, this strategy can lower the likelihood of widespread antibiotic resistance and offer better therapeutic alternatives. Similar tactics can be used by Ersilia to create antibiotics that target microorganisms related to neglected diseases particularly.
New Antibiotic Discovery: Abacin is a newly discovered antibiotic that exhibits narrow-spectrum action against the difficult Gram-negative infection Acinetobacter baumannii. By incorporating this research, Ersilia can contribute to the discovery of additional antibiotics for combating infectious diseases in low-resourced regions.
Reduced Cost and Resource Requirements: When compared to conventional approaches, machine learning-based drug development and screening can be more affordable and demand less resources. This benefit is consistent with Ersilia's goal of assisting institutions in underdeveloped nations since it permits significant research to be conducted even with constrained funding and resources.

How the model works

To understand how the model works, there is a need to know how Chemprop works, as the model is based on it.

Chemprop

It is a direct message passing neural network that predicts the likelihood of a given molecule inhibiting the growth of a specific bacteria. MPNNs aggregate local chemical features iteratively in order to predict properties. It works by using a direct bond-to-bond-based message-passing approach. It iteratively aggregates the features of every single individual atom and bond. For example, atom 2 can have details about the structures of atom 1, atom 3 and atom 4 which form the vector representation of atom 2. In summary, It traverses around the molecule, creates vector representations, and passes messages from atom to atom by traveling across each of the bonds.

Steps for this model

Install the Chemprop package and its dependencies using these instructions
To train a new model with different data, replace data paths as needed.
To use the model to predict antimicrobial activities on other datasets, these instructions are provided to perform these predictions
Details are given for training a Chemprop model using specific parameters, including features generation and model configuration.
Activity scores produced by the model are used as the primary method for prioritizing molecules.
The model is used to make predictions on the Broad Drug Repurposing Hub dataset, which is used as a prediction set.
Instructions for performing predictions on the Broad Drug Repurposing Hub dataset using the trained models.
Tanimoto similarity calculations are performed using Morgan fingerprints. The code returns the maximum similarity and the corresponding SMILES of the most similar molecule.
The data files may require additional manual processing, such as header removal or binarization, to be correctly formatted for the code.

Characteristics:

Input Data: SMILES strings and their properties
Task: Prediction
Output Data: Abaucin compound

References

Link to Publication:

Source Code and Dependencies: https://github.com/GaryLiu152/chemprop_abaucin/tree/main

Contributors: Gary Liu, Denise B. Catacutan, Khushi Rathod, Jody C. Mohammed, Meghan Fragis, Kenneth Rachwalski, Jakob Magolan, Brian K. Coombes & Jonathan M. Stokes

MadeaRiggs commented 8 months ago

Task Three : Suggest a new model and document it (3)

TITLE: Virtual screening of DrugBank database for hERG blockers using topological Laplacian-assisted AI models

Model Description

An essential function of the human {\it ether-a-go-go} (hERG) potassium channel (Kv11.1) is to mediate the cardiac action potential. This ion channel's blockage may result in long QT syndrome or a lethal disorder. A number of medications have been discontinued due to significant hERG-cardiotoxicity. In the first stages of drug discovery, it is imperative to evaluate the hERG blocking activity. The hERG-cardiotoxicity of compounds found in the DrugBank database is of special interest to them since several of these compounds have been licensed for use as medicinal treatments or have a strong potential for development into pharmaceuticals. In silico methods based on machine learning provide a quick and affordable way to virtually screen DrugBank molecules.

The binding efficacy of the DrugBank compounds on the hERG channel is quantitatively analyzed by means of regressors that they constructed after designing strong and accurate classifiers for blockers/non-blockers. Two natural language processing (NLP) techniques, autoencoder and transformer, are used to insert molecular sequences. Complementary three-dimensional (3D) molecule structures are embedded using two sophisticated mathematical techniques: algebraic graphs and topological Laplacians. Using their cutting-edge instruments, they found that 227 of the 8641 DrugBank compounds may be hERG blockers, indicating significant issues with drug safety. Their forecasts offer direction for additional experimental investigation into the hERG-cardiotoxicity of DrugBank drugs.

Reasons why Ersilia should adopt the model

Drug Safety Assessment: Severe cardiac problems can result from blocking the hERG potassium channel (Kv11.1), which is essential for modulating the cardiac action potential. Evaluating the safety and effectiveness of medications is for Ersilia's work, particularly in areas with limited resources. By concentrating on hERG-cardiotoxicity prediction, Ersilia can improve their ability to assess drug safety by implementing this approach.
Drugs discontinued: A number of drugs have been pulled off the market because of serious hERG-cardiotoxicity, which emphasizes how crucial it is to identify any problems early on. By utilizing this approach, Ersilia can pinpoint drugs that provide a danger of hERG blocking, so impeding the creation of cardiotoxic medications and advocating for more secure substitutes.
DrugBank Database: Ersilia is interested in the DrugBank database since it contains substances that have a great chance of being developed into pharmaceuticals. They can efficiently screen the drugs in DrugBank with this model by identifying any hERG blockers early in the drug discovery process. This can encourage the creation of safer drugs while saving time and money.
In Silico Screening: The machine learning-based in silico techniques used in this model offer a rapid and economical means of virtually screening compounds. Ersilia frequently works in contexts with limited resources, thus this strategy fits with their goal of giving researchers access to data science tools. It enables a quick first evaluation of possible therapeutic candidates.

How the model works

The DrugBank database, which includes details on a variety of chemicals, including FDA-approved medications and experimental pharmaceuticals, provides the training and assessment data for the model. The model is trained and evaluated on many datasets with "yes" or "no" labels (indicating hERG blockers or non-blockers). Interestingly, the model captures a broad range of molecular structures and chemical diversities by incorporating data from many sources.

Feature Engineering: To extract pertinent information from chemical structures, the model makes use of sophisticated feature engineering techniques. Sequence-based embeddings and 3D structure-based embeddings are two different forms of embeddings that are integrated.

Sequence-Based Embeddings: Natural language processing (NLP) techniques are used to build these embeddings. Molecular sequences are specifically processed using two techniques, Transformer and autoencoder, which enable the model to extract crucial information from the sequences.
3D Structure-Based Embeddings: Using sophisticated mathematical techniques, such as algebraic graphs and topological Laplacians, two 3D embeddings are produced. These embeddings record three-dimensional information about the stereochemical and physical characteristics of molecules.

Machine Learning Algorithms: The model uses machine learning algorithms for both classification and regression tasks.

Classification: Initially, the model employs a binary classification model to predict potential hERG blockers among DrugBank compounds. This classification model categorizes compounds as either hERG blockers or non-blockers based on the information extracted from the embeddings.
Regression: Subsequently, a regression model is used to quantitatively analyze the binding affinity of predicted hERG blockers. This step provides a measure of the strength of the interaction between compounds and the hERG channel.

Model Ensemble: The model combines the results from multiple machine learning models. It integrates the predictions from six classification models, each utilizing different combinations of the feature embeddings and machine learning algorithms. Consensus results are derived by averaging the probabilities generated by these models.

Model Evaluation: The model's performance is evaluated on various datasets, including those with different origins. It is compared against other published models to assess its predictive capabilities.

Steps for this model

Ensure you have the CentOS Linux 7 Core
Clone the repository using Git and install its dependencies from here
Download and install the pretrained model under the downloaded hERG_prediction folder
Download the trained classification model for predicting hERG blockers/non-blockers
Generate features and carry out predictions

Characteristics:

Input Data: Labeled dataset with SMILES strings,
Task: Binary Classification, Prediction
Output Data: Yes or no (hERG blockers or non-blockers)

References

Link to Publication:

Source Code and Dependencies: https://github.com/WeilabMSU/hERG-prediction#virtual-screening-of-drugbank-database-for-herg-blockers-using-topological-laplacian-assisted-ai-models

Contributors: Hongsong Feng Guo-Wei Wei

GemmaTuron commented 8 months ago

Hello,

Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!