Closed MadeaRiggs closed 8 months ago
This is a tutorial for those who are using Windows OS (WSL).
1) First, start by installing WSL (if you don't have it) and install other prerequisites such as Docker and Git, as per this documentation: https://learn.microsoft.com/en-us/windows/wsl/install
2) Proceed with the prerequisite installation of Ersilia with this as the guide: https://ersilia.gitbook.io/ersilia-book/ersilia-model-hub/installation
3) You can choose to install based on the previous guide or the Ersilia GitHub README file. There are some slight differences, such as the Python version used, but the rest is the same: https://github.com/ersilia-os/ersilia
4) Serve the model, predict values, and then you can close the model: https://ersilia.gitbook.io/ersilia-book/ersilia-model-hub/antibiotic-activity-prediction
You may encounter an error when running:
ersilia run -i my_molecules.csv -o my_predictions.csv
The solution has been provided by LeilaYesufu here: https://github.com/ersilia-os/ersilia/issues/821#issuecomment-1744581901
Ersilia is an organization with a goal that is both audacious and exciting. My own journey and objectives in the field of data are in line with Ersilia's unwavering dedication to using data to improve society.
What draws me to Ersilia is not merely its mission but the spirit of adventure that permeates every facet of its work. It's like setting off on an exciting data trip where every dataset carries the possibility of revealing hidden insights. On this journey, data turns from being merely information into a catalyst for change. My goals for development are well matched with Outreachy's commitment to diversity and inclusivity and Ersilia's passion for mentoring. I'm ready to learn everything I can, get some real-world experience, and bring new ideas to your dynamic team.
Growing up, I was interested in medicine, especially forensics. I loved watching crime movies (I still do), to the point that I decided I would one day work in a federal institution like the CIA, but my love for computers was greater than forensics. It is thrilling to investigate and learn more about the human body, which is among the reasons why crime is among my favorite movie genres. I'm hoping this chance at Ersilia will give me an opportunity to merge my interests in medicine and technology in a way that can help people, especially here in the Global South, to aid drug discovery among other causes.
Beyond formulas and code, Ersilia presents itself as a close-knit group of passionate people bound together by the desire to have a significant effect. It's reassuring to see how supportive and knowledgeable fellow members and mentors are, taking time from their busy schedules to help those who are faced with difficulties. This atmosphere of mutual support demonstrates the team's commitment to the mission. I am prepared to contribute my expertise and unflinching resolve to write new chapters in the amazing narrative of advancement that Ersilia creates every day.
Stepping outside of my comfort zone is scary but I'm enthusiastic about it. Although I've felt at ease as a data analyst, this chance at Ersilia catapults me into the fascinating field of data science, which covers everything from modeling to implementation. This year, I set a clear goal for myself: to develop thoroughly in terms of both skills and knowledge. I'm convinced that Ersilia will serve as my compass as I strive for excellence in the domains of data science and biomedicine.
I'm hoping that by the end of the internship, I will have sharpened my skills in Python, ML, and Git, and be able to make changes, especially in my country, to aid the biomedicine sector. The partnership between Ersilia and Outreachy offers a special chance to flourish in a setting that prioritizes diversity, inclusivity, and social impact. This makes me yearn for the opportunity to add to Ersilia's outstanding work and gain knowledge from the industry's top experts.
You have my most profound gratitude for giving me this priceless opportunity, and I look forward to the rewarding road of growth, discovery, and meaningful data science that lies ahead.
Hi, please don't change the codebase. Try reinstalling the environment as suggested here. https://ersilia-outreachy-w23.slack.com/archives/C05V51PS6FJ/p1696451488224489
in response to https://github.com/ersilia-os/ersilia/issues/843#issuecomment-1747517316
Hi @leilayesufu @MadeaRiggs, thank you! Uninstall isaura if you have installed it. Please don't change the code; this error occurs when isaura is installed.
Hello @carcablop and @leilayesufu, I appreciate your feedback. I ended up uninstalling and then reinstalling Miniconda, which worked, but I hit a problem when running these commands from the README file: 4) Generate a few (5) example molecules, to be used as input. The example command will generate the adequate input for the model in use:
ersilia example retrosynthetic-accessibility -n 5 -f my_molecules.csv
5) Then, serve your model:
ersilia serve retrosynthetic-accessibility
6) And run the model:
ersilia run -i my_molecules.csv -o my_predictions.csv
I get a TypeError:
File "/mnt/c/Users/Kami/ersilia/ersilia/io/readers/file.py", line 321, in read_input_columns
if len(h) == 1:
TypeError: object of type 'NoneType' has no len()
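The failure mode in this traceback is a common Python pattern: a header-parsing helper returns None instead of a list, and the caller immediately calls len() on it. A minimal, hypothetical sketch of the pattern and a defensive guard (not Ersilia's actual code):

```python
def column_count(header):
    # Hypothetical sketch: a header parser may return None when it cannot
    # detect a header row; calling len(None) raises exactly the
    # "object of type 'NoneType' has no len()" TypeError seen above.
    if header is None:
        return 0  # treat a missing header as zero columns instead of crashing
    return len(header)

print(column_count(["smiles"]))  # → 1
print(column_count(None))        # → 0
```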
But other commands in the Ersilia book work well and predict values for step 6:
ersilia api run -i "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"
{
    "input": {
        "key": "NQQBNZBOOHHVQP-UHFFFAOYSA-N",
        "input": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]",
        "text": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"
    },
    "output": {
        "outcome": [
            0.9924924
        ]
    }
}
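As a side note, output in this JSON shape is easy to consume programmatically. A small stdlib-only sketch that pulls the prediction out of a response shaped like the one above:

```python
import json

# A response shaped like the one returned by `ersilia api run` above.
response = '''{
  "input": {"key": "NQQBNZBOOHHVQP-UHFFFAOYSA-N",
            "input": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]",
            "text": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"},
  "output": {"outcome": [0.9924924]}
}'''

data = json.loads(response)
# The prediction is the first element of the "outcome" list.
outcome = data["output"]["outcome"][0]
print(outcome)  # → 0.9924924
```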
So I don't understand what the problem could be
Hi @MadeaRiggs thank you for the updates. As for your question, it could be perhaps that this output you have shared for a single input is pre-calculated and coming from the isaura lake. That is, the model wasn't actually queried to generate this output. Thanks to @carcablop we have come to identify that isaura is causing some issues with ersilia currently, and while we try to figure out why that is happening, we ask you to uninstall isaura if you have it installed and then run the prediction commands. Hope this helps.
Hello @DhanshreeA, thank you for your feedback. I followed what @carcablop said and uninstalled and reinstalled the conda environment, and this time during the reinstallation I did not use the Isaura lake, but I got the above error. So I think something else could be the problem, though I'm not sure if it is on my end.
Hi @MadeaRiggs. Have you modified the ersilia code base? I'm not sure if you cloned the ersilia repository again. Could you list the packages you have installed in your ersilia environment? Could you share the complete output logs when you run the ersilia commands to run the example model? Please also provide information about your system (Python version, conda version, etc.).
Thanks so much!
Hello @DhanshreeA, my most sincere apologies for the late reply. I had electricity issues and didn't have enough time to recreate the error until today, as it is now stable. I have begun the reinstallation process again and will give an update when it is done.
Hello @DhanshreeA , I have recreated the process again and I'm still getting the same error when running:
ersilia run -i my_molecules.csv -o my_predictions.csv
Here's the log file predictions_error.txt
When I ran the command with the verbose flag, I saw that it required the Isaura lake to be installed: verbosed_prediction_error.txt
But this code ran successfully:
ersilia api run -i "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"
and
ersilia api run -i "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"
Here's the log file: ersilia_output.txt
Installed dependencies : dependencies.txt
conda version: 23.9.0, Python 3.10
Thank you for your exhaustive efforts with this @MadeaRiggs. We will look into why this behavior is so flaky across inputs (whether Isaura is installed or not)
Week Two
Task One : Select a model from the suggested list
I have began tasks of week two and I decided to go with this model:
Plasma Protein Binding (IDL-PPBopt)
It is found_ here: https://github.com/Louchaofeng/IDL-PPBopt
This is a model that predicts the plasma protein binding (PPB) of chemical compounds to human plasma proteins. PPB is a crucial pharmacokinetic factor for medications since it affects how well they are absorbed and distributed in the body. I chose this model because of its advantages:
* It helps to improve pharmacokinetic understanding
* It optimizes drug candidates for increased safety and efficacy
* It contributes to the effectiveness and success of drug development

This may significantly affect the pharmaceutical sector by lowering costs, expediting the drug discovery process, and enhancing patient outcomes, especially in Third World countries.
I created a new environment to install the dependencies. The dependencies required are listed in the README file which are:
* Python ==3.7
* Pytorch 1.5.0, which is installed by
pip install torch==1.5.0
* Openbabel 2.4.1 which is installed by running
pip install openbabel
By default it will install version 2.4.1, but it is good to confirm. The rest of the dependencies were installed using pip install:
* Rdkit
* Scikit-learn
* Scipy
* Cairosvg
Task Two : Install the model in your system
I cloned the repository:
git clone https://github.com/Louchaofeng/IDL-PPBopt
cd IDL-PPBopt
Here you'll find the saved model, input CSV file, Python notebook, and other folders.
Task Three : Run predictions for the EML
I ran the notebook in VS Code. This is after setting up the required extensions, which are:
@MadeaRiggs any further updates with this?
Hello @DhanshreeA , yes I have some updates and do require some assistance with some information
Running the notebook, I noticed other dependencies were required which were installed using conda in the environment:
Continuing to run the notebook, in the first cell I kept getting this error, which actually took a lot of time to fix:
running cells with plasmaprotein requires the ipykernel to be installed or updated
In this case, plasmaprotein is the name of my conda environment. I tried several solutions, but the ones that worked were:
1) VS Code suggestion- executed in the WSL or VS Code terminal after activating the environment
conda install -n plasmaprotein ipykernel --update-deps --force-reinstall
This ran successfully, but when running the cell it still gave the same error. That led me to this:
python -m ipykernel install --user --name plasmaprotein
This resulted in an error:
ImportError: cannot import name 'secure_write' from 'jupyter_core.paths' (/home/ubuntu/miniconda3/envs/plasmaprotein/lib/python3.7/site-packages/jupyter_core/paths.py)
To solve this, I found the solution here: https://github.com/jupyter/notebook/issues/5014#issuecomment-547762322
pip install --upgrade jupyter_client
Then run "python -m ipykernel install" command again and it runs successfully.
Reload VS Code, reactivate the environment, or restart your machine if need be (which I did). After all this, the first cell finally ran successfully.
2) CUDA
This cell raised an error:
batch_size = 64
p_dropout = 0.1
fingerprint_dim = 200
weight_decay = 5 # also known as l2_regularization_lambda
learning_rate = 2.5
output_units_num = 1 # for regression model
radius = 2
T = 2
x_atom, x_bonds, x_atom_index, x_bond_index, x_mask, smiles_to_rdkit_list = get_smiles_array([canonical_smiles_list[0]],feature_dicts)
num_atom_features = x_atom.shape[-1]
num_bond_features = x_bonds.shape[-1]
loss_function = nn.MSELoss()
model = Fingerprint(radius, T, num_atom_features, num_bond_features,
fingerprint_dim, output_units_num, p_dropout)
model.cuda()
best_model = torch.load('saved_models/model_ppb_3922_Tue_Dec_22_22-23-22_2020_'+'54'+'.pt', map_location='cpu')
best_model_dict = best_model.state_dict()
best_model_wts = copy.deepcopy(best_model_dict)
model.load_state_dict(best_model_wts)
(best_model.align[0].weight == model.align[0].weight).all()
model_for_viz = Fingerprint_viz(radius, T, num_atom_features, num_bond_features,
fingerprint_dim, output_units_num, p_dropout)
model_for_viz.cuda()
model_for_viz.load_state_dict(best_model_wts)
(best_model.align[0].weight == model_for_viz.align[0].weight).all()
ERROR:
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
I fixed it by commenting out the following lines of code:
model.cuda()
model_for_viz.cuda()
Going through the code, I noticed that most "torch." calls were followed by ".cuda", and since my machine does not have a GPU, I had to remove the ".cuda" calls in the notebook and the AttentiveFP folder, to be left with, for example:
torch.FloatTensor()
And in this line of code, I specified for it to be mapped to the CPU:
best_model = torch.load('saved_models/model_ppb_3922_Tue_Dec_22_22-23-22_2020_'+'54'+'.pt', map_location='cpu')
The last cell had a syntax error, an extra parenthesis in the f.write call, but it was fixed:
f.write(str(r[i]['SA']) + '\t' + str(r[i]['Non_SAs']) + '\t' + str(r[i]['score']) + '\t' + str(r[i]['RES']) + '\t' + str(r[i]['CES']) + '\t' + str(r[i]['ZES']) + '\t' + str(r[i]['NTS']) + '\n')
After all that, the rest of the notebook ran successfully and these were the results:
MODEL SUMMARY
The model process was as follows:
RESULTS
After running:
remain_pred_list = eval(model, remained_df)
remained_df['Predicted_values'] = remain_pred_list
remained_df
OUTPUT
cano_smiles | Predicted_values
-- | --
O=C(O)CC(c1ccccc1)n1ccc2cc(OCCc3ccc4c(n3)NCCC4... | 0.970726
CN(C)Cc1cncc(C(CC(=O)O)n2ccc3cc(OCCc4ccc5c(n4)... | 0.850634
CC(C)N1CN(C(c2ccccc2)c2ccccc2)n2ccc(=O)c(O)c2C1=O | 0.946909
COCCN1CN(C(c2ccccc2)c2ccccc2)n2ccc(=O)c(O)c2C1=O | 0.923631

Based on my research to fathom how the model works, I stand to be corrected, but it seems that these values are the predicted PPB fractions of the compounds.
The model iterated through various substructures of the compounds. The processes are:
1) Generate substructure patterns (substructure fragments) from the SMILES notation of the chemical molecule using RDKit's Chem.MolFromSmarts() function.
2) The code uses the chemical compound's SMILES notation to generate an RDKit molecule object with Chem.MolFromSmiles(), then identifies the compound's atom indices that correspond to the substructure pattern.
3) Calculate p-values for each substructure using the Mann-Whitney U test. Two sets of values are subjected to the test:
CRITERIA
Compound without Privileged Substructures
O=C(O)CC(c1ccccc1)n1ccc2cc(OCCc3ccc4c(n3)NCCC4)ccc21
[]
Predicted PPB fraction: 0.9707257
Detected Privileged Substructures: []
Compound with Privileged Substructures
CN(C)Cc1cncc(C(CC(=O)O)n2ccc3cc(OCCc4ccc5c(n4)NCCC5)ccc32)c1
['*Cc1cncc(C(C*)*)c1']
Predicted PPB fraction: 0.8506337
Detected Privileged Substructures: ['*Cc1cncc(C(C*)*)c1']
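The statistical step (3) above does not need any chemistry libraries to understand. A minimal, hypothetical Mann-Whitney U computation over two groups of predicted PPB values (compounds containing a given substructure vs. those that do not), assuming the notebook uses the standard U statistic:

```python
def mann_whitney_u(group_a, group_b):
    # Count, over all pairs (a, b), how often a > b, with ties counting 0.5.
    # This pair count is the U statistic for group_a; libraries like SciPy
    # then derive a p-value from it, but the statistic itself is this simple.
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Toy example: PPB predictions with vs. without a substructure (made-up numbers)
with_sub = [0.97, 0.95, 0.92]
without_sub = [0.60, 0.55]
print(mann_whitney_u(with_sub, without_sub))  # → 6.0 (all 3*2 pairs favor group_a)
```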
The final output, which was "Results.smi", had the following values:
SA_Fragment NAS Score RES CES ZES NTS
CN1CC(*)C1 *c3ccccc3 0.14786313364055315 0 23 41 131
CN1CC(*)C1 *C(=O)NC2C(=O)N3C(=CCSC23)C(=O)O -0.1598742305685968 57 0 0 0
CN1CC(*)C1 *C3CN(CCO)C3 -0.4474728208647905 106 0 0 0
CN1CC(*)C1 CC=C* -0.29621884951206984 265 7 4 12
CN1CC(*)C1 *C(=O)NC1C(=O)N2C(=C(C)CSC12)C(=O)O -0.15223166491043205 55 0 0 0
CN1CC(*)C1 *C(=O)C(O)* 0.21906800000000004 16 0 4 34
CN1CC(*)C1 *CSC* -0.3888930158730158 278 14 5 0
CN1CC(*)C1 *C1=C(C(=O)O)N2C(=O)C(NC(=O)C*)C2SC1 -0.1627462264150944 73 0 0 0
For example, this compound:
CN1CC(*)C1
These are the values:
I didn't fully understand the role of the "ppb_3922.csv" file and this code, but I'm seeking more information and request @DhanshreeA to kindly assist. Here are the results after reading the ppb_3922.csv file:
CN1CC(*)C1 matches 125 compounds
Totally find 5062 fragments
For CN1CC(*)C1 totally find 8 second-level substructures!
I got the "eml_canonical.csv" dataset from Ersilia's Essential Medicines List. I made a copy of the "IDL-PPBopt.ipynb" notebook to run this dataset as the input file.
ERRORS
When running this code,
feature_dicts = save_smiles_dicts(smilesList,filename)
remained_df = smiles_tasks_df[smiles_tasks_df["cano_smiles"].isin(feature_dicts['smiles_to_atom_mask'].keys())]
uncovered_df = smiles_tasks_df.drop(remained_df.index)
print(str(len(uncovered_df.cano_smiles))+' compounds cannot be featured')
remained_df = remained_df.reset_index(drop=True)
I got this error:
TypeError: No registered converter was able to produce a C++ rvalue of type std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > from this Python object of type float
Upon further investigation, I noticed the cause in the output of the previous cell.
When loading the dataset using this code,
task_name = 'ppb'
tasks = ['endpoint']
raw_filename = "eml_canonical.csv"
feature_filename = raw_filename.replace('.csv','.pickle')
filename = raw_filename.replace('.csv','')
prefix_filename = raw_filename.split('/')[-1].replace('.csv','')
smiles_tasks_df = pd.read_csv(raw_filename)
smilesList = smiles_tasks_df.cano_smiles.values
print("number of all smiles: ",len(smilesList))
atom_num_dist = []
remained_smiles = []
canonical_smiles_list = []
for smiles in smilesList:
try:
mol = Chem.MolFromSmiles(smiles)
atom_num_dist.append(len(mol.GetAtoms()))
remained_smiles.append(smiles)
canonical_smiles_list.append(Chem.MolToSmiles(Chem.MolFromSmiles(smiles), isomericSmiles=True))
except:
print(smiles)
pass
print("number of successfully processed smiles: ", len(remained_smiles))
smiles_tasks_df = smiles_tasks_df[smiles_tasks_df["cano_smiles"].isin(remained_smiles)]
# print(smiles_tasks_df)
smiles_tasks_df['cano_smiles'] = canonical_smiles_list
assert canonical_smiles_list[0]==Chem.MolToSmiles(Chem.MolFromSmiles(smiles_tasks_df['cano_smiles'][0]), isomericSmiles=True)
The output was
number of all smiles: 443
[18:57:41] WARNING: not removing hydrogen atom without neighbors
[18:57:41] WARNING: not removing hydrogen atom without neighbors
[18:57:41] WARNING: not removing hydrogen atom without neighbors
[18:57:41] WARNING: not removing hydrogen atom without neighbors
[18:57:41] WARNING: not removing hydrogen atom without neighbors
[18:57:41] WARNING: not removing hydrogen atom without neighbors
nan
number of successfully processed smiles: 442
[18:57:41] WARNING: not removing hydrogen atom without neighbors
[18:57:41] WARNING: not removing hydrogen atom without neighbors
[18:57:41] WARNING: not removing hydrogen atom without neighbors
[18:57:41] WARNING: not removing hydrogen atom without neighbors
There was a nan value in the dataset. To solve this issue, I found this solution referenced here: https://github.com/rdkit/rdkit/issues/2994#issuecomment-1026085560
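The root cause is that pandas reads an empty CSV cell as float('nan'), so RDKit's C++ layer receives a float where it expects a string, hence the "No registered converter" TypeError. The same guard can be written without pandas (a hypothetical helper, not the notebook's code):

```python
def keep_valid_smiles(values):
    # pandas represents empty CSV cells as float('nan'); passing such a
    # float to a function that expects a SMILES string triggers the
    # converter TypeError seen above. Keep only real strings.
    return [v for v in values if isinstance(v, str)]

raw = ["CCO", float("nan"), "CC(=O)O"]
print(keep_valid_smiles(raw))  # → ['CCO', 'CC(=O)O']
```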
I removed the nan value in Step 1 code cell using:
smiles_tasks_df = smiles_tasks_df[smiles_tasks_df.cano_smiles.notna()]
The output was:
number of all smiles: 442
[23:13:03] WARNING: not removing hydrogen atom without neighbors
[23:13:03] WARNING: not removing hydrogen atom without neighbors
[23:13:03] WARNING: not removing hydrogen atom without neighbors
[23:13:03] WARNING: not removing hydrogen atom without neighbors
[23:13:03] WARNING: not removing hydrogen atom without neighbors
[23:13:03] WARNING: not removing hydrogen atom without neighbors
number of successfully processed smiles: 442
[23:13:03] WARNING: not removing hydrogen atom without neighbors
[23:13:03] WARNING: not removing hydrogen atom without neighbors
[23:13:03] WARNING: not removing hydrogen atom without neighbors
[23:13:03] WARNING: not removing hydrogen atom without neighbors
When calculating the molecular features, this was the output:
[CaH2]
[23:13:08] WARNING: not removing hydrogen atom without neighbors
[23:13:08] WARNING: not removing hydrogen atom without neighbors
[F-]
[23:13:12] WARNING: not removing hydrogen atom without neighbors
[23:13:12] WARNING: not removing hydrogen atom without neighbors
[I]
[23:13:16] WARNING: not removing hydrogen atom without neighbors
[23:13:16] WARNING: not removing hydrogen atom without neighbors
O
[Cl-].[K+]
[I-].[K+]
S
[23:13:29] WARNING: not removing hydrogen atom without neighbors
[23:13:29] WARNING: not removing hydrogen atom without neighbors
N.N.[Ag+].[F-]
[Cl-].[Na+]
[23:13:36] WARNING: not removing hydrogen atom without neighbors
[23:13:36] WARNING: not removing hydrogen atom without neighbors
feature dicts file saved as eml_canonical.pickle
9 compounds cannot be featured
RESULTS
After running:
remain_pred_list = eval(model, remained_df)
remained_df['Predicted_values'] = remain_pred_list
remained_df
OUTPUT
drugs | smiles | cano_smiles | Predicted_values
-- | -- | -- | --
abacavir | Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1 | Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1 | 0.477624
abiraterone | C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(... | C[C@]12CC[C@H](O)CC1=CC[C@@H]1[C@@H]2CC[C@]2(C... | 0.976628
acetazolamide | CC(=O)Nc1sc(nn1)[S](N)(=O)=O | CC(=O)Nc1nnc(S(N)(=O)=O)s1 | 0.582832
acetic acid | CC(O)=O | CC(=O)O | 0.071088
acetylcysteine | CC(=O)N[C@@H](CS)C(O)=O | CC(=O)N[C@@H](CS)C(=O)O | 0.650531

The predicted PPB fraction values for various drugs are as shown above.
The procedure was the same as for the initial dataset. I was only able to load four compounds in this code without killing the kernel :laughing: I had to restart the kernel several times because it died, and I ended up reducing the values to match the initial four input compounds, which then worked. As a caution, I had to run each cell on its own instead of using "Run all".
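One generic way to keep the kernel alive is to feed the model in fixed-size batches rather than all compounds at once. A small sketch (hypothetical helper, not the notebook's code):

```python
def chunks(items, size):
    # Yield successive fixed-size batches so only one batch of compounds
    # is held in memory (or sent to the model) at a time.
    for i in range(0, len(items), size):
        yield items[i:i + size]

smiles = ["C", "CC", "CCC", "CCCC", "CCCCC"]
print(list(chunks(smiles, 2)))  # → [['C', 'CC'], ['CCC', 'CCCC'], ['CCCCC']]
```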
For these drugs, two had Privileged substructures while the other two did not have Privileged substructures:
Drug without Privileged Substructures
CC(=O)Nc1nnc(S(N)(=O)=O)s1
[]
Predicted PPB fraction: 0.5828322
Detected Privileged Substructures: []
Drug with Privileged Substructures
C[C@]12CC[C@H](O)CC1=CC[C@@H]1[C@@H]2CC[C@]2(C)C(c3cccnc3)=CC[C@@H]12
['*c1cccnc1']
Predicted PPB fraction: 0.97662824
Detected Privileged Substructures: ['*c1cccnc1']
The model was downloaded using the command:
docker pull ersiliaos/eos22io
output: docker_output1.txt
To run the model:
docker run ersiliaos/eos22io
Afterwards, Docker crashed due to limited space and I have not been able to continue. I'm still trying to find a way to fix this.
For conditions like cancer, TB, malaria, and HIV, combination treatments have emerged as the gold standard of care. Nevertheless, finding efficient combination medicines in a given circumstance is difficult due to the combinatorially large set of multi-drug treatments available. Drug combinations are advantageous because they reduce drug resistance and can be used at low dosages.
Effective Drug Combination Discovery: One of Ersilia's main goals is to treat serious illnesses like HIV, TB, cancer, and malaria. Combination therapies are frequently very successful in addressing these conditions. In environments with limited resources, this strategy can help with the effective discovery of medication combinations.
Better Access to Efficient Treatments: The model assists in determining and maximizing the use of combination medications. This may result in more affordable and easily accessible treatment alternatives, particularly in settings with limited resources where access to cutting-edge medical treatments may be restricted.
Overcoming Combinatorial Complexity: Because there are so many possible combinations, it might be difficult to find successful medicine combinations. In order to navigate this complexity and make it simpler to find viable combinations, this model makes use of machine learning and data analysis.
Decreased Drug Resistance: Combination treatments have a reputation for being able to slow down the emergence of drug resistance. The capacity of this model to pinpoint drug combinations that are less likely to result in resistance can be advantageous to Ersilia, increasing the efficacy of therapies.
Optimal Dosage: The model can assist in figuring out how much is best for combination treatments. In situations when resources for treating side effects are scarce, this is especially important for guaranteeing that therapies are effective while limiting side effects and the risk of toxicity.
Scientific Progress: Ersilia advances the field of drug combination research by implementing this methodology. This is in line with Ersilia's mission to use data science and technology to further investigate and solve urgent global health issues.
Create a Conda environment, activate it, and install the required dependencies.
The dataset has already been split into data/final_train_set.jsonl and data/final_test_set.jsonl.
You can run the baseline model found on the HuggingFace website, which details how to download the model and associated code, and how to load the model.
Model Training: To train the model, run the Python script train.py, which imports the pubmedbert_2021 model (a BERT-based model fine-tuned for biomedical text data from PubMed) and trains it on the data/final_train_set.jsonl dataset for 10 epochs, with a learning rate of 2e-4 (0.0002) and a batch size of 18.
Model Testing and Validation: To test and evaluate the model, run the Python script test_only.py, which specifies the location of the checkpoint directory (checkpoints_pubmedbert_2022/outputs/) that contains the trained model and its related files, and uses the data/final_test_set.jsonl dataset for testing, setting the batch size to 100 and the random seed to 2022.
Link to Publication:
Source Code and Dependencies: https://github.com/allenai/drug-combo-extraction
Contributors:
As per the GitHub Repo Reference requirement :
@inproceedings{Tiktinsky2022ADF,
  title = "A Dataset for N-ary Relation Extraction of Drug Combinations",
  author = "Tiktinsky, Aryeh and Viswanathan, Vijay and Niezni, Danna and Meron Azagury, Dana and Shamay, Yosi and Taub-Tabib, Hillel and Hope, Tom and Goldberg, Yoav",
  booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
  month = jul,
  year = "2022",
  address = "Seattle, United States",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2022.naacl-main.233",
  doi = "10.18653/v1/2022.naacl-main.233",
  pages = "3190--3203",
}
Acinetobacter baumannii is a nosocomial Gram-negative pathogen that often displays multidrug resistance. Through standard screening techniques, it has been difficult to find new antibiotics to treat A. baumannii. Luckily, rapid chemical space research is made possible by machine learning techniques, which raises the likelihood of finding new antibacterial compounds. They screened 7,500 compounds to find those that prevented A. baumannii from growing in vitro. Using this growth inhibition dataset, they trained a neural network and used it to make in silico predictions for structurally new compounds that exhibit anti-A. baumannii action. Using this method, they were able to identify the antibacterial chemical Abaucin, which has a restricted range of action against A. baumannii. Subsequent research demonstrated that Abaucin affects lipoprotein trafficking by means of LolE- a protein related to the lipoprotein trafficking pathway, which is responsible for the sorting and transport of lipoproteins to the outer membrane of Gram-negative bacteria. More information about Gram-Negative Lipoprotein Trafficking can be found here. Additionally, Abaucin could manage an infection caused by A. baumannii in a mouse wound model.
Targeted Antibiotics: It was found that Abaucin disrupts lipoprotein trafficking via LolE. By focusing on specific targets, this strategy can lower the likelihood of widespread antibiotic resistance and offer better therapeutic alternatives. Similar tactics can be used by Ersilia to create antibiotics that particularly target microorganisms related to neglected diseases.
New Antibiotic Discovery: Abaucin is a newly discovered antibiotic that exhibits narrow-spectrum activity against the difficult Gram-negative pathogen Acinetobacter baumannii. By incorporating this research, Ersilia can contribute to the discovery of additional antibiotics for combating infectious diseases in low-resourced regions.
Reduced Cost and Resource Requirements: When compared to conventional approaches, machine learning-based drug development and screening can be more affordable and demand less resources. This benefit is consistent with Ersilia's goal of assisting institutions in underdeveloped nations since it permits significant research to be conducted even with constrained funding and resources.
To understand how the model works, there is a need to know how Chemprop works, as the model is based on it.
It is a directed message passing neural network (D-MPNN) that predicts the likelihood of a given molecule inhibiting the growth of a specific bacterium. MPNNs aggregate local chemical features iteratively in order to predict properties. It works by using a directed, bond-based message-passing approach, iteratively aggregating the features of every individual atom and bond. For example, atom 2 can hold details about the structures of atoms 1, 3, and 4, which form the vector representation of atom 2. In summary, it traverses the molecule, creates vector representations, and passes messages from atom to atom by traveling across each of the bonds.
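The neighbor-aggregation idea can be illustrated on a toy graph. This is a generic message-passing sketch with scalar features and sum aggregation, not Chemprop's actual D-MPNN:

```python
def message_pass(features, neighbors, radius):
    # features: one scalar feature per atom; neighbors: adjacency list.
    # Each round, every atom's representation becomes its own value plus
    # the sum of its neighbors' values, so after `radius` rounds each atom
    # has "seen" information from everything within `radius` bonds.
    for _ in range(radius):
        features = [features[i] + sum(features[j] for j in neighbors[i])
                    for i in range(len(features))]
    return features

# Toy 4-atom chain: atom0 - atom1 - atom2 - atom3
adj = [[1], [0, 2], [1, 3], [2]]
print(message_pass([1.0, 1.0, 1.0, 1.0], adj, 2))  # → [5.0, 8.0, 8.0, 5.0]
```

Note how the middle atoms end up with larger values: they have more neighbors within two bonds, which is exactly the local-environment information an MPNN encodes.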
Link to Publication:
Source Code and Dependencies: https://github.com/GaryLiu152/chemprop_abaucin/tree/main
Contributors: Gary Liu, Denise B. Catacutan, Khushi Rathod, Jody C. Mohammed, Meghan Fragis, Kenneth Rachwalski, Jakob Magolan, Brian K. Coombes & Jonathan M. Stokes
An essential function of the human ether-a-go-go-related (hERG) potassium channel (Kv11.1) is to mediate the cardiac action potential. Blockage of this ion channel may result in long QT syndrome, a potentially lethal disorder. A number of medications have been discontinued due to significant hERG-cardiotoxicity. In the first stages of drug discovery, it is imperative to evaluate hERG blocking activity. The hERG-cardiotoxicity of compounds found in the DrugBank database is of special interest since several of these compounds have been licensed for use as medicinal treatments or have a strong potential for development into pharmaceuticals. In silico methods based on machine learning provide a quick and affordable way to virtually screen DrugBank molecules.
After designing strong and accurate blocker/non-blocker classifiers, the authors built regressors to quantitatively analyze the binding efficacy of DrugBank compounds on the hERG channel. Two natural language processing (NLP) techniques, an autoencoder and a transformer, are used to embed molecular sequences. Complementary three-dimensional (3D) molecular structures are embedded using two sophisticated mathematical techniques: algebraic graphs and topological Laplacians. Using these tools, they found that 227 of the 8,641 DrugBank compounds may be hERG blockers, indicating significant drug-safety issues. Their predictions provide guidance for further experimental investigation into the hERG cardiotoxicity of DrugBank drugs.
Drug Safety Assessment: Severe cardiac problems can result from blocking the hERG potassium channel (Kv11.1), which is essential for modulating the cardiac action potential. Evaluating the safety and effectiveness of medications is central to Ersilia's work, particularly in areas with limited resources. By concentrating on hERG cardiotoxicity prediction, this approach can improve Ersilia's ability to assess drug safety.
Discontinued Drugs: A number of drugs have been pulled off the market because of serious hERG cardiotoxicity, which underscores how crucial it is to identify such problems early. With this approach, Ersilia can pinpoint drugs that pose a risk of hERG blocking, helping to prevent the development of cardiotoxic medications and promote safer substitutes.
DrugBank Database: Ersilia is interested in the DrugBank database since it contains compounds with a strong chance of being developed into pharmaceuticals. With this model, they can efficiently screen the DrugBank compounds and identify potential hERG blockers early in the drug discovery process, saving time and money while encouraging the development of safer drugs.
In Silico Screening: The machine learning-based in silico techniques used in this model offer a rapid and economical means of virtually screening compounds. Ersilia frequently works in resource-constrained contexts, so this strategy fits their goal of giving researchers access to data science tools; it enables a quick first evaluation of potential therapeutic candidates.
The DrugBank database, which includes details on a variety of chemicals, from FDA-approved medications to experimental pharmaceuticals, provides the training and assessment data for the model. The model is trained and evaluated on several datasets with "yes"/"no" labels (hERG blocker or non-blocker). Notably, by incorporating data from many sources, the model captures a broad range of molecular structures and chemical diversity.
Feature Engineering: To extract pertinent information from chemical structures, the model makes use of sophisticated feature engineering techniques. Sequence-based embeddings and 3D structure-based embeddings are two different forms of embeddings that are integrated.
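The integration of the two embedding types can be sketched as a simple concatenation of feature vectors. This is an illustrative assumption about the combination step, not the paper's exact pipeline; the vector sizes and variable names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder embeddings for one compound. In the paper, the sequence view
# comes from NLP models (autoencoder/transformer over molecular sequences)
# and the structural view from algebraic-graph / topological-Laplacian
# descriptors of the 3D structure; here both are random stand-ins.
seq_embedding = rng.standard_normal(512)     # sequence-based embedding
struct_embedding = rng.standard_normal(200)  # 3D structure-based embedding

# Combine the two complementary views into one feature vector that a
# downstream classifier or regressor can consume.
features = np.concatenate([seq_embedding, struct_embedding])
print(features.shape)  # (712,)
```

Concatenation is the simplest way to let a single model see both views; other fusion schemes (e.g. training separate models per embedding) are also possible, as the ensemble description below suggests.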
Machine Learning Algorithms: The model uses machine learning algorithms for both classification and regression tasks.
Model Ensemble: The model combines the results from multiple machine learning models. It integrates the predictions from six classification models, each utilizing different combinations of the feature embeddings and machine learning algorithms. Consensus results are derived by averaging the probabilities generated by these models.
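The consensus step described above amounts to averaging per-model probabilities. A minimal sketch, assuming six classifiers and five compounds with made-up probabilities:

```python
import numpy as np

# Predicted blocker probabilities for 5 compounds from 6 hypothetical
# classifiers (each built on a different embedding/algorithm combination).
rng = np.random.default_rng(2)
model_probs = rng.uniform(size=(6, 5))  # rows: models, columns: compounds

# Consensus: average the probabilities across the 6 models, then apply a
# 0.5 threshold (an assumed cutoff) to flag potential hERG blockers.
consensus = model_probs.mean(axis=0)
is_blocker = consensus >= 0.5
print(consensus.round(3))
print(is_blocker)
```

Averaging probabilities (soft voting) tends to be more stable than majority-voting hard labels, since it preserves each model's confidence.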
Model Evaluation: The model's performance is evaluated on various datasets, including those with different origins. It is compared against other published models to assess its predictive capabilities.
Link to Publication:
Source Code and Dependencies: https://github.com/WeilabMSU/hERG-prediction#virtual-screening-of-drugbank-database-for-herg-blockers-using-topological-laplacian-assisted-ai-models
Contributors: Hongsong Feng, Guo-Wei Wei
Hello,
Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!
Week 1 - Get to know the community
Week 2 - Install and run an ML model
Week 3 - Propose new models
Week 4 - Prepare your final application