ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: <Luis_Camacho> #872

Closed luiscamachocaballero closed 8 months ago

luiscamachocaballero commented 8 months ago

Week 1 - Get to know the community

Week 2 - Install and run an ML model

Week 3 - Propose new models

Week 4 - Prepare your final application

luiscamachocaballero commented 8 months ago

Following the instructions, installing the Ersilia Model Hub was very straightforward. I only suggest editing Pre-requisite 5 (The Isaura data lake), where the instruction conda activate ersilia appears by mistake. Also, the link for installing Docker points to a page of general information, which I think could cause confusion: people might conclude that they only have to install Docker Desktop. In my view, Pre-requisite 6 should state that Docker Engine must be installed.

luiscamachocaballero commented 8 months ago

Motivation statement

I was a research manager at PUCP. In 1998, I founded the Rural Telecommunications Research Group (GTR) at PUCP, and then I chose the hardest option: intense field work in some of the most remote rural areas of the world instead of a traditional tenure track on the PUCP campus. I spent 15 years carrying out ICT4D projects in the Amazon rainforest for the benefit of low-income villagers, starting immediately with connectivity for very isolated rural villages. Although I was a scholar, my team acted as activist practitioners, more focused on the well-being of the population than on research. The work was extremely intense; our major concern was that public health workers had connectivity as a tool to deal with emergencies and to fight neglected diseases such as malaria. After many attempts and several failures (we started with unstable AX.25 packet radio technology), the effort was rewarded: in 2006 we managed to install a chain of WiFi repeaters, and the following year we extended the chain to 500 km in length, from Cabo Pantoja to Iquitos. We called it NapoNet and, as far as I know, it is the longest such chain in the world. I worked on it for four more years, until 2011. I am happy that 14 years after the installation the chain is still working; it survived without us! Well, not exactly: we never left completely, but responsibility for maintenance passed to the local authorities. The lesson was clear: projects need long-term support to succeed. In 2013 we started a new project, Tucan3G, which is very similar to the Mexican Rhizomatica project but uses NapoNet as backhaul and has the telco Movistar (BME: TEF) as a partner, to avoid legal issues around the use of mobile frequency bands.

In 2016 I started another big project: I launched the SIMINCHIKKUNARAYKU Initiative, which proposes a holistic vision: the use of artificial intelligence, multimedia, and public policy (based on economic evidence) to preserve and foster South American native languages. Our first target was Quechua, including all its dialects, the most widely spoken native language of the Americas. Artificial intelligence is fueling a new industrial revolution, and it also enables the computational portability of languages, which involves creating natural language processing (NLP) systems. I see artificial intelligence as the disruptive, ultimate technology for the preservation of endangered languages, but it is clear that alleviating the lack of resources is the first step. That requires substantial funding, which unfortunately is not available, so I have been forced to put the project on stand-by until I can convince big donors, angel investors, or South American decision-makers to make a move in the right direction.

While running that last endeavor, I developed new skills in Artificial Intelligence, Natural Language Processing, Big Data, DevOps, and Cloud Computing. I am happy to have found Outreachy and Ersilia; I feel I can use my skills to contribute again to rural communities threatened by deadly neglected diseases. I think I could re-engage with South American public health officers and show them the opportunity to use Ersilia as a new weapon against malaria and other serious tropical illnesses. While this networking could help me advance my career, during the internship I wish to understand Ersilia's code and pipeline in depth and to spend as much of that time coding as possible; although I am a manager and an electronics and telecommunications engineer, I enjoy writing code. I'd be glad if Ersilia gave me a chance.

luiscamachocaballero commented 8 months ago

Dealing with COVID is urgent, but I think many people are already working on it. That is why I selected Plasma Protein Binding (IDL-PPBopt). As far as I know, PPB means that if there is no protein to transport a given hormone, that hormone does not work. So one non-empirical treatment alternative is a substance or medication that increases the binding of a protein to a hormone, or that increases the amount of this protein in the blood, so that hormones whose production is deficient can perhaps bind in sufficient numbers to be transported to the organ that needs them. For example, if there is no binding protein, a disease known as hypothyroidism can occur, that is, a deficiency of adequate amounts of thyroid hormone. The other extreme is also harmful: sometimes proteins are present in such quantities that they bind the hormone or chemical substance too tightly, causing damage. Both an excess and a deficiency of hormones are harmful.

luiscamachocaballero commented 8 months ago

(ersilia) alonso@alonso-OMEN-by-HP-Laptop:~/ersilia$ ersilia serve eos22io
🚀 Serving model eos22io: idl-ppbopt

   URL: http://127.0.0.1:47081
   PID: 545296
   SRV: conda

👉 To run model:
   - run

💁 Information:
   - info
(ersilia) alonso@alonso-OMEN-by-HP-Laptop:~/ersilia$ ersilia api -i 'CCCOCCC'
{
    "input": {
        "key": "POLCUAVZOMRGSN-UHFFFAOYSA-N",
        "input": "CCCOCCC",
        "text": "CCCOCCC"
    },
    "output": {
        "outcome": [
            0.36892343
        ]
    }
}
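
For anyone post-processing this response, here is a minimal sketch of how the score could be pulled out of the JSON shown above. It assumes the single-molecule response always has this shape; the helper name extract_outcome is made up for illustration and is not part of the Ersilia CLI.

```python
import json

# Response copied from the `ersilia api` call above.
raw_response = """
{
    "input": {
        "key": "POLCUAVZOMRGSN-UHFFFAOYSA-N",
        "input": "CCCOCCC",
        "text": "CCCOCCC"
    },
    "output": {
        "outcome": [
            0.36892343
        ]
    }
}
"""


def extract_outcome(response_text):
    """Return (smiles, score) from a response shaped like the one above.

    Hypothetical helper: it only assumes the keys visible in the sample.
    """
    record = json.loads(response_text)
    smiles = record["input"]["input"]
    score = record["output"]["outcome"][0]
    return smiles, score


smiles, score = extract_outcome(raw_response)
print(f"{smiles}: predicted PPB score = {score}")
```
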
luiscamachocaballero commented 8 months ago

[Screenshots from 2023-10-23: 11-21-05, 11-21-27, 11-22-15, 11-22-37]

Outputs of running Step 6. Identify the Privileged Substructure for each molecule

Finally, I was able to run the model after disabling every mention of CUDA in the ipynb file, including torch.backends.cudnn.benchmark = True.

To run predictions, as required, I used the Essential Medicines List file as input to the IDL-PPBopt.ipynb notebook. For compatibility, I had to rename the input's column heading from smiles to cano_smiles.
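
The header change can also be scripted rather than done by hand. The sketch below uses only the standard library; the function name rename_smiles_column, the drugs column, and the two sample rows are made up for illustration, and only the smiles to cano_smiles rename comes from the step described above.

```python
import csv
import io


def rename_smiles_column(csv_text, old="smiles", new="cano_smiles"):
    """Return csv_text with the header column `old` renamed to `new`."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    # Only the header row is touched; data rows are written back unchanged.
    rows[0] = [new if name.strip() == old else name for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Two made-up rows standing in for the Essential Medicines List file.
sample = "drugs,smiles\nexample_a,CCO\nexample_b,CCCOCCC\n"
converted = rename_smiles_column(sample)
print(converted)
```

For the real file, the same function can be applied to the text read from eml_canonical.csv and the result written to a new file that the notebook loads.
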

This is the output

The predictive model estimates the propensity of compounds to bind to plasma proteins. It takes canonical SMILES as input and yields predictions as output, with values closer to 1 indicating a higher drug-protein binding probability and values closer to 0 a lower one. Its neural network architecture, built on AttentiveFP with various activation functions, is designed for regression tasks.

luiscamachocaballero commented 8 months ago

Installing and running Docker

I ran docker pull ersiliaos/eos22io to pull the IDL-PPBopt image from the Ersilia Model Hub into Docker. The pull succeeded after about ten minutes, since almost 5 GB of data had to be downloaded.

Next, I copied the eml_canonical.csv file into the Docker container.

Then I ran ersilia -v api run -i eml_canonical.csv -o ersilia_output.csv to generate the alternative predictions.

Finally, I downloaded the output file.

Comparing results with the Ersilia Model Hub implementation

Running the model through Docker and the Ersilia Model Hub returned an answer for all 442 inputs; however, 9 compounds had NULL prediction values. These compounds are not represented in SMILES notation, so they could not be used as features for prediction, which is why their predictions are NULL. In my earlier run, using the ipynb file provided by the authors of the IDL-PPBopt model, the output file omitted these 9 compounds, so it contained only 433 prediction values. The prediction values were practically the same in both cases.
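
A comparison like this can be reproduced with a short script. The sketch below is only an illustration: the column names (input, outcome, cano_smiles, prediction), the helper names, and the three sample rows are assumptions, not the actual contents of ersilia_output.csv or of the notebook output.

```python
import csv
import io


def load_predictions(csv_text, smiles_col, value_col):
    """Map each SMILES string to a float prediction, or None for empty/NULL values."""
    predictions = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = (row.get(value_col) or "").strip()
        is_null = raw == "" or raw.upper() == "NULL"
        predictions[row[smiles_col]] = None if is_null else float(raw)
    return predictions


def compare_predictions(hub, notebook):
    """Count NULL hub predictions and measure the largest gap on shared compounds."""
    null_in_hub = [s for s, v in hub.items() if v is None]
    shared = [
        s for s, v in hub.items()
        if v is not None and notebook.get(s) is not None
    ]
    max_abs_diff = max((abs(hub[s] - notebook[s]) for s in shared), default=0.0)
    return {
        "null_in_hub": len(null_in_hub),
        "compared": len(shared),
        "max_abs_diff": max_abs_diff,
    }


# Made-up rows; the real files hold 442 and 433 entries respectively.
hub_csv = "input,outcome\nCCO,0.2100\nCCCOCCC,0.3689\nnot_a_smiles,\n"
notebook_csv = "cano_smiles,prediction\nCCO,0.2101\nCCCOCCC,0.3690\n"

summary = compare_predictions(
    load_predictions(hub_csv, "input", "outcome"),
    load_predictions(notebook_csv, "cano_smiles", "prediction"),
)
print(summary)
```

With the real files, null_in_hub would be expected to report the 9 compounds without a SMILES representation, and compared the 433 shared predictions.
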

luiscamachocaballero commented 8 months ago

My first selection is Machine learning approaches to optimize small-molecule inhibitors for RNA targeting.

This research is relevant because it addresses the optimization of small-molecule inhibitors for RNA targeting through machine-learning approaches. It covers the development of data-driven algorithms, the use of machine learning models to predict binding, the synthesis of new inhibitors, experimental corroboration of binding, and the application of the models in the laboratory. These details highlight the significance of machine learning in drug discovery and the efforts made to optimize the inhibitory effect of small molecules on RNA targets.

The model uses data-driven algorithms and various machine-learning techniques to predict the binding of small molecules to RNA targets. It takes into account features such as chemical properties and molecular structure to predict the inhibitory effect of the molecules. By analyzing a benchmark dataset of small molecules, the model extracts learning principles and identifies the essential features that influence binding to RNA targets. This information is then used to design and synthesize new inhibitors with improved binding properties. The model's predictions are validated experimentally by testing the inhibitory effect of the synthesized compounds on M. smegmatis ribosomes using a bacterial coupled transcription/translation assay. The accuracy and effectiveness of the model are evaluated by how closely its predictions match the laboratory results.

The authors provided a Python class called LGRF30. This class predicts the binding of molecules to the RNA target using a decision tree classifier, named Model30, with a binding threshold of -13.2. The decision tree classifier uses just three features: num_of_N, num_of_C, and HallKierAlpha.

The LGRF30 class has two main functions: main() and evaluate(). The main() function loads the trained decision tree classifier model and then uses it to predict the binding of the molecules in the input data. The evaluate() function evaluates the performance of the classifier on a given dataset, by calculating the mean squared error (MSE), mean absolute error (MAE), and R-squared.

To use the LGRF30 class, first create an instance by passing in the input data as a pandas DataFrame. The input data should contain the following columns:

Name: The name of the molecule.
num_of_N: The number of nitrogen atoms in the molecule.
num_of_C: The number of carbon atoms in the molecule.
HallKierAlpha: The Hall-Kier alpha descriptor of the molecule.

Once you have created an instance of the LGRF30 class, you can call the main() function to predict the binding of the molecules in the input data. The main() function will return a Pandas DataFrame containing the predicted binding scores for each molecule.

If you also have the ground truth binding scores for the molecules, you can call the evaluate() function to evaluate the performance of the classifier. The evaluate() function will return a Pandas Series containing the MSE, MAE, and R-squared values.
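
For reference, the three metrics named above can be computed as in the minimal sketch below. This is not the authors' evaluate() implementation; the function name regression_metrics and the toy binding scores are made up for illustration.

```python
def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, and R-squared for two equal-length lists of scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mse": ss_res / n,
        "mae": sum(abs(e) for e in errors) / n,
        # R-squared is undefined when all true values are identical.
        "r2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
    }


# Toy values only, chosen near the -13.2 threshold mentioned above.
true_scores = [-14.0, -13.0, -12.0]
predicted_scores = [-13.5, -13.0, -12.5]
metrics = regression_metrics(true_scores, predicted_scores)
print(metrics)
```
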

Here is an example of how to use the LGRF30 class:

# Assumption: the LGRF30 class lives in a module of the same name,
# next to the get_rdkit_features_df helper.
import LGRF30

# Load the RDKit features for the molecules
mol2_org = LGRF30.get_rdkit_features_df(filePath=r'\Data\aligned.mol2', names=['HallKierAlpha'])

# Create a new LGRF30 object (the module itself is not callable, so use the class inside it)
lgrf30 = LGRF30.LGRF30(X_from_RD_features=mol2_org)

# Predict the binding scores
predictions_org = lgrf30.main()

# Print the predicted binding scores
print(predictions_org)

luiscamachocaballero commented 8 months ago

My second choice is MRlogP: Transfer Learning Enables Accurate logP Prediction Using Small Experimental Training Datasets.

This study on logP prediction using transfer learning is highly relevant because of its direct implications for drug discovery and medicinal chemistry. LogP, which measures the lipophilicity of a compound, is a critical property that influences its absorption, distribution, metabolism, and excretion (ADME). Accurate logP prediction is crucial for optimizing compound design and predicting behavior in vivo. By improving logP prediction accuracy, this study offers a valuable tool for the drug discovery process and for identifying more effective drug candidates. The study compares various logP prediction methods, highlighting their strengths and weaknesses, which helps researchers select the most appropriate approach for their specific needs. The use of transfer learning also shows how existing data can be leveraged to improve modeling accuracy even when experimental measurements are limited. Finally, this research has implications beyond drug discovery: accurate logP prediction is relevant in fields such as environmental chemistry, where understanding chemical behavior and fate is crucial, so the study opens a path to more accurate predictions in other scientific domains.
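
As a reminder of the quantity being predicted, logP is the base-10 logarithm of the ratio of a neutral compound's concentrations in octanol and in water at equilibrium. The sketch below computes it from two measured concentrations; the function name log_p and the sample concentrations are made up for illustration and have nothing to do with the MRlogP code itself.

```python
import math


def log_p(conc_octanol, conc_water):
    """Octanol-water partition coefficient, expressed as log10 of the concentration ratio."""
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("Concentrations must be positive")
    return math.log10(conc_octanol / conc_water)


# Made-up concentrations in the same units: 100 times more compound in octanol than in water.
lipophilic_example = log_p(100.0, 1.0)
# Equal concentrations in both phases give a logP of zero.
balanced_example = log_p(1.0, 1.0)
print(lipophilic_example, balanced_example)
```

Positive values indicate a compound that prefers the octanol phase (more lipophilic); negative values indicate one that prefers water.
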

The code is carefully documented in a README file. The training data is available, but no checkpoints are provided.

luiscamachocaballero commented 8 months ago

My third selection is Predicting Antimalarial Activity in Natural Products Using Pretrained Bidirectional Encoder Representations from Transformers, a publication that describes an open-source ML model that could be of interest to Ersilia.

Predicting antimalarial activity in natural products is relevant because it can significantly speed up the discovery of new potential drugs against malaria. Malaria is a life-threatening disease caused by the Plasmodium parasite and affects millions of people worldwide, particularly in developing countries. Traditional drug discovery methods can be time-consuming and costly, so predictive models based on machine learning offer a more efficient approach. By combining machine learning algorithms with chemical properties, researchers can build models that predict the antimalarial activity of natural products. These models can save significant time and resources by narrowing down the pool of compounds that need to be tested experimentally. Furthermore, natural products have long been recognized as important sources of drug leads, and many effective antimalarial drugs originate from natural sources. Predicting the antimalarial activity of natural products can identify promising candidates for further investigation, potentially leading to the discovery of novel antimalarial drugs.

As for the software, there is a pre-trained NPBERT model with a JSON configuration file and a tokenizer. To run the model, follow the steps below:

Requirements:

rdkit == 2019.09.1.0
transformers == 4.2.2
python 3.7

Generate features:

Generate features from a single SMILES input:

python3 extract_feature.py --input_smile="C1CCCCC1C2CCCCC2"

Generate features from a CSV file containing SMILES inputs:

python3 extract_feature.py --input_file='in_PATH/Intput.csv' --output_file='OUT_PATH/Output.csv'
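
When many files have to be processed, the two invocations above can be assembled from Python. The sketch below only builds the argument lists and does not run anything. The helper name build_extract_command is made up, and it assumes the script accepts the --flag=value form, which I have not verified against the repository.

```python
def build_extract_command(input_smile=None, input_file=None, output_file=None,
                          script="extract_feature.py"):
    """Build the argv list for one of the two invocations quoted above.

    Hypothetical helper: only the three flags shown in the README are used.
    """
    if input_smile is not None:
        return ["python3", script, f"--input_smile={input_smile}"]
    if input_file is None or output_file is None:
        raise ValueError("Provide input_smile, or both input_file and output_file")
    return [
        "python3",
        script,
        f"--input_file={input_file}",
        f"--output_file={output_file}",
    ]


single = build_extract_command(input_smile="C1CCCCC1C2CCCCC2")
batch = build_extract_command(
    input_file="in_PATH/Intput.csv",
    output_file="OUT_PATH/Output.csv",
)
print(single)
print(batch)
```

Passing the command as a list, for example to subprocess.run, avoids shell quoting problems with characters that appear in SMILES strings, such as parentheses and equals signs.
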

The data is available but checkpoints aren't.

GemmaTuron commented 8 months ago

Hello,

Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!