✍️ Contribution period: juliet owanku

julietowah commented 1 year ago

Week 1 - Get to know the community

[X] Join the communication channels
[X] Open a GitHub issue (this one!)
[X] Install the Ersilia Model Hub and test the simplest model
[X] Write a motivation statement to work at Ersilia
[x] Submit your first contribution to the Outreachy site

Week 2 - Install and run an ML model

[x] Select a model from the suggested list
[x] Install the model in your system
[x] Run predictions for the EML
[x] Compare results with the Ersilia Model Hub implementation!
[x] Install and run Docker!

Week 3 - Propose new models

[x] Suggest a new model and document it (1)
[x] Suggest a new model and document it (2)
[x] Suggest a new model and document it (3)

Week 4 - Prepare your final application

[x] Submit the final application in the Outreachy website

julietowah commented 1 year ago

Join the communication channels

I was able to join the communication channels on Slack, and I am excited to have met a few people who have been helping me set up my environment.

julietowah commented 1 year ago

Install the Ersilia Model Hub and test the simplest model

I have successfully installed the Ersilia Model Hub and tested it using

ersilia -v fetch eos3b5e
ersilia serve eos3b5e
ersilia -v api run -i "CCCC"

and my output is:

{ "input": { "key": "IJDNQMDRQITEOD-UHFFFAOYSA-N", "input": "CCCC", "text": "CCCC" }, "output": { "mw": 58.123999999999995 } }
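A minimal Python sketch of reading that output, assuming the JSON printed above has been captured as a string. The variable names are illustrative and not part of Ersilia.

```python
import json

# JSON exactly as printed by the run above.
raw = (
    '{ "input": { "key": "IJDNQMDRQITEOD-UHFFFAOYSA-N", "input": "CCCC", '
    '"text": "CCCC" }, "output": { "mw": 58.123999999999995 } }'
)

result = json.loads(raw)
smiles = result["input"]["input"]    # the SMILES string that was sent
mol_weight = result["output"]["mw"]  # the value returned under "mw"

print(f"{smiles}: {mol_weight:.3f}")
```

Rounding to three decimals hides the floating-point tail in the raw value.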

julietowah commented 1 year ago

My motivation statement:

Hello everyone, my name is Juliet Owanku and I am from Nigeria. I am still growing my skills toward becoming a software engineer, and I have had the opportunity to work on some projects using HTML, CSS, JavaScript, and Python.

One thing that has kept me motivated in pursuing this career is my urge to help bring solutions through tech. I have been making progress in finding problems and creating solutions for them.

I was so happy to come across the Ersilia project because it is everything I wanted (to become part of a solution system). I am happy to contribute to this project, and I hope to have a great impact on society in the near future.

Coming from this part of the world, I know that this project will be a great relief to the health sector, and I look forward to learning and improving my programming skills in general. Thank you.

julietowah commented 1 year ago

Select a model from the suggested list

I chose SARS-CoV2 activity (Image Mol) for these reasons:

After the global shutdown to curb COVID-19, the virus mutated and formed more variants, which made it difficult to keep track of. ImageMol is an AI model trained on molecular images to predict the molecular targets of candidate compounds, and it can help predict potential drugs based only on molecular properties. This could be a very big breakthrough in medical science.

I would really love to be part of this great work, as I would learn so many things and join the team that contributes to the development of this project.

julietowah commented 1 year ago

Hi @DhanshreeA, please, I would love your help with the ImageMol pretraining. I am a bit lost and would love to get back on track.

leilayesufu commented 1 year ago

Hi Juliet, Where are you lost? Have you installed the model on your system?

julietowah commented 1 year ago

I created a new environment using

conda create -n imagemol python=3.7.3

then activated it with

conda activate imagemol

and downloaded the packages for PyTorch, torch-cluster, torch-scatter, torch-sparse, and torch-spline-conv:

conda install -c rdkit rdkit
pip install https://download.pytorch.org/whl/cu101/torch-1.4.0-cp37-cp37m-linux_x86_64.whl
pip install https://download.pytorch.org/whl/cu101/torchvision-0.5.0-cp37-cp37m-linux_x86_64.whl
pip install torch-cluster torch-scatter torch-sparse torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.4.0%2Bcu101.html

Then I was able to download the pretraining dataset, saved it in ./datasets/pretraining/data/, and preprocessed the dataset.

Then I got lost at [start to pretrain](https://github.com/HongxinXiang/ImageMol#2-start-to-pretrain). I just need some explanations and directions. Thank you.

leilayesufu commented 1 year ago

Hi, I just went through the model and I don't think you need to pretrain it. Rather, you should make predictions using a pre-trained model.

leilayesufu commented 1 year ago

This should be the model you're to use. "SARS-CoV2 activity (Image Mol)"

If you've installed the necessary packages, you can skip pre-training and go to fine-tuning in the GitHub repository.

julietowah commented 1 year ago

ok thank you so much @leilayesufu

julietowah commented 1 year ago

CHANGE OF MODEL TO NCATS Rat Liver Microsomal Stability

I had to change my model because of some system limitations. I tried to fix them, but it was taking so much time that I had to switch to NCATS Rat Liver Microsomal Stability.

My reason for choosing this model is that it gives information about the metabolic stability of compounds, which is often measured using rat liver microsomes. With it, it is possible to monitor and predict the stability of drugs in rat liver microsomes, which helps to know how the drugs will behave in an organism's liver.

I believe this will also bring advancement in the pharmaceutical industry.

julietowah commented 1 year ago

Install the model in your system

I first cloned the repo with

git clone --recursive https://github.com/ncats/ncats-adme.git

Installing required software

As I had previously installed Miniconda on my system, I just had to create the environment from my terminal, inside this directory:

ncats-adme/server

I created the environment using

conda env create --prefix ./env -f environment.yml

and activated it with

conda activate ./env

julietowah commented 1 year ago

Running the application

I ran the code with

python app.py

and then went to http://127.0.0.1:5000

julietowah commented 1 year ago

Run predictions for the EML

@DhanshreeA, please, could you help with this error that I get when trying to run the predictions?

There was an error processing your file. Please make sure you have selected a file that contains SMILES, indicate if the file contains a header and the column number containing the SMILES.

I have been able to solve it: I just had to reload my browser.

When I opened the browser at http://127.0.0.1:5000/, I clicked on predict and then chose the text file format.
I selected only RLM stability and HLC stability, which stand for Rat Liver Microsomal Stability and Human Liver Cytosolic Stability.
Then I browsed to choose the EML CSV file (eml_canonical.csv) from my computer,
uploaded it, and clicked the process file button.

The prediction result has three columns. The first column is the image of the molecular structure. The second column is the Predicted Class (Probability), which gives the predicted class for the molecule: class 0 means the molecule is stable and class 1 means it is unstable. The third column is the Prediction, which states whether the molecule is predicted to be "stable" or "unstable".

For Rat Liver Microsomal Stability, the first row shows 1 (1.0), which means unstable, and the second row shows 0 (0.94), which means stable. In the same vein, for Human Liver Cytosolic Stability the first row shows 0 (0.61), which is stable, and the second row is also stable with 0 (0.56).
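The Predicted Class (Probability) cells described above can be split into their parts with a few lines of Python. This is only a sketch: `parse_prediction` and `LABELS` are hypothetical helper names, and the cell format is assumed from the values quoted in this comment rather than from the NCATS code.

```python
import re

# Class labels as described above: 0 = stable, 1 = unstable.
LABELS = {0: "stable", 1: "unstable"}


def parse_prediction(cell):
    """Split a cell such as '0 (0.94)' into (class, probability, label)."""
    match = re.fullmatch(r"\s*([01])\s*\(\s*([0-9]*\.?[0-9]+)\s*\)\s*", cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    predicted_class = int(match.group(1))
    probability = float(match.group(2))
    return predicted_class, probability, LABELS[predicted_class]


# Values reported above for the first two rows of each table.
for cell in ["1 (1.0)", "0 (0.94)", "0 (0.61)", "0 (0.56)"]:
    print(cell, "->", parse_prediction(cell))
```

Keeping the class and the probability separate makes it easier to see that a probability such as 0.56 is a much weaker call than 0.94, even though both rows are labelled stable.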

julietowah commented 1 year ago

I ran the prediction again, this time selecting PAMPA Permeability (pH 7.4) and PAMPA Permeability (pH 5.0).

PAMPA stands for Parallel Artificial Membrane Permeability Assay. Permeability measures the ability of a substance (often a drug or chemical compound) to cross a model membrane under specific conditions. In this case, the measurement is performed at a pH of 7.4 and 5.0, respectively.

From the results above, high permeability indicates that the compounds are expected to pass through biological membranes more readily, which can be important in drug absorption. Low or moderate permeability suggests a slower or more controlled rate of permeation. In the PAMPA Permeability (pH 7.4) table, the first row has low or moderate permeability with 1 (1.0), and the second row has high permeability with a probability of 0.9. In the PAMPA Permeability (pH 5.0) table, the first row has low permeability with 1 (0.9), and the second row has moderate or high permeability with 0 (1.0).

julietowah commented 1 year ago

Compare results with the Ersilia Model Hub implementation! : Human Liver Cytosolic Stability.

I opened the Ersilia Model Hub link and clicked on the Microsomal stability tab, which has the Human Liver Microsomal Stability and Rat Liver Microsomal Stability models. I clicked on Human Liver Microsomal Stability and was redirected to the Ersilia GitHub repo of the model. I went through the README.md in the repo and clicked on the DockerHub link.

The Human liver microsomal stability model code is eos31ve.

I opened a terminal and ran the code below to fetch the model.

ersilia -v fetch eos31ve

and i got this output

Output

09:09:44 | DEBUG    | Getting schema for API run...
09:09:44 | DEBUG    | No annotated metadata could be retrieved
09:09:44 | DEBUG    | No annotated metadata could be retrieved
09:09:44 | DEBUG    | Latest meta: {'hlm_proba1': ['hlm_proba1']}
09:09:44 | DEBUG    | hlm_proba1 : {'type': 'numeric'}
09:09:44 | DEBUG    | Meta k: ['hlm_proba1']
09:09:44 | DEBUG    | Schema: {'input': {'key': {'type': 'string'}, 'input': {'type': 'string'}, 'text': {'type': 'string'}}, 'output': {'hlm_proba1': {'type': 'numeric', 'meta': ['hlm_proba1']}}}
09:09:44 | DEBUG    | Done with the schema!
09:09:44 | DEBUG    | This is the schema {'input': {'key': {'type': 'string'}, 'input': {'type': 'string'}, 'text': {'type': 'string'}}, 'output': {'hlm_proba1': {'type': 'numeric', 'meta': ['hlm_proba1']}}}
09:09:44 | DEBUG    | API schema saved at /home/juliet/eos/dest/eos31ve/api_schema.json
09:09:47 | DEBUG    | Fetching eos31ve done in time: 0:05:13.409397s
09:09:47 | INFO     | Fetching eos31ve done successfully: 0:05:13.409397
👍 Model eos31ve fetched successfully!

I then served the model using

ersilia serve eos31ve

and the output is

🚀 Serving model eos31ve: ncats-hlm

   URL: http://127.0.0.1:52765
   PID: 7917
   SRV: conda

👉 To run model:
   - run

💁 Information:
   - info

I ran the model using the code below.

and the output: output.csv

key	input	hlm_proba1
MCGSCOLBFJQGHM-SCZZXKLOSA-N	Nc1nc(NC2CC2)c3ncn([C@@H]4CC@HC=C4)c3n1	0.009
GZOSMCIZMLWJML-VJLLXTKPSA-N	C[C@]12CCC@HCC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5	0.251
BZKPWHYZMXOIDC-UHFFFAOYSA-N	CC(=O)Nc1sc(nn1)S(=O)=O	0.003
QTBSBXVTEAMEQO-UHFFFAOYSA-N	CC(O)=O	0.0
PWKSKIMOESPYIA-BYPYZUCNSA-N	CC(=O)NC@@HC(O)=O	0.0
BSYNRYMUTXBXSQ-UHFFFAOYSA-N	CC(=O)Oc1ccccc1C(O)=O	0.006
MKUXAQIIEYXACX-UHFFFAOYSA-N	NC1=NC(=O)c2ncn(COCCO)c2N1	0.001
ASMXXROZKSBQIH-VITNCHFBSA-N	OC(C(=O)O[C@H]1C[N+]2(CCCOC3=CC=CC=C3)CCC1CC2)(C1=CC=CS1)C1=CC=CS1	0.885
ULXXDDBFHOBEHA-CWDCEQMOSA-N	CN(C)C\C=C\C(=O)NC1=C(O[C@H]2CCOC2)C=C2N=CN=C(NC3=CC(Cl)=C(F)C=C3)C2=C1	0.12
HXHWSAZORRCQMX-UHFFFAOYSA-N	CCCSc1ccc2nc(NC(=O)OC)[nH]c2c1	0.14
OFCNXPDARWKPPY-UHFFFAOYSA-N	O=C1N=CN=C2NNC=C12	0.008
YVPYQUNUQOZFHG-UHFFFAOYSA-N	CC(=O)Nc1c(I)c(NC(C)=O)c(I)c(C(O)=O)c1I	0.0
LKCWBDHBTVXHDL-RMDFUYIESA-N	NCCC@HC(=O)N[C@@H]1CC@H C@@H C@H[C@H]1O[C@H]3OC@H C@@H C@H[C@H]3O	0.001
XSDQTOBWRPYKKA-UHFFFAOYSA-N	NC(N)=NC(=O)c1nc(Cl)c(N)nc1N	0.0
IYIKLHRQXLHMJQ-UHFFFAOYSA-N	CCCCc1oc2ccccc2c1C(=O)c3cc(I)c(OCCN(CC)CC)c(I)c3	0.226
KRMDCWKBEZIMAB-UHFFFAOYSA-N	CN(C)CCC=C1c2ccccc2CCc3ccccc13	0.135
HTIQEAQVCYTUBX-UHFFFAOYSA-N	CCOC(=O)C1=C(COCCN)NC(=C(C1c2ccccc2Cl)C(=O)OC)C	0.59
OVCDSSHSILBFBN-UHFFFAOYSA-N	CCN(CC)Cc1cc(Nc2ccnc3cc(Cl)ccc23)ccc1O	0.27
MQXQVCLAUDMCEF-CWLIKTDRSA-N	O.O.O.CC1(C)S[C@@H]2C@HC(=O)N2[C@H]1C(O)=O	0.003
APKFDSVGJQXUKY-ZNVUZQDLSA-N	C[C@H]1OC@@HCC@H[C@H]3C(O)=O)C@@H C@@H[C@@H]1O	0.001
AVKUERGKIZMTKX-NJBDSQKTSA-N	CC1(C)S[C@@H]2C@HC(=O)N2[C@H]1C(O)=O	0.034
YBBLVLTVTVSKRW-UHFFFAOYSA-N	CC(C)(C#N)c1cc(Cn2cncn2)cc(c1)C(C)(C)C#N	0.003

my observation

Original Model Predictions: Probabilities less than 1, predicting as "stable."

Ersilia Model Hub Predictions: Probabilities less than 0.5, also predicting as "stable."

This suggests that both models agree on instances being predicted as "stable" based on the given threshold. Consistency in predictions is a positive indication, especially when comparing with a reputable model like the Ersilia Model Hub.
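The comparison can be made explicit by applying a cutoff to the hlm_proba1 column. This is a minimal sketch under two assumptions that the model documentation would need to confirm: that hlm_proba1 is the probability of the unstable class, and that 0.5 is the cutoff. The rows are copied from the output listing above, and `label_rows` is a hypothetical helper.

```python
import csv
import io

# A few rows copied from the output.csv listing above (tab-separated).
SAMPLE = (
    "key\tinput\thlm_proba1\n"
    "QTBSBXVTEAMEQO-UHFFFAOYSA-N\tCC(O)=O\t0.0\n"
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N\tCC(=O)Oc1ccccc1C(O)=O\t0.006\n"
    "HXHWSAZORRCQMX-UHFFFAOYSA-N\tCCCSc1ccc2nc(NC(=O)OC)[nH]c2c1\t0.14\n"
    "HTIQEAQVCYTUBX-UHFFFAOYSA-N\t"
    "CCOC(=O)C1=C(COCCN)NC(=C(C1c2ccccc2Cl)C(=O)OC)C\t0.59\n"
)


def label_rows(handle, threshold=0.5):
    """Return (key, probability, label) for every row of the table.

    Assumption: hlm_proba1 is the probability of the unstable class, so
    values at or above the threshold are labelled 'unstable'.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        probability = float(row["hlm_proba1"])
        label = "unstable" if probability >= threshold else "stable"
        rows.append((row["key"], probability, label))
    return rows


for key, probability, label in label_rows(io.StringIO(SAMPLE)):
    print(f"{key}\t{probability}\t{label}")
```

To check the full file, the same function can be given `open("output.csv", newline="")` instead of the in-memory sample. Under these assumptions, any row at or above the cutoff would be labelled unstable, so it is worth listing such rows before concluding that the two implementations agree.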

julietowah commented 1 year ago

Install and run Docker!

I installed Docker using this command:

sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

and tested it with:

sudo docker run hello-world

Output:

Hello from Docker!
This message shows that your installation appears to be working correctly.

julietowah commented 1 year ago

1st Model Suggestion

KGNN: Knowledge Graph Neural Network for Drug-Drug Interaction Prediction

Publication: KGNN
Source Code (TensorFlow): sourcecode
Dataset: V5.1.4

About this model

Drug-Drug Interaction (DDI) prediction is a critical area within pharmacology and drug discovery that aims to anticipate and assess potential interactions between different drugs when taken simultaneously. Understanding these interactions is vital for patient safety and optimizing therapeutic outcomes, as some drug combinations can lead to adverse effects, reduced efficacy, or unexpected outcomes. DDI prediction plays a pivotal role in drug safety and personalized medicine, guiding clinicians and patients in making informed decisions regarding medication usage and combinations.

Why this model matters to Ersilia

Drug-Drug Interaction (DDI) prediction is crucial for Ersilia as it aligns with their mission of providing cutting-edge computational tools and models to researchers, pharmaceutical companies, and healthcare professionals in the field of drug discovery and development. Here's how DDI prediction is important to Ersilia:

Safety and Efficacy: Predicting potential drug-drug interactions is fundamental to ensure the safety and efficacy of drug therapies. By offering accurate DDI prediction models, Ersilia helps in identifying possible interactions that could lead to adverse effects or reduced drug efficacy, enabling better-informed decisions.

Patient-Centric Approach: Ersilia's models contribute to a patient-centric approach to healthcare. Predicting DDIs helps personalize treatment plans, ensuring that prescribed drugs are compatible and safe for individual patients, considering their unique health conditions and existing medication regimens.

Enhanced Drug Development: Drug developers can utilize DDI prediction models during the drug development process to assess potential interactions early on. This can streamline the development pipeline, reduce costs, and improve the chances of bringing successful, safe drugs to market.

Data-Driven Insights: Ersilia's DDI prediction models leverage data and advanced computational techniques. By providing data-driven insights into potential DDIs, Ersilia empowers researchers and practitioners with valuable information to optimize drug combinations and mitigate risks.

Support for Healthcare Professionals: Healthcare professionals can benefit from Ersilia's DDI prediction models to make informed decisions about drug prescribing. This support is critical in clinical settings, where physicians need quick and accurate information about potential interactions to provide the best care for their patients.

Research and Innovation: Ersilia's involvement in DDI prediction showcases their commitment to research and innovation. Continuously improving and expanding DDI prediction models contributes to the advancement of pharmacology and fosters innovation in drug discovery and healthcare practices.

Code Implementation

To run the code, you need the following dependencies:

Python == 3.6.6
Keras == 2.3.0
Tensorflow == 1.13.1
scikit-learn == 0.22

The TensorFlow-GPU library has progressed beyond version 1.13.1, with the most recent releases belonging to the 2.x series. Therefore, it's important to update the model to ensure compatibility and optimal performance on the latest versions.
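Before porting, it can help to see how far a current environment is from these pins. The sketch below is only an illustration: `check_pins` is a hypothetical helper, the distribution names are assumed to match those on PyPI, and `importlib.metadata` needs Python 3.8 or newer, so it is meant for a modern environment rather than the original Python 3.6.6 one.

```python
from importlib import metadata

# Pins taken from the dependency list above.
PINS = {"Keras": "2.3.0", "tensorflow": "1.13.1", "scikit-learn": "0.22"}


def check_pins(pins):
    """Return {name: (pinned, installed)}; installed is None if missing."""
    report = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (pinned, installed)
    return report


for name, (pinned, installed) in check_pins(PINS).items():
    status = "ok" if installed == pinned else "differs"
    print(f"{name}: pinned {pinned}, installed {installed} ({status})")
```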

DhanshreeA commented 1 year ago

Hi @julietowah

Thank you for the very detailed updates, and apologies for responding late.

If you find yourself finished with Week 3's tasks sooner and you still have some time remaining, I would recommend a bonus task (Please note, not completing this will not count towards your application, so no pressure there.) Can you extract the code required to run one of the NCATS Models (either one of RLM or HLCS) into a simple python script and run it against the EML? So in short, it should do the following:

Take SMILE inputs and preprocess them as per the original code
Load the model from a model path
Run predictions

The NCATS repo provides a full fledged server with a lot of extra functionality that models within the Ersilia Model Hub do not need. When we implement models within the hub, we try to keep the bare minimum required to process inputs as needed, load a model from a local file path, and make predictions. This generally involves extracting only the essential parts from the original codebase.
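The three steps above could be laid out roughly as follows. This is a skeleton under stated assumptions, not the NCATS implementation: all function names are hypothetical, `preprocess` is a placeholder for the original preprocessing code, and the model is assumed to be a pickled object with a scikit-learn style `predict()` method, which the real models may not be.

```python
import csv
import pickle


def read_smiles(path, column=0, has_header=True):
    """Read one SMILES string per row from a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header row
        return [row[column] for row in reader if len(row) > column]


def preprocess(smiles_list):
    """Placeholder step: strip whitespace and drop empty entries.

    The preprocessing from the original codebase would replace this.
    """
    return [s.strip() for s in smiles_list if s.strip()]


def load_model(model_path):
    """Load a model from a local path (assumption: it was pickled)."""
    with open(model_path, "rb") as handle:
        return pickle.load(handle)


def run_predictions(model, inputs):
    """Run predictions (assumption: the model exposes predict())."""
    return list(model.predict(inputs))
```

A run against the EML would then chain the steps, for example `run_predictions(load_model("model.pkl"), preprocess(read_smiles("eml_canonical.csv")))`, where `model.pkl` is a hypothetical path and the SMILES column index depends on the file layout.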

DhanshreeA commented 1 year ago

1st Model Suggestion

KGNN: Knowledge Graph Neural Network for Drug-Drug Interaction Prediction

Hi @julietowah, very interesting paper! Thank you for recommending this. Unfortunately, the inputs of this model do not seem to conform to what the Ersilia Model Hub expects at this time (i.e., SMILES strings). I may have missed something in the code (e.g., preprocessing), so please feel free to correct me.

julietowah commented 1 year ago

Thank you so much @DhanshreeA. I will try to find a better model suggestion. Your reply really helped a lot.

julietowah commented 1 year ago

1st Model Suggestion

I decided to use this as my 1st suggestion:

ADMETlab 2.0: an integrated online platform for accurate and comprehensive predictions of ADMET properties

Publication: ADMET
Source Code: sourcecode MGA
Model name: MGA

About this model:

ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties are crucial in drug development to ensure a balance between potency, pharmacokinetics, and safety of drug candidates. Traditionally, ADMET evaluation followed potency determination, often leading to late-stage adverse effects discovery. The role of ADMET evaluation has become pivotal, contributing to nearly 50% of drug development attrition. In silico prediction models and web tools have emerged to efficiently predict ADMET properties, aiding early-stage compound screening and lead optimization.

Why this model matters to Ersilia

ADMETlab 2.0 aligns with Ersilia's goals in drug discovery and development by offering a powerful tool for predicting crucial properties of drug candidates, ultimately contributing to more effective and efficient drug development processes.

Code Implementation

Requirements:

python 3.6
anaconda
dgl 0.4.3
xgboost
rdkit
pytorch
sklearn

ADMETlab 2.0 is a web application built with the Python web framework Django and deployed on an Ubuntu Linux system hosted on Aliyun's elastic compute service. Web access is provided through the Nginx web server. The prediction models are implemented in Python with the deep learning packages PyTorch and DGL, with RDKit for cheminformatics support.

julietowah commented 1 year ago

2nd Model Suggestion

Deep learning to generate in silico chemical property libraries and candidate molecules for small molecule identification in complex samples

Publication: Deep learning
Source Code: sourcecode
Model name: darkchem

About this model: DarkChem is a powerful framework that advances the field of small molecule characterization, particularly in metabolomics. Its ability to predict properties directly from molecular structures, focus on metabolomics-related properties, and adapt to different property prediction tasks positions it as a valuable tool for researchers working with complex mixtures and experimental data in various scientific disciplines.

Why this model matters to Ersilia

DarkChem's ability to identify and generate small molecule structures in complex mixtures can enhance the understanding of metabolite roles and interactions within biological processes. This can aid in identifying potential drug candidates and their properties, streamlining the drug design and optimization process.

implementation:

DarkChem was written in Python (version 3.6) and uses Keras with TensorFlow. From my observation, I think it is ready to use.

We have to create a conda virtual environment with the required dependencies:

conda create -n darkchem -c conda-forge -c rdkit -c openbabel python=3.7 keras tensorflow rdkit openbabel numpy scipy scikit-learn pandas

then activate it with

conda activate darkchem

and install DarkChem using pip, with either of the two options below:

# clone/install
git clone https://github.com/pnnl/darkchem.git
pip install darkchem/

# direct
pip install git+https://github.com/pnnl/darkchem

It takes a very long time for the required dependencies to install, but I think that is due to my system.

julietowah commented 1 year ago

Hi @DhanshreeA, do you think this https://github.com/uzh-dqbm-cmi/side-effects/ is okay as a suggestion?

julietowah commented 1 year ago

3rd Model Suggestion

Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT

Publication: MolPMoFiT
Source Code: sourcecode
Model name: MolPMoFiT

About this model:

MolPMoFiT is a powerful approach for predicting molecular properties accurately, even with smaller and challenging datasets. It combines self-supervised pre-training on a large dataset of unlabeled compounds with task-specific fine-tuning. This approach outperforms other machine learning methods on benchmark datasets, making it valuable for diverse chemical prediction tasks. MolPMoFiT aids in drug discovery by improving the accuracy and reliability of predicting molecular properties, which is crucial in the early stages of drug development. It can effectively handle smaller and challenging datasets, making it a valuable tool for optimizing drug candidates. This approach streamlines the process of identifying compounds with desired properties, ultimately speeding up drug discovery and reducing costs. Additionally, by utilizing self-supervised learning on a vast dataset of unlabeled compounds, MolPMoFiT harnesses the wealth of publicly-available chemical information to enhance prediction models, facilitating more informed decision-making in drug development.

How does this model matter to Ersilia:

MolPMoFiT's capabilities are highly relevant to Ersilia's mission, which involves advancing research and innovation in various scientific fields, including drug discovery and cheminformatics. Here's how MolPMoFiT can contribute to Ersilia's mission:

Accelerating Drug Discovery: MolPMoFiT enhances the drug discovery process by improving the accuracy of predicting molecular properties. This acceleration can lead to the discovery of new drug candidates more rapidly, aligning with Ersilia's goal of advancing research and innovation.

Cost Reduction: By increasing the efficiency and reliability of predicting molecular properties, MolPMoFiT can help reduce the costs associated with drug development. This cost-saving aspect is essential for optimizing resource allocation in scientific endeavors.

Leveraging Public Datasets: MolPMoFiT utilizes a vast dataset of unlabeled compounds from public sources. This approach aligns with Ersilia's mission of promoting the open sharing of scientific knowledge and resources for the benefit of the research community.

Promoting Data-Driven Research: Ersilia encourages data-driven approaches in scientific research. MolPMoFiT is a prime example of harnessing data to improve prediction models, contributing to more informed and data-driven decision-making in scientific endeavors.

In summary, MolPMoFiT's capabilities directly align with Ersilia's mission by advancing research, promoting open data sharing, and supporting cost-effective and data-driven scientific investigations. It can be a valuable asset for researchers and organizations committed to Ersilia's goals.

code implementation:

Environment Setup:

It's recommended to create a Conda environment using the provided molpmofit.yml file, ensuring all necessary dependencies are in place for MolPMoFiT.

Datasets: The required datasets for MolPMoFiT experiments are stored in the data folder. data/MSPM contains the dataset for training a general molecular structure prediction model. data/QSAR holds datasets for Quantitative Structure-Activity Relationship (QSAR) tasks.

Experiments: The code for conducting experiments is available in Jupyter Notebook format within the notebooks folder. Multiple notebooks cover various tasks, including training the general MSPM model, fine-tuning task-specific models, and performing QSAR classification and regression.

Pre-trained Models: Pre-trained models are available for download. ChEMBL_1M_atom is trained on 1 million ChEMBL molecules with atomwise tokenization. ChEMBL_1M_SPE is trained on the same dataset with SMILES Pair Encoding tokenization. Specific instructions for using these pre-trained models are provided in respective notebooks. Overall, this information equips users with the resources and guidance needed to set up the MolPMoFiT environment, access datasets, run experiments, and utilize pre-trained models effectively.

julietowah commented 1 year ago

@DhanshreeA, about writing the Python script:

I am running into some errors that I would love to fix first. I will continue to work on it and hopefully be able to solve them. I really appreciate the extra project, and I would really love to get to the final script.

GemmaTuron commented 1 year ago

Hello,

Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!
