ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

โœ๏ธ Contribution period: Promise Fru #821

Closed PromiseFru closed 10 months ago

PromiseFru commented 11 months ago

Week 1 - Get to know the community

Week 2 - Install and run an ML model

Week 3 - Propose new models

Week 4 - Prepare your final application

PromiseFru commented 11 months ago

Issue: Error Running Ersilia Calculate Command

Description

I've set up Ersilia and confirmed the CLI recognizes it by running ersilia --help, which displayed the available command options. I've also successfully fetched and served the eos3b5e model. However, I'm facing an error when trying to calculate the molecular weight using the ersilia -v api calculate -i "CCCC" command.

The error message I received is:

KeyError: 'calculate'

I suspect the problem may be that the command does not recognize "calculate" as a valid API name in the schema. I've made sure that all the necessary packages are installed, especially Git-LFS, since it's commonly overlooked.
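As a toy illustration of what I suspect is happening (hypothetical code, not Ersilia's actual implementation): if the served model only registers certain API names in its schema, looking up an unregistered name would raise exactly this error.

```python
# Hypothetical sketch, not Ersilia's code: assume only "run" is registered
api_schema = {"run": {"input": "compound"}}

def get_api(name):
    # a plain dict lookup raises KeyError for names missing from the schema
    return api_schema[name]

try:
    get_api("calculate")
except KeyError as err:
    print(f"KeyError: {err}")  # prints: KeyError: 'calculate'
```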

I would appreciate any help or guidance on resolving this issue.

Expected Behavior

I expect that running the ersilia -v api calculate -i "CCCC" command should calculate the molecular weight of the input molecule ("CCCC") and display the result in the CLI as follows:

{
    "input": {
        "key": "IJDNQMDRQITEOD-UHFFFAOYSA-N",
        "input": "CCCC",
        "text": "CCCC"
    },
    "output": {
        "mw": 58.123999999999995
    }
}
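As a sanity check on the expected value (plain arithmetic, not Ersilia code): butane (CCCC) is C4H10, so summing standard atomic masses reproduces the mw above.

```python
# butane = 4 carbons + 10 hydrogens; standard atomic masses: C ~ 12.011, H ~ 1.008
mw = 4 * 12.011 + 10 * 1.008
print(round(mw, 3))  # 58.124
```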

Actual Behavior

When I run the ersilia -v api calculate -i "CCCC" command, it results in a KeyError with the message: KeyError: 'calculate'.

Steps to Reproduce

  1. Set up Ersilia.
  2. Verify recognition and available command options using ersilia --help.
  3. Fetch and serve the eos3b5e model with the following commands:
    • ersilia -v fetch eos3b5e
    • ersilia serve eos3b5e
  4. Attempt to calculate the molecular weight using the command: ersilia -v api calculate -i "CCCC".

Environment

Operating System

Packages

Please let me know if you need any additional information to help resolve this issue.

leilayesufu commented 11 months ago

Hi, check this out. I made a bit of progress here

HellenNamulinda commented 11 months ago

ersilia -v api calculate -i "CCCC"

To make predictions, Ersilia uses the standard run API. So, instead of ersilia -v api calculate -i "CCCC", use ersilia -v api run -i "CCCC"

PromiseFru commented 11 months ago

ersilia -v api calculate -i "CCCC"

To make predictions, Ersilia uses the standard run API. So, instead of ersilia -v api calculate -i "CCCC", use ersilia -v api run -i "CCCC"

Hello @HellenNamulinda ,

Thank you for your reply. I ran the command ersilia -v api run -i "CCCC" and I received a new error message:

"TypeError: object of type 'NoneType' has no len()."

I've attached the log file for your review. log_output.txt

leilayesufu commented 11 months ago

Hi, I fixed that by opening the code at /home/leila/ersilia/ersilia/io/readers/file.py. In the read_input_columns function at line 321, I changed the code from if len(h) == 1: to if h is not None and len(h) == 1:
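In other words, the fix is a None guard before calling len(). Here is a hypothetical sketch with a stand-in function name (the real change lives in read_input_columns in ersilia/io/readers/file.py):

```python
# h holds the parsed header columns and can be None when no header was found,
# so guard against None before calling len() to avoid
# "TypeError: object of type 'NoneType' has no len()"
def pick_header_column(h):
    if h is not None and len(h) == 1:  # was: if len(h) == 1:
        return h[0]
    return h
```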

PromiseFru commented 11 months ago

Hi, I fixed that by opening the code at /home/leila/ersilia/ersilia/io/readers/file.py. In the read_input_columns function at line 321, I changed the code from if len(h) == 1: to if h is not None and len(h) == 1:

Thank you @leilayesufu. It works now 🚀

PromiseFru commented 11 months ago

Hello @samuelmaina @leilayesufu , I went ahead and opened a pull request (PR) to fix the issue mentioned above. This will make it easier for other contributors as they won't need to manually edit the files.

You can check out the PR here: https://github.com/ersilia-os/ersilia/pull/827

DhanshreeA commented 11 months ago

@PromiseFru thanks for your efforts. You generally don't need to modify the Ersilia code at this point, and therefore I have closed #827.

I have tried running the molecular weight model in an Ersilia environment with Python 3.7 and Conda version 23.5.2, and I do not run into this issue. Could you try reinstalling Ersilia and running this again?

PromiseFru commented 11 months ago

Motivation statement

I joined Outreachy because I heard it is an organization that provides opportunities for underrepresented individuals in the technical industry. Being a part of this group in my society, I found it comforting to discover a community that could provide assistance. My decision to join Ersilia was influenced by the fact that the required skill sets for the project align with my current competencies, such as Python, Conda, and Docker. While I haven't had extensive experience with Conda, I have worked on projects that primarily utilized pip. I saw this as a valuable opportunity to learn Conda and apply this knowledge in practical situations.

Additionally, I have a strong interest in the field of AI/ML and have made several attempts to teach myself the basics. However, I have found it challenging to gain a solid understanding of its practical applications. I believe that collaborating with the Ersilia community will provide me with an excellent opportunity to learn and grow in this subject. I am particularly drawn to Ersilia's mission and goals, as it focuses on providing biomedical AI/ML solutions to scientists worldwide. This resonates with me because my society does not prioritize medical technology solutions. If given the chance to learn and advance in this field, I hope to join the Ersilia community and contribute to medical solutions that can positively impact my society.

During the internship, my intention is to learn, contribute, and collaborate as much as possible within the Ersilia community. I will use this opportunity to learn from the experts in the Ersilia community to build a strong foundation in the AI/ML field so that I can give back to the community effectively. After the internship, I plan to leverage the knowledge, experience, and skills I have gained to create potential solutions for my society, all while continuing to contribute to the Ersilia community. I aspire to build a career in AI/ML, and I firmly believe that growth can only be achieved with the guidance of experts and the support of a community dedicated to personal and collective development. I see the Ersilia community as the ideal place for this growth to occur.

PromiseFru commented 11 months ago

Week 1 Tasks Completed ✅

The first week of my contribution period at Ersilia Model Hub was fantastic. It mainly involved getting acquainted with the community, meeting my fellow contributors, understanding Ersilia's mission, learning about community collaboration, and setting up the necessary tools and environment to run Ersilia's codebase.

Activities 📝

Task 1: Join the Communication Channels

Task 2: Open an Issue

Task 3: Install the Ersilia Model Hub

Task 4: Motivation Statement

Task 5: Open an Application to Ersilia

I'm excited about the Community call scheduled for Friday, October 6th, at 5:00 pm CET. I'm ready to assist any contributor who needs help with their Week 1 tasks. I'm also looking forward to Week 2 tasks. Should I start Week 2 tasks now, or should I wait until Week 2 officially begins, @DhanshreeA ?

DhanshreeA commented 11 months ago

Hi @PromiseFru, thanks for the detailed update. I see that you have completed all the tasks for Week 1; please go ahead and get started with Week 2 tasks, you do not need to wait. :)

Kadeniyi23 commented 11 months ago

@DhanshreeA has mentioned on the slack channel that we can go ahead with week 2. Good luck with that

PromiseFru commented 11 months ago

Why the ImageMol Model? 🤓

I chose this model because I'm deeply intrigued by its potential to address a critical issue in my community. Drug abuse is unfortunately prevalent, and it's alarming to witness even pharmacies contributing to this problem by providing incorrect dosages or dispensing medications without prescriptions solely for profit. The model's ability to predict molecular targets and evaluate drug properties could play a significant role in ensuring the safe and effective use of medicines. I'm eager to contribute to its implementation and see how it can make a positive impact on the healthcare landscape in my society.

PromiseFru commented 11 months ago

ImageMol Model Setup

Visit the model's Github repository here

Install environments

GPU environment

๐Ÿ› ๏ธ Install CUDA 10.1

sudo apt install nvidia-cuda-toolkit

๐Ÿงช Test installation

nvcc --version
----
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

โœ… Success

Create and activate imagemol conda environment

conda create -n imagemol python=3.7.3
conda activate imagemol

โœ… Success

Download some packages

๐Ÿ› ๏ธ Install rdkit

conda install -c rdkit rdkit 

โœ… Success


๐Ÿ› ๏ธ Install torch

pip install https://download.pytorch.org/whl/cu101/torch-1.4.0-cp37-cp37m-linux_x86_64.whl

โŒ Failed to install torch

๐Ÿ“œ Error logs

Collecting torch==1.4.0
  ERROR: HTTP error 403 while getting https://download.pytorch.org/whl/cu101/torch-1.4.0-cp37-cp37m-linux_x86_64.whl
ERROR: Could not install requirement torch==1.4.0 from https://download.pytorch.org/whl/cu101/torch-1.4.0-cp37-cp37m-linux_x86_64.whl because of HTTP error 403 Client Error: Forbidden for url: https://download.pytorch.org/whl/cu101/torch-1.4.0-cp37-cp37m-linux_x86_64.whl for URL https://download.pytorch.org/whl/cu101/torch-1.4.0-cp37-cp37m-linux_x86_64.whl

๐Ÿ”จ Possible workaround

pip install torch==1.4.0
----
Successfully installed torch-1.4.0

โœ… Success


๐Ÿ› ๏ธ Install torchvision

pip install https://download.pytorch.org/whl/cu101/torchvision-0.5.0-cp37-cp37m-linux_x86_64.whl

โŒ Failed to install torchvision

๐Ÿ“œ Error logs

Collecting torchvision==0.5.0
  ERROR: HTTP error 403 while getting https://download.pytorch.org/whl/cu101/torchvision-0.5.0-cp37-cp37m-linux_x86_64.whl
ERROR: Could not install requirement torchvision==0.5.0 from https://download.pytorch.org/whl/cu101/torchvision-0.5.0-cp37-cp37m-linux_x86_64.whl because of HTTP error 403 Client Error: Forbidden for url: https://download.pytorch.org/whl/cu101/torchvision-0.5.0-cp37-cp37m-linux_x86_64.whl for URL https://download.pytorch.org/whl/cu101/torchvision-0.5.0-cp37-cp37m-linux_x86_64.whl

๐Ÿ”จ Possible workaround

pip install torchvision==0.5.0
----
Successfully installed torchvision-0.5.0

โœ… Success


๐Ÿ› ๏ธ Install torch-cluster torch-scatter torch-sparse torch-spline-conv

pip install torch-cluster torch-scatter torch-sparse torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.4.0%2Bcu101.html

โŒ Failed to install torch-cluster torch-scatter torch-sparse torch-spline-conv

๐Ÿ“œ Error logs: output_log.txt

๐Ÿ”จ Possible workaround


๐Ÿ› ๏ธ Clone and Navigate to Repository

git clone git@github.com:HongxinXiang/ImageMol.git
cd ImageMol

Pretraining

Prepare Training data

You can find the toy dataset in ./datasets/toy/pretraining/

๐Ÿ› ๏ธ Pre-train ImageMol using a single GPU on toy dataset

python pretrain.py --ckpt_dir ./ckpts/pretraining-toy/ \
                   --checkpoints 1 \
                   --Jigsaw_lambda 1 \
                   --cluster_lambda 1 \
                   --constractive_lambda 1 \
                   --matcher_lambda 1 \
                   --is_recover_training 1 \
                   --batch 16 \
                   --dataroot ./datasets/toy/pretraining/ \
                   --dataset data \
                   --gpu 0 \
                   --ngpu 1

โŒ Failed to pre-train ImageMol

๐Ÿ“œ Error logs: output_log2.txt

๐Ÿ”จ Possible workaround

Richiio commented 11 months ago

For torch-cluster and the rest of them, try installing them one by one, as in pip install torch-cluster, then pip install torch-sparse.

Richiio commented 11 months ago

This link would help you https://pytorch-geometric.readthedocs.io/en/1.3.2/notes/installation.html

Richiio commented 11 months ago

Adding CUDA to path and whatnot

PromiseFru commented 11 months ago

Hello @Richiio

Thank you so much for the assistance.

I've tried installing the packages as you've suggested, and it turns out torch-spline-conv installed with no issues, but torch-cluster, torch-scatter, and torch-sparse failed with similar errors as in the log file above. I can provide their individual log files if needed. However, a repeated pattern of all the errors contained the following information:

error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [161 lines of output]
  No CUDA runtime is found, using CUDA_HOME='/usr'
  running bdist_wheel
  running build
  running build_py
  creating build
...

This issue appears to be a hardware limitation of my computer. I further confirmed this by visiting the resource you shared and using the guide provided to Check if PyTorch is installed with CUDA support. The command returned a False on my computer.

python -c "import torch; print(torch.cuda.is_available())"
>>> False

I'm not sure if there is a workaround for this one, except changing my computer for one compatible with CUDA. However, since I can't do that right now, I'll have to explore other models if they have environments that can run on my computer comfortably.

leilayesufu commented 11 months ago

Hi, try using conda to install the packages. Try this command: conda install pyg -c pyg

maureen-mugo commented 11 months ago

You could remove the PyTorch you installed and install the CPU version. Just visit the PyTorch official page and select the CPU version. Your current GPU can't run the CUDA version, which is why you are getting an error with PyTorch and the other packages.

PromiseFru commented 11 months ago

Hi, try using conda to install the packages. Try this command: conda install pyg -c pyg

Hello @leilayesufu

Thank you for this solution. It helped me make significant progress with my setup. I successfully installed torch-cluster, torch-scatter, and torch-sparse, and they built without any issues. Now, I'm excited to move on to pretraining the model! 🚀

PromiseFru commented 11 months ago

You could remove the PyTorch you installed and install the CPU version. Just visit the PyTorch official page and select the CPU version. Your current GPU can't run the CUDA version, which is why you are getting an error with PyTorch and the other packages.

Hello @maureen-mugo

Thank you very much for your help. Running the CPU version of PyTorch didn't work because the ImageMol model appears to require a CUDA-enabled PyTorch. I will soon attach the logs to provide more details.

DhanshreeA commented 11 months ago

Hi @PromiseFru Many thanks for the detailed updates! I see that you have resolved these issues and are moving on to pre-training; however, I am leaving some remarks here for other contributors, should they need it.

  1. Installing CUDA but no GPUs detected: Yes, this is very much possible. CUDA is a proprietary API for working with NVIDIA graphics cards. While you can in principle install it (just like any other software) on your system, it will not find the required hardware to work with. However, good news :rocket: Ersilia models are intended to be used with CPUs, with the primary aim of making them usable in low-resource settings. Therefore, it is in fact recommended that you work with this model (or any other model within Ersilia in the future) using CPU-compatible code.

  2. Issues with building torch-derived packages: As for the other torch dependencies, for some reason pip does not do a good job of figuring out which versions of torch-cluster, torch-spline-conv, torch-scatter, etc. are compatible with the installed torch version. As @Richiio rightly mentioned, try installing them one by one. I would also recommend looking through the release notes of these libraries and checking which version is compatible with your torch version. The release notes are always a great place for clearing up dependency conflicts; that helped me last time when I incorporated this model.

Good luck with experimenting with the model!

PromiseFru commented 11 months ago

Switching to the STOUT Model

Initially, I wanted to work with the ImageMol model, but due to hardware limitations I was unable to set it up and run it. So I began searching for another model to work on, and the STOUT model stood out. My motivation is similar to my motivation for choosing ImageMol: STOUT can save scientists and researchers a significant amount of time and effort when working with chemical compounds, and it reduces the risk of errors in scientific research and medical applications through its deep-learning neural machine translation approach, which generates the IUPAC name for a given molecule from its SMILES string, and vice versa. Reading about this convinced me that the model could be applied to creating medical tech solutions for my society and others. Additionally, this model is compatible with my computer's specifications. I'm genuinely curious to learn how it is applied in practice.

PromiseFru commented 11 months ago

STOUT Model Setup

Visit the model's Github repository here

Install environments

Create and activate STOUT conda environment

conda create --name STOUT python=3.8 
conda activate STOUT

✅ Success

Install STOUT

Using PyPi

pip install STOUT-pypi

✅ Success

Test Model

I used the predictor_demo.py in the STOUT repository to test the model locally.

python predictor_demo.py

Outputs

IUPAC_predictions.txt SMILES_predictions.txt

✅ Success

PromiseFru commented 11 months ago

Prediction of EML dataset with STOUT Model

The STOUT model, a deep learning neural machine translation model for chemical compounds, can predict IUPAC (International Union of Pure and Applied Chemistry) names based on a given SMILES (Simplified Molecular Input Line Entry System) notation for a chemical compound. The model works in three steps:

Prediction Process

import time
import csv
from STOUT import translate_forward

input_file_name = "eml_canonical.csv"
output_file_name = "eml_canonical_IUPAC_predictions.csv"

start = time.time()

def translate(ln_num: int, total_ln: int, smiles: str, field: str):
    try:
        prediction_start_time = time.time()
        print("===================================================")
        print(f"Line {ln_num}/{total_ln} Processing ({field}) ...")
        IUPAC_name = translate_forward(smiles)
        prediction_time = time.time() - prediction_start_time
        print(f"SMILES name: {smiles}")
        print(f"IUPAC name: {IUPAC_name}")
        print(f"Time: {prediction_time:.4f} sec")
        return IUPAC_name
    except Exception as error:
        print(f"Line {ln_num}/{total_ln} - Error: {str(error)}")
        return f"Error: {str(error)}"

with open(input_file_name, "r", encoding="utf-8") as input_file, open(
    output_file_name, "w", encoding="utf-8", newline=""
) as output_file:
    total_lines = sum(1 for _ in input_file) - 1
    input_file.seek(0)

    csv_reader = csv.reader(input_file)
    csv_writer = csv.writer(output_file)

    next(csv_reader)
    header = ["drugs", "iupac", "can_iupac"]
    csv_writer.writerow(header)

    for line_number, columns in enumerate(csv_reader, start=1):
        columns[1] = translate(line_number, total_lines, columns[1], "smiles")
        columns[2] = translate(line_number, total_lines, columns[2], "can_smiles")

        csv_writer.writerow(columns)

elapsed_time = time.time() - start
print(f"\nTotal time taken for all predictions: {elapsed_time:.4f} seconds")

The prediction process took a total of 68,387.5641 seconds.
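For context, that total converts to roughly 19 hours:

```python
# simple unit conversion of the reported wall-clock time
total_seconds = 68387.5641
hours = total_seconds / 3600
print(f"{hours:.2f} h")  # 19.00 h
```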

Sources

PromiseFru commented 11 months ago

Fetch and install STOUT Model from Ersilia Model Hub

I accessed the STOUT model on GitHub, which I found on the Ersilia Model Hub.

EOS model ID: eos4se9
Slug: smiles2iupac

Fetch the model from the remote repository using the Ersilia identifier eos4se9

ersilia fetch eos4se9

✅ Success


Serve the model

ersilia serve eos4se9

✅ Success


Check how the smiles2iupac model is running

docker ps --format '{{.ID}}\t{{.Image}}' | grep 'eos4se9'

-----

0c92ffcad288    ersiliaos/eos4se9:latest

The smiles2iupac model is running in a Docker container

✅ Success


Get information about the model

ersilia info

-----

🚀 STOUT: SMILES to IUPAC name translator
Small molecules are represented by a variety of machine-readable strings (SMILES, InChi, SMARTS, among others). On the contrary, IUPAC (International Union of Pure and Applied Chemistry) names are devised for human readers. The authors trained a language translator model treating the SMILES and IUPAC as two different languages. 81 million SMILES were downloaded from PubChem and converted to SELFIES for model training. The corresponding IUPAC names for the 81 million SMILES were obtained with ChemAxon molconvert software.

💁 Identifiers
Model identifiers: eos4se9
Slug: smiles2iupac

🤓 Code and parameters
GitHub: https://github.com/ersilia-os/eos4se9
AWS S3: https://ersilia-models-zipped.s3.eu-central-1.amazonaws.com/eos4se9.zip

🐋 Docker
Docker Hub: https://hub.docker.com/r/ersiliaos/eos4se9
Architectures: AMD64

For more information, please visit https://ersilia.io/model-hub

✅ Success

PromiseFru commented 11 months ago

Prediction of EML dataset with Ersilia STOUT Model (eos4se9)

Run prediction

ersilia -v api run -i eml_canonical.csv -o ersilia_eml_canonical_IUPAC_predictions.csv

After executing the prediction command for over 6 hours, I finally received an exit return on the terminal:

| DEBUG    | Status code: 504
| ERROR    | Status Code: 504
| DEBUG    | Status code: 504
| ERROR    | Status Code: 504
| DEBUG    | Status code: 504
| ERROR    | Status Code: 504
| DEBUG    | Status code: 504
| ERROR    | Status Code: 504
| DEBUG    | Schema available in /home/eos/dest/eos4se9/api_schema.json
| DEBUG    | Done with unique posting
| DEBUG    | Data: outcome
| DEBUG    | Values: [None]
| DEBUG    | Datatype: string_array
ersilia_eml_canonical_IUPAC_predictions.csv

When I examined the output file, it appeared as follows:

key,input,iupacs_names
MCGSCOLBFJQGHM-SCZZXKLOSA-N,Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1,
GZOSMCIZMLWJML-VJLLXTKPSA-N,C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5,
BZKPWHYZMXOIDC-UHFFFAOYSA-N,CC(=O)Nc1sc(nn1)[S](N)(=O)=O,
QTBSBXVTEAMEQO-UHFFFAOYSA-N,CC(O)=O,
PWKSKIMOESPYIA-BYPYZUCNSA-N,CC(=O)N[C@@H](CS)C(O)=O,
BSYNRYMUTXBXSQ-UHFFFAOYSA-N,CC(=O)Oc1ccccc1C(O)=O,

(more data ...)

The iupacs_names column was empty, but my CPUs were still consistently at 90%+ usage, so I assumed the model was still running and would eventually update the file with the IUPAC names.
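Rather than eyeballing the file, a small helper (my own sketch, assuming the output columns shown above) can count how many rows actually received a prediction:

```python
import csv

def count_filled(path, column="iupacs_names"):
    # counts rows whose prediction column is non-empty
    with open(path, newline="") as f:
        return sum(1 for row in csv.DictReader(f) if (row.get(column) or "").strip())
```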

I attempted to locate any logs from the Docker container using the following command:

docker logs --follow eos4se9_d537

-----

+ [ -z eos4se9 ]
+ ersilia serve -p 3000 eos4se9
🚀 Serving model eos4se9: smiles2iupac

   URL: http://127.0.0.1:3000
   PID: 37
   SRV: conda

👉 To run model:
   - run

💁 Information:
   - info
Serving model eos4se9...
+ echo Serving model eos4se9...
+ nginx -g daemon off;

However, all I could see was the nginx start command. I will need to wait a bit longer for my CPUs to return to normal before making any conclusions.

---- A few hours later ----

My CPUs returned to normal usage, but the output file hadn't been updated with the 'iupacs_names.' I decided to investigate further by running the command on a modified set of data, which only contained two rows of the original EML data. I didn't stream the output to a file.

ersilia api run -i eml_canonical_copy.csv

-----

{
    "input": {
        "key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N",
        "input": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1",
        "text": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"
    },
    "output": {
        "outcome": [
            null
        ]
    }
}
{
    "input": {
        "key": "GZOSMCIZMLWJML-VJLLXTKPSA-N",
        "input": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5",
        "text": "C[C@]12CC[C@H](O)CC1=CC[C@@H]3[C@@H]2CC[C@@]4(C)[C@H]3CC=C4c5cccnc5"
    },
    "output": {
        "outcome": [
            null
        ]
    }
}

I also ran a single input

ersilia api run -i "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"

-----

{
    "input": {
        "key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N",
        "input": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1",
        "text": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"
    },
    "output": {
        "outcome": [
            null
        ]
    }
}

I tried using a butane SMILES notation CCCC

ersilia api run -i "CCCC"

-----

{
    "input": {
        "key": "IJDNQMDRQITEOD-UHFFFAOYSA-N",
        "input": "CCCC",
        "text": "CCCC"
    },
    "output": {
        "outcome": [
            "butane"
        ]
    }
}

The model correctly predicted the IUPAC name. At that moment, I wasn't sure if I was making a mistake, and I would appreciate some help 🙏🏾.

Environment

Operating System

Packages

leilayesufu commented 11 months ago

Try serving the model again and test with different inputs

PromiseFru commented 11 months ago

Try serving the model again and test with different inputs

Hello @leilayesufu,

I appreciate your assistance.

I already recreated my virtual environment, reinstalled ersilia, re-fetched the model, and ran tests with methane, butan-1-ol, and butane, using SMILES C, OCCCC, and CCCC respectively. The model correctly predicted their IUPAC names. However, when I attempted to run any SMILES from the EML dataset, the model returned null.

joiboi08 commented 11 months ago

Hi @PromiseFru, were you able to run eos4se9 with EML dataset?

PromiseFru commented 11 months ago

Hi @PromiseFru, were you able to run eos4se9 with EML dataset?

Hello @joiboi08

I've successfully made predictions directly from the Docker container. I tested it with a few individual SMILES, and now I'm running the EML dataset directly from the Docker container. I'll keep you updated on whether it makes predictions this time or not.

DhanshreeA commented 11 months ago

Hi @PromiseFru, I seem to be facing some issues with running Ersilia locally, and I am unable to reproduce your issue right now, but I will look into it for sure. Meanwhile, could you tell me what you mean by the following:

I've successfully made predictions directly from the Docker container. I tested it with a few individual SMILES, and now I'm running the EML dataset directly from the Docker container. I'll keep you updated on whether it makes predictions this time or not.

PromiseFru commented 11 months ago

Thank you, @DhanshreeA, for your response.

Over the past few days, I've faced challenges making predictions from my locally installed version of Ersilia for the SMILES in the EML dataset, as mentioned here. Today, I tried a different approach since I noticed the logs showed a Status code 504 error, indicating that the Ersilia API might not be receiving a response from the Docker container in a timely manner. I decided to run Ersilia directly from the Docker container.

docker exec -it eos4se9_c020 bash

I then made a prediction using a single SMILES from the EML dataset:

ersilia -v api run -i "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"

-----

{
    "input": {
        "key": "MCGSCOLBFJQGHM-SCZZXKLOSA-N",
        "input": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1",
        "text": "Nc1nc(NC2CC2)c3ncn([C@@H]4C[C@H](CO)C=C4)c3n1"
    },
    "output": {
        "outcome": [
            "[(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol"
        ]
    }
}

The prediction was successful, unlike my previous attempts to run predictions using my locally installed version of Ersilia as described here.

As a result, I decided to proceed with the prediction process as outlined here, but this time within the eos4se9_c020 Docker container.

PromiseFru commented 11 months ago

I was able to run predictions directly within the eos4se9_c020 Docker container. The run was going well, but unfortunately, after more than 5 hours of prediction, it ended with a connection error, and I forgot to capture the traceback. Here are the steps I followed:

  1. Access the docker container:
docker exec -it eos4se9_c020 bash
  2. Download the EML dataset:
wget https://raw.githubusercontent.com/ersilia-os/ersilia/master/notebooks/eml_canonical.csv
  3. Run predictions:
ersilia -v api run -i eml_canonical.csv -o ersilia_eml_canonical_IUPAC_predictions.csv

-----
# Output:

10:02:53 | DEBUG    | Reading standard file from /tmp/ersilia-h_zmdo4x/standard_input_file.csv
10:02:53 | DEBUG    | File has 443 lines
10:02:53 | DEBUG    | No file splitting necessary!
10:02:54 | DEBUG    | Reading card from eos4se9
10:02:54 | DEBUG    | Reading shape from eos4se9
10:02:54 | DEBUG    | Input Shape: Single
10:02:54 | DEBUG    | Input type is: compound
10:02:54 | DEBUG    | Input shape is: Single
10:02:54 | DEBUG    | Importing module: .types.compound
10:02:54 | DEBUG    | Checking RDKIT and other requirements necessary for compound inputs
10:02:54 | DEBUG    | InputShapeSingle shape: Single
10:02:54 | DEBUG    | API eos4se9:run initialized at URL http://127.0.0.1:3000
10:02:54 | DEBUG    | Schema available in /root/eos/dest/eos4se9/api_schema.json
10:02:54 | DEBUG    | Posting to run
10:02:54 | DEBUG    | Batch size 100
10:02:54 | DEBUG    | Stopping sniffer for finding delimiter
10:02:54 | DEBUG    | Expected number: 1
10:02:54 | DEBUG    | Entity is list: False
10:02:54 | DEBUG    | Stopping sniffer for resolving column types
10:02:54 | DEBUG    | Has header True
10:02:54 | DEBUG    | Schema {'input': [1], 'key': None}
10:02:54 | DEBUG    | Standardizing input single
10:02:54 | DEBUG    | Reading standard file from /tmp/ersilia-rmalz1lu/standard_input_file.csv
10:02:54 | DEBUG    | Schema available in /root/eos/dest/eos4se9/api_schema.json
11:21:48 | DEBUG    | Status code: 200
11:21:48 | DEBUG    | Schema available in /root/eos/dest/eos4se9/api_schema.json
12:38:09 | DEBUG    | Status code: 200
13:51:55 | DEBUG    | Status code: 200
15:04:01 | DEBUG    | Status code: 200
15:31:14 | DEBUG    | Status code: 200
15:31:14 | DEBUG    | Done with unique posting

HellenNamulinda commented 11 months ago

Hello @PromiseFru, Thank you for your efforts in running the model.

For models fetched from Docker (the default for the Ersilia CLI), one thing to note is that some models are computationally intensive (requiring up to 16 GB of RAM to work effectively). The Status code: 504 error suggests that the Docker container may not be responding to requests in a timely manner.

Since the model is pulled from Docker, consider increasing the RAM available to Docker in Docker Desktop to 16 GB. By default, Docker Desktop uses up to 2 GB of your host's memory. To increase the RAM, go to Settings > Resources > Advanced.

Another option is to fetch the model from GitHub by adding the --from_github flag to the command (ersilia -v fetch eos4se9 --from_github > eos4se9_fetch_github.log 2>&1).

Also, to avoid running predictions for long hours, I suggest you consider the first 10 (or 50) molecules in the EML dataset and use those for comparison.

Please share the specifications of your machine for further help.
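Building on the suggestion above, a quick way to cut the dataset down (my own sketch, with hypothetical file names) is to keep the header plus the first N rows:

```python
import csv

def head_csv(src, dst, n=10):
    # copy the header and the first n data rows into a smaller input file
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader, writer = csv.reader(fin), csv.writer(fout)
        writer.writerow(next(reader))  # keep the header row
        for i, row in enumerate(reader):
            if i == n:
                break
            writer.writerow(row)

# e.g. head_csv("eml_canonical.csv", "eml_canonical_head10.csv", 10)
```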

PromiseFru commented 11 months ago

Hello @PromiseFru, Thank you for your efforts in running the model.

For models fetched from Docker (the default for the Ersilia CLI), one thing to note is that some models are computationally intensive (requiring up to 16 GB of RAM to work effectively). The Status code: 504 error suggests that the Docker container may not be responding to requests in a timely manner.

Since the model is pulled from Docker, consider increasing the RAM available to Docker in Docker Desktop to 16 GB. By default, Docker Desktop uses up to 2 GB of your host's memory. To increase the RAM, go to Settings > Resources > Advanced.

Another option is to fetch the model from GitHub by adding the --from_github flag to the command (ersilia -v fetch eos4se9 --from_github > eos4se9_fetch_github.log 2>&1).

Also, to avoid running predictions for long hours, I suggest you consider the first 10 (or 50) molecules in the EML dataset and use those for comparison.

Please share the specifications of your machine for further help.

Hello @HellenNamulinda,

Thank you for your assistance 🤗. I'm using the Docker CLI and don't have Docker Desktop installed. On Linux, Docker automatically grants containers full access to the host's resources. To confirm this, I checked how many resources were allocated to the eos4se9_c020 Docker container:

docker container stats eos4se9_c020

-----

CONTAINER ID   NAME           CPU %     MEM USAGE / LIMIT     MEM %     NET I/O         BLOCK I/O       PIDS
667d7bf3f1db   eos4se9_c020   0.01%     144.7MiB / 15.59GiB   0.91%     39kB / 10.3kB   168MB / 201kB   10

The result indicates that the container has a memory limit of 15.59 GiB, which matches my system's RAM capacity, confirming that it has full access to my system's RAM.

Fetching from GitHub is a valid option, as you suggested. I'll definitely try that when I recharge my internet data bundle later in the day.

I wasn't aware that I could reduce the EML dataset to speed up predictions, as I was trying not to modify the file. Thank you for the information; reducing the dataset will indeed decrease prediction time and help me obtain results faster.

System Specification

Memory: 16.0 GB
Processor: Intel® Core™ i5-2500S CPU @ 2.70GHz × 4
Graphics: Mesa Intel® HD Graphics 2000 (SNB GT1) / AMD® Turks

PromiseFru commented 11 months ago

Compare the Results of the Original STOUT Model with the Ersilia STOUT Model (eos4se9)

After several days of attempting to run predictions with the Ersilia model locally, I finally succeeded in making the predictions using the method I described here. Following a suggestion from @HellenNamulinda, I reduced the EML dataset to 50 entries to decrease prediction time.
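For reference, trimming the dataset takes only a few lines of Python. Here is a minimal sketch (the function name and file names are my own, not part of Ersilia) that keeps the header row and the first n molecules:

```python
import csv

def take_first_n(src_path, dst_path, n=50):
    # Copy the header plus the first n data rows of a CSV file,
    # e.g. to shrink the EML dataset before running predictions.
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # header row
        for i, row in enumerate(reader):
            if i >= n:
                break
            writer.writerow(row)

# e.g. take_first_n("eml_canonical.csv", "eml_canonical_50.csv", n=50)
```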

I created a Python script to compare the outputs from the two models: the Ersilia STOUT Model (eos4se9) and the Original STOUT model.

import csv

def read_csv_to_dict(file_path, compare_column):
    data = {}
    with open(file_path, newline="", encoding="utf-8") as csvfile:
        reader = csv.DictReader(csvfile)
        for index, row in enumerate(reader, 1):
            data[index] = row[compare_column]
    return data

def compare_csv_files(file1, file2, compare_column1, compare_column2):
    data1 = read_csv_to_dict(file1, compare_column1)
    data2 = read_csv_to_dict(file2, compare_column2)
    drug_data2 = read_csv_to_dict(file2, "drugs")

    col_differences = {}
    for index, value1 in data1.items():
        value2 = data2.get(index)
        drug_name = drug_data2.get(index)
        if value1 != value2:
            col_differences[index] = (value1, value2, drug_name)

    return col_differences

def write_md_diff_file(output_file, differences, compare_column1, compare_column2):
    with open(output_file, "w", encoding="utf-8") as md_file:
        md_file.write(
            "| Index | Ersilia_STOUT_Prediction | STOUT Prediction | Drug Name |\n"
        )
        md_file.write("|-------|------------|------------|-----------|\n")

        for index, (value1, value2, drug_name) in differences.items():
            md_file.write(f"| {index} | {value1} | {value2} | {drug_name} |\n")

if __name__ == "__main__":
    file1 = "ersilia_eml_canonical_IUPAC_predictions.csv"
    file2 = "eml_canonical_IUPAC_predictions.csv"
    output_file = "differences.md"

    compare_column1 = "iupacs_names"
    compare_column2 = "iupac"

    differences = compare_csv_files(file1, file2, compare_column1, compare_column2)
    write_md_diff_file(output_file, differences, compare_column1, compare_column2)

    print("Differences have been written to 'differences.md'")

Here are the results:

| Index | Ersilia_STOUT_Prediction | STOUT Prediction | Drug Name |
|-------|------------|------------|-----------|
| 1 | [(1R,4R)-4-[2-amino-4-(cyclopropylamino)-4H-purin-9-yl]cyclopent-2-en-1-yl]methanol | [(1S,4R)-4-[2-amino-6-(cyclopropylamino)purin-9-yl]cyclopent-2-en-1-yl]methanol | abacavir |
| 2 | (1S,2S,5S,10R,11R,14S)-5,11-dimethyl-5-pyridin-3-yltetracyclo[9.4.0.02,6.010,14]pentadeca-7,16-dien-14-ol | (3S,8R,9S,10R,13S,14S)-10,13-dimethyl-17-pyridin-3-yl-2,3,4,7,8,9,11,12,14,15-decahydro-1H-cyclopenta[a]phenanthren-3-ol | abiraterone |
| 3 | N-[5-[amino(dioxo)-λ6-thia-3,4-diazacyclopent-2-en-2-yl]acetamide | N-(5-sulfamoyl-1,3,4-thiadiazol-2-yl)acetamide | acetazolamide |
| 8 | 2-[(3R)-1-(3-phenoxypropyl)-1-azoniabicyclo[2.2.2]octan-3-yl]oxy-1,1-dithiophen-2-ylethanol | [(3R)-1-(3-phenoxypropyl)-1-azoniabicyclo[2.2.2]octan-3-yl]2-hydroxy-2,2-dithiophen-2-ylacetate | aclidinium |
| 9 | (E)-N-[6-[[(3-chloro-4-fluorocyclohexa-1,4-dien-1-yl)amino]methylidene]-3-[(3S)-oxolan-3-yl]oxycyclopenta[d]pyrimidin-2-yl]-4-(dimethylamino)but-2-enamide | (E)-N-[4-(3-chloro-4-fluoroanilino)-7-[(3S)-oxolan-3-yl]oxyquinazolin-6-yl]-4-(dimethylamino)but-2-enamide | afatinib |
| 12 | 5-acetamido-2,4,6-triiodo-3-(1-oxoethylamino)cyclohexa-4,6-diene-1-carboxylicacid | 3,5-diacetamido-2,4,6-triiodobenzoicacid | amidotrizoate |
| 13 | (2S)-N-[(1R,2R,3R,5S,6R)-5-amino-2-[(2R,3R,4R,5R,6R)-3-amino-4,5,6-trihydroxyoxan-2-yl]oxy-3-[(2R,3S,4R,5R)-5-amino-1,3,4,6-tetrahydroxyhexan-2-yl]oxy-1-hydroxyoxetan-6-yl]-2-hydroxy-4-(methylamino)butanamide | (2S)-4-amino-N-[(1R,2S,3S,4R,5S)-5-amino-2-[(2S,3R,4S,5S,6R)-4-amino-3,5-dihydroxy-6-(hydroxymethyl)oxan-2-yl]oxy-4-[(2R,3R,4S,5S,6R)-6-(aminomethyl)-3,4,5-trihydroxyoxan-2-yl]oxy-3-hydroxycyclohexyl]-2-hydroxybutanamide | amikacin |
| 14 | 3,5-diamino-2-chloro-N-(diaminomethylidene)-2H-pyrazine-6-carboxamide | 3,5-diamino-6-chloro-N-(diaminomethylidene)pyrazine-2-carboxamide | amiloride |
| 15 | 2-butyl-3-[4-[2-(diethylamino)ethoxy]-3,5-diiodocyclohexa-1,4-dien-1-yl]chromen-4-one | (2-butyl-1-benzofuran-3-yl)-[4-[2-(diethylamino)ethoxy]-3,5-diiodophenyl]methanone | amiodarone |
| 17 | ethyl2-(2-aminoethoxymethyl)-4-[[3-(2-chlorophenyl)-4-methoxy-4-oxobut-2-en-2-yl]amino]cyclopenta-1,3-diene-1-carboxylate | 3-O-ethyl5-O-methyl2-(2-aminoethoxymethyl)-4-(2-chlorophenyl)-6-methyl-1,4-dihydropyridine-3,5-dicarboxylate | amlodipine |
| 18 | 12-chloro-7-(diethylaminomethyl)-2,9-diazatricyclo[8.4.0.03,8]tetradeca-1(14),4,6,9,10,13-hexaen-6-ol | 4-[(7-chloroquinolin-4-yl)amino]-2-(diethylaminomethyl)phenol | amodiaquine |
| 19 | (2S,5R,6R)-5-[[(2R)-2-amino-2-(4-hydroxycyclohexa-1,3,5-trien-1-yl)acetyl]amino]-3,3-dimethyl-8-oxo-4-thia-1,7-diazabicyclo[4.3.0]nonane-2-carboxylicacid;tetrahydrate | (2S,5R,6R)-6-[[(2R)-2-amino-2-(4-hydroxyphenyl)acetyl]amino]-3,3-dimethyl-7-oxo-4-thia-1-azabicyclo[3.2.0]heptane-2-carboxylicacid;trihydrate | amoxicillin |
| 20 | (1S,3S,5S,7S,9R,10R,13R,18S,19R,20R,21S,22Z,24Z,26Z,28Z,30Z,32Z,34Z,36Z,38Z,40S,41R)-1-[(2S,3S,4R,5S,6R)-4-amino-3,5-dihydroxy-6-[(2R,3S,4R,5S,6R)-5-amino-3,4-dihydroxyoxan-2-yl]oxan-2-yl]oxy-3,5,7,9,10,13,18,41-octahydroxy-19,20,21-trimethyl-15-oxo-4,16,42-trioxatricyclo[37.2.1.03,5]dotetraconta-22,24,26,28,30,32,34,36,38-nonaene-40-carboxylicacid | (1R,3S,5R,6R,9R,11R,15S,16R,17R,18S,19Z,21Z,23Z,25Z,27Z,29Z,31Z,33R,35S,36R,37S)-33-[(2R,3S,4S,5S,6R)-4-amino-3,5-dihydroxy-6-methyloxan-2-yl]oxy-1,3,5,6,9,11,17,37-octahydroxy-15,16,18-trimethyl-13-oxo-14,39-dioxabicyclo[33.3.1]nonatriaconta-19,21,23,25,27,29,31-heptaene-36-carboxylicacid | amphotericin B |
| 21 | (2S,5R,6R)-7-[[(2R)-2-amino-2-phenylacetyl]amino]-3,3-dimethyl-8-oxo-4-thia-1,7-diazabicyclo[3.3.0]octane-2-carboxylicacid | (2S,5R,6R)-6-[[(2R)-2-amino-2-phenylacetyl]amino]-3,3-dimethyl-7-oxo-4-thia-1-azabicyclo[3.2.0]heptane-2-carboxylicacid | ampicillin |
| 22 | 5-[3-(2-cyanopropan-2-yl)-6-(1,2,4-triazol-1-ylmethyl)cyclohexa-2,4-dien-1-yl]-2,2-dimethylbutanenitrile | 2-[3-(2-cyanopropan-2-yl)-5-(1,2,4-triazol-1-ylmethyl)phenyl]-2-methylpropanenitrile | anastrozole |
| 23 | (4S,6R,7S,10S,11S,14S,15S,16S,20S,23R,26S)-16,17,23,26-tetrahydroxy-7-(4-hydroxycyclohexa-1,3,5-trien-1-yl)-11-(4-pentoxycyclohexa-2,4,6-trien-1-ylidene)-2-[[(2S,3S,4S)-3,4-dihydroxy-4-(4-hydroxycyclohexa-1,3,5-trien-1-yl)-2-[[(3S,4S,6R)-4-hydroxy-1-[(2S,3S)-3-hydroxybutan-2-yl]-2,6-dioxopiperazine-3-carbonyl]amino]butanoyl]amino]-14-methyl-2,5,12,17,24-hexazapentacyclo[24.2.2.218,21.04,10.06,14]dotriaconta-1(29),18(30),19,21(31),27,32-hexaene-3,11,13-trione | N-[(3S,6S,9S,11R,15S,18S,20R,21R,24S,25S,26S)-6-[(1S,2S)-1,2-dihydroxy-2-(4-hydroxyphenyl)ethyl]-11,20,21,25-tetrahydroxy-3,15-bis[(1S)-1-hydroxyethyl]-26-methyl-2,5,8,14,17,23-hexaoxo-1,4,7,13,16,22-hexazatricyclo[22.3.0.09,13]heptacosan-18-yl]-4-[4-(4-pentoxyphenyl)phenyl]benzamide | anidulafungin |
| 24 | 14-[amino(oxo)methyl]-12-(4-methoxycyclohepta-2,4,6-trien-1-ylidene)-5-(2-oxopiperidin-1-yl)-4,11,12-triazatricyclo[7.3.2.14,8]pentadeca-1(13),6,8(15),10-tetraen-15-one | 1-(4-methoxyphenyl)-7-oxo-6-[4-(2-oxopiperidin-1-yl)phenyl]-4,5-dihydropyrazolo[3,4-c]pyridine-3-carboxamide | apixaban |
| 25 | (5R,6S)-5-(4-fluorocyclohepta-1,3,6-trien-1-yl)-6-[(1R)-1-[5,5,5-trifluoro-4-(trifluoromethyl)penta-1,3-dienyl]ethoxy]-1,2,5,6-tetrahydro-1,4,7-oxadiazocin-3-one | 5-[[(2S,3R)-2-[(1R)-1-[3,5-bis(trifluoromethyl)phenyl]ethoxy]-3-(4-fluorophenyl)morpholin-4-yl]methyl]-1,2-dihydro-1,2,4-triazol-3-one | aprepitant |
| 26 | arsorosooxy(oxo)arsane | oxoarsanyloxyarsenic | arsenic trioxide |
| 27 | (1R,4S,5R,8S,9R,10S,12S,13S)-10-methoxy-5,9-dimethyl-11,14,15,16-tetraoxatetracyclo[10.3.1.04,13.08,13]hexadecane | (1R,4S,5R,8S,9R,10S,12R,13R)-10-methoxy-1,5,9-trimethyl-11,14,15,16-tetraoxatetracyclo[10.3.1.04,13.08,13]hexadecane | artemether |
| 28 | 4-oxo-4-[(1S,4R,5S,8S,9R,10S,15S)-4,9,12-trimethyl-11,16,17,18-tetraoxatetracyclo[10.3.2.05,15.08,15]heptadecan-10-yl]butanoicacid | 4-oxo-4-[[(4S,5R,8S,9R,10R,12R,13R)-1,5,9-trimethyl-11,14,15,16-tetraoxatetracyclo[10.3.1.04,13.08,13]hexadecan-10-yl]oxy]butanoicacid | artesunate |
| 29 | 5-(1,2-dihydroxyethyl)-4-methylidenefuran-2,3-diol | 2-(1,2-dihydroxyethyl)-4,5-dihydroxyfuran-3-one | ascorbic acid |
| 30 | methylN-[(2S)-1-[2-[(2S,3S)-2-hydroxy-3-[[(2S)-2-(methoxycarbonylamino)-3,3-dimethylbutanoyl]amino]-4-phenylbutyl]-2-[(4-pyridin-2-ylcyclohexa-2,5-dien-1-yl)methyl]hydrazinyl]-3,3-dimethyl-1-oxobutan-2-yl]carbamate | methylN-[(2S)-1-[2-[(2S,3S)-2-hydroxy-3-[[(2S)-2-(methoxycarbonylamino)-3,3-dimethylbutanoyl]amino]-4-phenylbutyl]-2-[(4-pyridin-2-ylphenyl)methyl]hydrazinyl]-3,3-dimethyl-1-oxobutan-2-yl]carbamate | atazanavir |
| 31 | (3R,5R)-7-[2-(4-fluorocyclohepta-2,4,6-trien-1-ylidene)-3-phenyl-4-(phenylcarbamoyl)-5-propan-2-yl-3H-pyrrol-1-yl]-3,5-dihydroxyheptanoicacid | (3R,5R)-7-[2-(4-fluorophenyl)-3-phenyl-4-(phenylcarbamoyl)-5-propan-2-ylpyrrol-1-yl]-3,5-dihydroxyheptanoicacid | atorvastatin |
| 32 | 5-[3-[1-[(4,5-dimethoxycyclohexa-1,3,5-trien-1-yl)methyl]-7,8-dimethoxy-2-methyl-1,3,4,6-tetrahydroisoquinolin-2-ium-2-yl]propanoyloxy]pentyl13-[4-[2-[4,5,6-trimethoxy-10-(4,5-dimethoxycyclohexa-2,4-dien-1-ylidene)cyclobut-2-en-1-yl]ethyl]-4-methyl-7-oxo-1-oxa-4-azoniacyclononan-1-yl]propanoate | 5-[3-[1-[(3,4-dimethoxyphenyl)methyl]-6,7-dimethoxy-2-methyl-3,4-dihydro-1H-isoquinolin-2-ium-2-yl]propanoyloxy]pentyl3-[1-[(3,4-dimethoxyphenyl)methyl]-6,7-dimethoxy-2-methyl-3,4-dihydro-1H-isoquinolin-2-ium-2-yl]propanoate | atracurium |
| 33 | (9-methyl-4-oxa-9-azabicyclo[4.2.1]nonan-5-yl)3-hydroxy-2-phenylpropanoate | (8-methyl-8-azabicyclo[3.2.1]octan-3-yl)3-hydroxy-2-phenylpropanoate | atropine |
| 34 | [(2S,5R)-2-(carbamoyl)-7-oxo-1,6-diazabicyclo[3.2.1]octan-6-yl]hydrogensulfate | [(2S,5R)-2-carbamoyl-7-oxo-1,6-diazabicyclo[3.2.1]octan-6-yl]hydrogensulfate | avibactam |
| 35 | 1-methyl-4-nitro-5-(7H-purin-6-ylsulfanyl)-4H-pyrimidine | 6-(3-methyl-5-nitroimidazol-4-yl)sulfanyl-7H-purine | azathioprine |
| 36 | (2R,3S,5R,6S,7R,9S)-7-[(2R,4R)-5-[[(2R,3R,4R,5R)-4,5-dihydroxy-3-methoxy-5-methyloxan-2-yl]-methylamino]-2-hydroxy-4-methylpentan-2-yl]-9-[(2R,4S,5S,6S)-4-(dimethylamino)-5-hydroxypentan-2-yl]oxy-3-ethyl-6-hydroxy-2,6-dimethyl-4-[(2R,4R,5S,6S)-5-hydroxy-4-methoxy-4-methyloxan-2-yl]oxyoxonan-1-one | (2R,3S,4R,5R,8R,10R,11R,13S,14R)-11-[(2S,3R,4S,6R)-4-(dimethylamino)-3-hydroxy-6-methyloxan-2-yl]oxy-2-ethyl-3,4,10-trihydroxy-13-[(2R,4R,5S,6S)-5-hydroxy-4-methoxy-4,6-dimethyloxan-2-yl]oxy-3,5,6,8,10,12,14-heptamethyl-1-oxa-6-azacyclopentadecan-15-one | azithromycin |
| 38 | (1S,10S,11S,13S,14S,15S,17S)-18-chloro-14,17-dihydroxy-14-(2-hydroxyacetyl)-13,15,18-trimethyltetracyclo[8.7.1.01,6.011,15]octadeca-2,5-dien-4-one | (8S,9R,10S,11S,13S,14S,16S,17R)-9-chloro-11,17-dihydroxy-17-(2-hydroxyacetyl)-10,13,16-trimethyl-6,7,8,11,12,14,15,16-octahydrocyclopenta[a]phenanthren-3-one | beclometasone |
| 40 | 11-[bis(2-chloroethyl)amino]-4-methyl-2,4-diazabicyclo[7.3.1]trideca-1(12),2,9-triene-3-carboxylicacid | 4-[5-[bis(2-chloroethyl)amino]-1-methylbenzimidazol-2-yl]butanoicacid | bendamustine |
| 41 | 2-amino-N'-[(4,6-dihydroxycyclohexa-1,3,5-trien-1-yl)methyl]-3-hydroxypropanehydrazide | 2-amino-3-hydroxy-N'-[(2,3,4-trihydroxyphenyl)methyl]propanehydrazide | benserazide |
| 42 | (3R,6R,8R)-2,2-dimethyl-5-oxo-6-(2-phenylacetyl)-4-thia-1,7-diazabicyclo[4.3.0]nonane-3-carboxylicacid;N-benzyl-N'-(cyclohexa-2,4,6-trien-1-ylmethyl)ethane-1,2-diamine;(2S,5R,6R)-3,3-dimethyl-7-oxo-2-(2-phenylacetyl)-4-thia-1,8-diazabicyclo[4.3.0]nonane-5-carboxylicacid | N,N'-dibenzylethane-1,2-diamine;(2S,5R,6R)-3,3-dimethyl-7-oxo-6-[(2-phenylacetyl)amino]-4-thia-1-azabicyclo[3.2.0]heptane-2-carboxylicacid | benzathine benzylpenicillin |
| 43 | N-benzyl-2-(2-nitroimidazol-1-yl)ethanamine | N-benzyl-2-(2-nitroimidazol-1-yl)acetamide | benznidazole |
| 44 | phenylmethylbenzenecarboperoxoate | benzoylbenzenecarboperoxoate | benzoyl peroxide |
| 45 | benzylbenzenecarboxylate | benzylbenzoate | benzyl benzoate |
| 47 | (1S,2S,4S,5S,6S,8S,9S,17S)-17-fluoro-5,6-dihydroxy-5-(2-hydroxyacetyl)-4,6,17-trimethyltetracyclo[7.7.1.01,12.02,8]heptadeca-11,14-dien-15-one | (8S,9R,10S,11S,13S,14S,16S,17R)-9-fluoro-11,17-dihydroxy-17-(2-hydroxyacetyl)-10,13,16-trimethyl-6,7,8,11,12,14,15,16-octahydrocyclopenta[a]phenanthren-3-one | betamethasone |
| 48 | N-(4-cyano-5,5,5-trifluoropenta-2,4-dien-1-yl)-3-(4-fluorocyclohepta-1,3,6-trien-1-yl)sulfonyl-2-hydroxy-2-methylpropanamide | N-[4-cyano-3-(trifluoromethyl)phenyl]-3-(4-fluorophenyl)sulfonyl-2-hydroxy-2-methylpropanamide | bicalutamide |
| 49 | 1-[3-(2-bicyclo[2.2.1]hept-5-enyl)-3-methoxy-3-phenylpropyl]piperidine | 1-(2-bicyclo[2.2.1]hept-5-enyl)-1-phenyl-3-piperidin-1-ylpropan-1-ol | biperiden |
| 50 | [4-[[5-(1-oxoethoxy)cyclohepta-1,3,6-trien-1-yl]-pyridin-2-ylmethyl]cyclohexa-1,5-dien-1-yl]acetate | [4-[(4-acetyloxyphenyl)-pyridin-2-ylmethyl]phenyl]acetate | bisacodyl |

Sources

differences.md
eml_canonical_IUPAC_predictions.csv
ersilia_eml_canonical_IUPAC_predictions.csv

PromiseFru commented 11 months ago

1st Model Suggestion

Accelerating Prototype-Based Drug Discovery with Conditional Diversity Networks

Publication: ACM Digital Library

Published: 2018

Authors: Shahar Harel, Kira Radinsky

Source Code: GitHub - PyTorch, GitHub - TensorFlow

Dataset: Zinc Dataset

What this model does

This machine learning model helps scientists find new drug candidates quickly and at a lower cost. It does this by suggesting novel molecules based on existing drugs, and it has already rediscovered some approved drugs; one of these is Isoniazid, which treats tuberculosis. Here's how the model works:

  1. It starts by turning the molecule (written in SMILES notation) into numbers using the encoder function. This helps the model understand the molecule mathematically.

  2. The model then uses math to pick out important parts from these numbers, helping it understand the molecule's structure.

  3. A "diversity layer" then injects controlled random variation into this representation, so that the generated molecules differ from the prototype drug while still following chemical constraints.

  4. Finally, it uses a Recurrent Neural Network (RNN) to put the molecule together step by step, making sure it follows the rules of chemistry.
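The four steps above can be illustrated with a toy sketch. This is not the authors' implementation: the encoder, diversity layer, and decoder below are deliberately simplistic stand-ins for the paper's learned networks.

```python
import random

# Toy sketch of the Conditional Diversity Networks pipeline (illustrative only):
# 1. encode a prototype SMILES string into a vector,
# 2. perturb it with a "diversity layer",
# 3. decode a candidate molecule from the perturbed vector.

VOCAB = list("CNO()=#123456789")

def encode(smiles):
    # stand-in encoder: bag-of-characters counts as a fixed-size vector
    return [float(smiles.count(ch)) for ch in VOCAB]

def diversity_layer(z, scale=0.5, seed=0):
    # add controlled noise so generated molecules stay *near* the prototype
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, scale) for v in z]

def decode(z, length=5):
    # stand-in for the RNN decoder: greedily emit highest-weight characters
    ranked = sorted(zip(VOCAB, z), key=lambda p: -p[1])
    return "".join(ch for ch, _ in ranked[:length])

prototype = "CCO"  # ethanol as a toy prototype molecule
candidate = decode(diversity_layer(encode(prototype)))
print(candidate)
```

In the real model, the encoder and decoder are trained neural networks and the diversity layer samples from a learned latent distribution rather than adding fixed Gaussian noise.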

Why this matters to Ersilia

This model aligns with Ersilia's mission by making drug discovery faster and cheaper, which is especially useful for developing countries. It has also helped discover drugs for infectious diseases like tuberculosis, showing its potential to find drugs for other infectious and neglected diseases.

Code Implementation

The TensorFlow implementation needs updates and isn't ready to use. It lacks documentation, the model checkpoints referenced in the evaluate.py script as model-45000 are missing, it has no license, and there are syntax errors in the scripts.

The PyTorch version seems closer to ready for use, but it lacks a method for running inference with the model; it mainly offers ways to download SMILES data and train, as shown in the examples/example_zinc.py script. The documentation is incomplete and missing usage instructions, and its requirements may be outdated. It's licensed under the MIT license.

PromiseFru commented 11 months ago

2nd Model Suggestion

Automatic Detection of the Parasite Trypanosoma cruzi in Blood Smears Using a Machine Learning Approach Applied to Mobile Phone Images

Publication: National Library of Medicine

Published: 2022

Authors: Mauro Cรฉsar Cafundรณ Morais, Diogo Silva, Matheus Marques Milagre, Maykon Tavares de Oliveira, Thaรญs Pereira, Joรฃo Santana Silva, Luciano da F. Costa, Paola Minoprio, Roberto Marcondes Cesar Junior, Ricardo Gazzinelli, Marta de Lana, Helder I. Nakaya

Source Code: GitHub - Python

Dataset: Image Data Files

What This Model Does

This model helps find and treat Chagas disease faster. It also lowers the cost of Chagas disease detection, which is usually done with a high-resolution camera mounted on a microscope; this model instead detects the disease using images from a mobile phone. Here's how it works:

  1. Prepare blood smear samples.

  2. Take pictures of the blood smear with a mobile phone camera attached to a microscope eyepiece.

  3. The images are analyzed using a graph-based method, which separates the relevant parts of the image from the background. This step helps isolate the areas that might contain the T. cruzi (Trypanosoma cruzi) parasite. This process is called Image Segmentation.

  4. The model extracts various characteristics or features from the segmented image. These features could include things like the shape of the parasite, its color, texture, and more. This is called Feature Extraction.

  5. The model then selects the most relevant features from those extracted. This step helps improve the model's efficiency to detect the T. cruzi parasite. This process is called Feature Selection.

  6. The model then uses these features to make a decision about whether the T. cruzi parasite is present in the image or not. It does so by comparing the features to patterns it has learned during training.

Why This Matters to Ersilia

Chagas disease is a dangerous, infectious and neglected disease. This model makes it faster and cheaper to detect, which aligns with Ersilia's mission.

Code Implementation

The code is well documented and is licensed under the GNU General Public License v3.0. I was able to set it up and test it as follows:

Clone the repository

git clone https://github.com/csbl-br/chagas_detection.git

Create and activate a conda environment

conda create -n ChD python==3.9
conda activate ChD

Install the required packages

conda install --file requirements.txt

Extract features from all images

cd main
python process_all_images.py

This created an output directory and added CSV files containing the extracted features for the images in the ./images directory:

field0005.csv field0009.csv

Train a model

python feature_classification.py

This produced the following output and generated a graph showing the model's performance after training:

Model: SVC
Sensitivity: 0.6745
Specificity: 0.8024
Precision: 0.7774
Accuracy: 0.7377
F1-score: 0.7223
AUC: 0.7967

[[865 213]
 [359 744]]

Figure_1

PromiseFru commented 11 months ago

3rd Model Suggestion

Using Drug Descriptions and Molecular Structures for Drugโ€“Drug Interaction Extraction from Literature

Publication: Papers with Code

Published: 2020

Authors: Masaki Asada, Makoto Miwa, Yutaka Sasaki

Source Code: GitHub - Python

Dataset: Semantic Scholar, DDI

What This Model Does

This model helps scientists find better and cheaper drug combinations faster, making treatments more effective, reducing the risk of drug resistance and side effects, and speeding up patients' recovery. Here's how it works:

  1. The model takes biomedical text data as its input. This text contains information about drugs, their interactions, and related details. It also uses external drug database information from DrugBank, which contains structured information about various drugs, including their descriptions and molecular structures.

  2. The information from the drugs mentioned in the input text (target drugs) and the data from the drug database are used to create an improved drug text input (enriched input). The model then obtains their descriptions using SciBERT, a BERT model trained on large-scale biomedical and computer science text. These descriptions contain useful information about how the drugs are described in other biomedical literature.

  3. The model then obtains the molecular structure of the target drugs by using a molecular graph neural network (GNN) model. This representation captures the structural characteristics of the drugs.

  4. The model combines the enriched input, drug descriptions, and molecular structures. This combination helps the model understand how the drugs interact with each other.

  5. The model then determines whether the drug interactions create weaker or stronger treatments for the patients.
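The combination described in steps 2 to 5 can be illustrated with a toy sketch. This is not the authors' code: a hashed bag-of-words stands in for SciBERT embeddings, character counts stand in for the molecular GNN, and a fixed linear scorer stands in for the trained classifier.

```python
# Toy sketch of combining a text representation with molecular-structure
# representations for drug-drug interaction classification (illustrative only).

def text_embedding(sentence):
    # stand-in for SciBERT: hashed bag-of-words into a small fixed vector
    vec = [0.0] * 8
    for tok in sentence.lower().split():
        vec[hash(tok) % 8] += 1.0
    return vec

def mol_embedding(smiles):
    # stand-in for a molecular GNN: simple character counts
    return [float(smiles.count(ch)) for ch in "CNOS=()#"]

def predict_interaction(sentence, smiles_a, smiles_b, weights, bias=0.0):
    # concatenate the three representations and apply a linear scorer
    combined = (text_embedding(sentence)
                + mol_embedding(smiles_a)
                + mol_embedding(smiles_b))
    score = sum(w * x for w, x in zip(weights, combined)) + bias
    return score > 0  # True -> "interaction" class

weights = [0.1] * (8 + 8 + 8)  # a trained model would learn these
print(predict_interaction("drug A increases the effect of drug B",
                          "CCO", "CCN", weights))  # → True
```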

Why This Matters to Ersilia

This model's fast drug combination discovery and cost reduction can make affordable and effective treatments available in underdeveloped countries, in line with Ersilia's mission. It can also help reduce drug resistance and minimize side effects, leading to faster patient recovery.

Code Implementation

The code uses the MIT license. The requirements were not completely listed, and the source code may need updating. I was able to set it up and test it as follows:

Clone the repository

git clone https://github.com/tticoin/DESC_MOL-DDIE.git

Create and activate a conda environment

conda create -n DESC_MOL-DDIE python=3.7
conda activate DESC_MOL-DDIE

Install the required packages

pip install rdkit-pypi
pip install torch
pip install tensorboard
pip install six
pip install tqdm
pip install transformers

Preprocess the sample dataset

python fingerprint/preprocessor.py sample/tsv none 1 sample/radius1

This will create a radius1 directory in the sample directory and add three files: config.json, corpus_dev.npy, and corpus_train.npy.

Perform DDI Extraction

cd main
python run_ddie.py \
    --task_name MRPC \
    --model_type bert \
    --data_dir ../sample/tsv \
    --model_name_or_path SCIBERT_MODEL \
    --per_gpu_train_batch_size 32 \
    --num_train_epochs 3. \
    --dropout_prob .1 \
    --weight_decay .01 \
    --fp16 \
    --do_train \
    --do_eval \
    --do_lower_case \
    --max_seq_length 128 \
    --use_cnn \
    --conv_window_size 5 \
    --pos_emb_dim 10 \
    --activation gelu \
    --desc_conv_window_size 3 \
    --desc_conv_output_size 20 \
    --molecular_vector_size 50 \
    --gnn_layer_hidden 5 \
    --gnn_layer_output 1 \
    --gnn_mode sum \
    --gnn_activation gelu \
    --fingerprint_dir ../sample/radius1 \
    --output_dir output

However, there was an error:

Traceback (most recent call last):
  File "run_ddie.py", line 55, in <module>
    from transformers import AdamW, WarmupLinearSchedule
ImportError: cannot import name 'WarmupLinearSchedule' from 'transformers'

After researching the error, a GitHub thread indicates that WarmupLinearSchedule should be changed to get_linear_schedule_with_warmup.

GemmaTuron commented 10 months ago

Hello,

Thanks for your work during the Outreachy contribution period, we hope you enjoyed it! We will now close this issue while we work on the selection of interns. Thanks again!