ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

✍️ Contribution period: Girisha Sahdev #622

Closed girishatechie closed 1 year ago

girishatechie commented 1 year ago

Week 1 - Get to know the community

Week 2 - Install and run an ML model

Week 3 - Propose new models

Week 4 - Prepare your final application

girishatechie commented 1 year ago

Hi @GemmaTuron ! I've filled in the Model Suggestion Form! Sure, I am on it: I'll try to implement the pretrained Covid model and give updates on the inputs and the output! Thank you so much!

girishatechie commented 1 year ago

Hi @GemmaTuron !

I tried implementing the Model.

I cloned the repository and manually downloaded the dataset files (.tar) for both the DrugBank and KIBA datasets. The dependencies include:

    pytorch==1.8.1
    pytorch-geometric==1.7.1
    pytorch-lightning==1.3.6

This model can only be tested with a PyTorch-Geometric version below 2.0.0. I installed all the dependencies with Conda, inside the server directory of DeepDrug itself. The processed input datasets can be easily downloaded from here
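
The version constraint above (PyTorch-Geometric below 2.0.0) can be checked before running anything. This is only a minimal sketch: the helper names `parse_version` and `is_supported_pyg` are made up for illustration and are not part of DeepDrug or PyTorch-Geometric.

```python
def parse_version(version: str) -> tuple:
    """Turn '1.7.1' or '1.8.1+cpu' into a tuple of ints, ignoring any '+local' suffix."""
    public = version.split("+", 1)[0]
    return tuple(int(part) for part in public.split(".") if part.isdigit())


def is_supported_pyg(version: str) -> bool:
    """Return True when the given PyTorch-Geometric version is below 2.0.0."""
    return parse_version(version) < (2, 0, 0)


print(is_supported_pyg("1.7.1"))  # True
print(is_supported_pyg("2.0.4"))  # False
```

In a real environment the version string would come from the installed package rather than from a literal.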

For each input (drug or protein), we need to use the sequence data as well as the partially available structure profile data as separate input branches to the DeepDrug model. DeepChem was used to convert drug SMILES strings into graph representations in the form of feature matrices and adjacency matrices. For drugs, as given in the repository, the structural features are constructed by:

    import pandas as pd
    from dataset import EntryDataset

    # Table of drugs to process
    drug_df = pd.read_csv('drug.csv')
    # Folder where the drug graphs are saved
    save_folder = '/path/to/drug/graph/'
    dataset = EntryDataset(save_folder)
    dataset.drug_process(drug_df)
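
To make the feature-matrix and adjacency-matrix representation mentioned above concrete, here is a toy sketch in plain Python. It does not use DeepChem or the DeepDrug code: the atom and bond lists for ethanol (SMILES "CCO") are written by hand, and the three-element vocabulary is an assumption chosen only to keep the example small.

```python
# Hypothetical hand-written molecule: ethanol, SMILES "CCO".
atoms = ["C", "C", "O"]    # heavy atoms in SMILES order
bonds = [(0, 1), (1, 2)]   # undirected bonds as pairs of atom indices

ELEMENTS = ["C", "N", "O"]  # toy vocabulary, an assumption for this sketch


def feature_matrix(atoms):
    """One-hot encode each atom's element: one row per atom."""
    return [[1 if atom == element else 0 for element in ELEMENTS] for atom in atoms]


def adjacency_matrix(n_atoms, bonds):
    """Symmetric 0/1 matrix with a 1 wherever two atoms share a bond."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


X = feature_matrix(atoms)
A = adjacency_matrix(len(atoms), bonds)
print(X)  # [[1, 0, 0], [1, 0, 0], [0, 0, 1]]
print(A)  # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

Real featurizers encode many more atom properties per row, but the shape of the data (one feature row per atom plus a square adjacency matrix) is the same idea.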

For proteins, the structural features are constructed by the PAIRPred software.

For the Covid model, two drug-target positive datasets for SARS-CoV-2 are used. They provide a literature-based and an expert-confirmed list of drugs and target proteins for SARS-CoV-2, with 42 and 34 pairs respectively (the positive datasets, i.e., "pos.expert" and "pos.literature"). Alongside these, random pairs of the same drugs and SARS-CoV-2 proteins are used as the negative dataset ("neg.drugbank").

Two approaches are used to construct the graph features of SARS-CoV-2 proteins. First, the simulated structures of SARS-CoV-2 proteins are provided in the SARS-CoV-2 3D database. Second, for each SARS-CoV-2 protein, the most similar templates in the RCSB database are used as the crystal structure of the protein; these are also provided in the SARS-CoV-2 3D database.

For the output, the performance of DeepDrug in each cross-validation fold is shown and the affinity scores are calculated. The five-fold cross-validation pretrained models are used to make binding affinity predictions, and the final prediction for each drug-target pair is obtained by taking the mean and the maximum of the predictions of these 5 models. The results showed that DeepDrug was able to distinguish expert-confirmed positive pairs from negative pairs under both the mean and the maximum strategies. The output therefore takes the form of final affinity prediction scores (mean as well as maximum), indicating that DeepDrug is able to correctly identify the interactions of SARS-CoV-2 proteins. DeepDrug assigned higher prediction scores to the interacting pairs than to the non-interacting pairs. There were some outliers with very high affinity among the predictions for the (randomly formed) negative pairs, which could be valid potential drugs.
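
The mean and maximum strategies over the five fold models can be sketched as follows. The scores and the drug/protein names below are invented purely for illustration; they are not DeepDrug results, and the `aggregate` helper is hypothetical.

```python
from statistics import mean

# Hypothetical scores: one list per drug-target pair, one value per fold model.
fold_predictions = {
    ("drug_A", "protein_X"): [0.5, 0.75, 1.0, 0.25, 0.5],
    ("drug_B", "protein_X"): [0.25, 0.25, 0.5, 0.0, 0.25],
}


def aggregate(scores):
    """Combine the per-fold predictions of one pair using both strategies."""
    return {"mean": mean(scores), "max": max(scores)}


final_scores = {pair: aggregate(scores) for pair, scores in fold_predictions.items()}

# Rank pairs by mean score, highest first, to surface the strongest candidates.
ranked = sorted(final_scores, key=lambda pair: final_scores[pair]["mean"], reverse=True)
for pair in ranked:
    print(pair, final_scores[pair])
```

Keeping both numbers per pair makes it easy to see outliers: a pair with a modest mean but a very high maximum was scored highly by only some of the fold models.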

Hence, based on the affinity prediction scores given as outputs for the drug-target pairs, which are constructed from the processed datasets given as input, DeepDrug helps identify potentially valid drugs (from among the highest-scoring drug-target pairs) against SARS-CoV-2 infection, which can be effective for treating COVID patients and are strongly activated by the SARS-CoV-2 infection.

For the actual implementation of this model, I changed directory to the DeepDrug folder, in the same conda environment where I had installed the PyTorch dependencies, and ran this command:

    python deepdrug.py --configfile ./config/KIBA.regression.yml

The pretrained Covid model is a bit tough to implement, mainly because of the dependencies' version requirements and machine constraints (it requires a GPU). The inputs don't seem to be a major problem. Overall, it is implementable and not that complex; it took me a few hours to properly understand the workflow and the inputs/outputs of the model.

girishatechie commented 1 year ago

Hi @GemmaTuron ! I was just going through some more articles on bioRxiv, and I have another model suggestion. It seems to be easily implementable and might be quite relevant to Ersilia's mission. It would be great if you could take a look at it! I can try implementing this model as well, to check the results. Thank you!

Here it is:

MODEL

LEP-AD: Language Embedding of Proteins and Attention to Drugs predicts Drug-Target interactions

SLUG

lep-ad-dti

PUBLICATION

https://www.biorxiv.org/content/10.1101/2023.03.14.532563v1.full

SOURCE CODE

https://github.com/adaga06/LEP-AD

DESCRIPTION

LEP-AD: Language Embedding of Proteins and Attention to Drugs predicts drug-target interactions. It combines pre-trained ESM-2 and Transformer-GCN models, predicting binding affinity values.

SUMMARY

The prediction of binding affinity, or the binding strength between a drug and its target in the body, is a crucial aspect of drug discovery and development. Accurate binding affinity predictions can help identify the most promising drug candidates, optimise the design of new drugs, and reduce the cost and time of drug development. LEP-AD is a Transformer protein language model for drug-target interaction predictions that outperforms state-of-the-art methods. It combines pre-trained ESM-2 and Transformer-GCN models, predicting binding affinity values. It is a pre-trained model and scales favourably in performance with the size of the training data. LEP-AD can serve as a valuable tool for drug discovery and development, providing insights into the molecular mechanisms of drug-target interactions and guiding the selection of drug candidates for clinical trials. It combines a deep latent embedding of proteins using a language model with a graph-based representation of drugs with attention, as computed in a Transformer model. The output is a predicted binding score.

TAGS

ToxCast

TASK

Regression

GemmaTuron commented 1 year ago

Hi @girishatechie !

Thanks for the work, here are a few comments:

GemmaTuron commented 1 year ago

@girishatechie !

While you work on your application, can we continue working on this issue https://github.com/ersilia-os/ersilia/issues/368 -- can you write there what is the status and what are the persisting errors?

thanks

girishatechie commented 1 year ago

Hi @girishatechie !

Thanks for the work, here are a few comments:

  • The SARS-CoV-2 model looks very relevant, but at the moment the Ersilia Model Hub cannot accept inputs other than SMILES or text. Genomic sequences are therefore still out of scope, so we will have to leave it out temporarily.
  • The protein-drug interaction model looks very interesting. I see it's still very much in development (latest commits from Feb), and I don't like that the repo has very few details, but I'll keep an eye on it!

Noted! @GemmaTuron Thank you so much for the feedback! :)

girishatechie commented 1 year ago

Sure, I am on it! @GemmaTuron Thank you!

girishatechie commented 1 year ago

Hi @GemmaTuron ! I've re-tested the model eos9be7 using both the Ersilia CLI and Colab, and I have reported the status as well as the errors in issue #368.

Please check. Thank you so much!

GemmaTuron commented 1 year ago

Thanks @girishatechie, I have answered there! While you prepare your final application, I have one last question for you. Some models have specified the +cpu version of tensorflow to run on Windows and Linux, but this seems to be a problem for Mac users. Can you confirm whether, for example, the model eos93h2 has this problem? If it does, could you try to clone the repo, change the version of torch, and fetch the model using the --repo_path flag, so that it gets the model from your modified local version?

girishatechie commented 1 year ago

Hi @GemmaTuron ! Sure, I am on it! Thank you so much!

girishatechie commented 1 year ago

Hi @GemmaTuron !

Update on the above question:

The model eos93h2 does not have a TensorFlow issue. The original model does require a GPU environment on the local machine in order to install CUDA 10.1, which might be incompatible with some Macs, but this isn't a problem for the Ersilia implementation of eos93h2. The main issue is that it requires the +cpu version of PyTorch, which, going by the official PyTorch documentation, is a problem to install on Macs. As I understand it, PyTorch on Macs comes without the GPU (CUDA) support that the separate builds are based on, and that's why installing the +cpu version is a problem. However, other versions of PyTorch can be installed on Macs using pip/conda, and I have them installed on my Mac, along with torchvision and torch-geometric.

To get the model from my modified local version, as per your instructions, I cloned the repository of the eos93h2 model. I navigated to the eos93h2 folder on my machine and changed the PyTorch version there to make it compatible with the version on my local machine. The model's Dockerfile listed the +cpu version of PyTorch as a dependency, so I edited it so that the pip command installs a version compatible with my machine's local PyTorch. Afterwards, I fetched the model in verbose mode using the --repo_path flag, passing the folder's path on my machine as the argument. Going by the log file, the edited pip command ran successfully and no longer gives a PyTorch compatibility error. However, it gives a different error. I initially tried many different steps to resolve it, and I now have much more clarity on why it occurs.
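
The Dockerfile edit described above amounts to dropping the "+cpu" local version label from a pinned requirement. A minimal sketch of that rewrite is shown here; the helper `strip_local_label` is hypothetical, and the pinned version in the usage line is only an example, since the actual Dockerfile contents are not shown in this thread.

```python
import re

# Matches a pinned requirement carrying a local version label, e.g. "torch==1.8.1+cpu".
LOCAL_LABEL = re.compile(r"(==\s*[0-9][0-9A-Za-z.]*)\+[0-9A-Za-z.]+")


def strip_local_label(line: str) -> str:
    """Remove '+cpu'-style local version labels from pinned requirements in a line."""
    return LOCAL_LABEL.sub(r"\1", line)


# Hypothetical Dockerfile line; the version number is an assumption for illustration.
print(strip_local_label("RUN pip install torch==1.8.1+cpu"))
# RUN pip install torch==1.8.1
```

Whether pip can then find a matching wheel still depends on the platform and Python version, so this only removes the label; it does not guarantee that the install succeeds.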

The Error:


Error message:

expected str, bytes or os.PathLike object, not NoneType


I went through the log file to find the source of this error. It occurs on line 9 while executing the pack.py file (in the eos93h2 folder). This is the line:

os.path.join(root, "model", CHECKPOINTS_BASEDIR),

This error generally occurs when an unsupported type is passed to os.path.join(), which only accepts strings, bytes, or objects that follow the os.PathLike protocol. As per the error, a NoneType argument is being passed. Since "model" is a string literal, the None value must be coming from root or CHECKPOINTS_BASEDIR. I tried to rectify it by going through a few GitHub issues and discussions on the internet: I converted the argument to str in the pack.py file and saved it. I also tried using os.fspath(), so that the argument would be accepted, but fetching the model with the --repo_path flag still gave the same error.
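
The failure mode discussed above can be reproduced in isolation, which also suggests a clearer guard. This is a sketch only: `checkpoints_dir` and its parameter names are hypothetical and are not taken from pack.py.

```python
import os
from typing import Optional


def checkpoints_dir(root: Optional[str], basedir: Optional[str]) -> str:
    """Join the checkpoint path, failing early with a readable message."""
    if root is None:
        raise ValueError("root is None: the model folder path was never set")
    if basedir is None:
        raise ValueError("basedir is None: the checkpoints folder name was never set")
    return os.path.join(root, "model", basedir)


# Reproduce the failure mode: a None argument makes os.path.join raise TypeError.
try:
    os.path.join(None, "model", "checkpoints")
except TypeError as error:
    print(type(error).__name__)  # TypeError
```

Note that wrapping the value in str() does not fix the underlying problem: str(None) is the string "None", so the join would silently point at a folder literally named "None". os.fspath(None) raises the same TypeError, so it does not help either. The useful step is finding out why the value is None in the first place.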

Since this indicates some discrepancy with the model checkpoints folder, I downloaded the checkpoints (.pth files) manually, placed them in the directory, and then tried to fetch the model using the --repo_path flag, but it still gave the same error. These checkpoints are stored with Git LFS, so when I fetch this particular model without the --repo_path flag, it uses Git LFS to clone them and shows a message indicating some discrepancy with the model checkpoint files.

What I have inferred from this: I think this happens because the model checkpoints are downloaded as .pth files, which are created and trained with specific versions of PyTorch. So even after using the --repo_path flag and changing the PyTorch version to match my local one, the model cannot be fetched properly. According to the error message, these files are read as NoneType, because the PyTorch version used to create them is incompatible with my local version, and no matching distribution can be found on my machine. This might only be an issue for models that require the +cpu version of PyTorch. That is why I have not been able to resolve this error.

GemmaTuron commented 1 year ago

Thanks @girishatechie for the detailed explanation. So, for those models requiring the +cpu flag in the PyTorch version, we are not able to find a specific version that works on Mac... that's a pity. We'll need to move to Docker containers, which hopefully will make models portable between operating systems.

girishatechie commented 1 year ago

Thank you, @GemmaTuron! Yes, using Docker containers should certainly solve this problem and make such models portable across all operating systems!

GemmaTuron commented 1 year ago

Yes, we are in the process of making the move, see #546

girishatechie commented 1 year ago

I just went through issue #546 thoroughly. It's really nice and inspiring that this is underway and that a good amount of progress has already been made towards containerising all the models and storing them on DockerHub! :)