Arcadia-Science / peptigate

Peptigate ("peptide" + "investigate") predicts bioactive peptides from transcriptome assemblies or sets of proteins.
MIT License

Add rule to classify peptide bioactivity with the autopeptideml tool #10

Closed: taylorreiter closed this 8 months ago

taylorreiter commented 8 months ago

PR checklist

PR Description

This PR adds a rule to run the binary classifier AutoPeptideML. I chose to use the models that the authors trained in their preprint; however, as noted in a docstring, we could instead use the labels in the peptipedia database, train new models in a separate snakefile (like the nrps one), and then make them available for download. I prefer using the models they built in their preprint because they and other experts put thought into the labels and use cases.

The models were supplied to me by the author of the paper via email. They said they are working on a solution to make them available for download, so I added a TODO item to the rule to download them once that happens.

The output of the script looks like this (first few lines), where the AB column is the name of the model and the value is that model's prediction for the corresponding bioactivity.

```
ID                                                         sequence                                                  AB
Transcript_1000626.p1_NONRIPP_49_105_nlpprecursor          YYSGLVTDSRNMQGTVIKRKRQVKRCLAKVRTNKCVCLCQQRIVLQRCAATTFPSL  0.6666666666666666
Transcript_0.p1_CLASS_I_LANTIPEPTIDE_134_180_nlpprecursor  HLRTHTGECPYKCDHCDSSFFEKGNLKQHPCTHTGERPYKCDHCDS            0.3333333333333333
Transcript_100036.p2_NONRIPP_55_96_nlpprecursor            RSVAEGTTLTPWKERKKAAAIVFASKRFPHLSAHSFLLPPP                 0.3333333333333333
```
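For downstream use, the per-model TSV can be parsed with just the standard library. A minimal sketch (the in-memory sample and the 0.5 majority-vote cutoff are my own assumptions, not part of the pipeline):

```python
import csv
import io

# Inlined sample of the rule's TSV output shown above; in practice this
# would be read from the file the rule writes.
sample = (
    "ID\tsequence\tAB\n"
    "Transcript_1000626.p1_NONRIPP_49_105_nlpprecursor\t"
    "YYSGLVTDSRNMQGTVIKRKRQVKRCLAKVRTNKCVCLCQQRIVLQRCAATTFPSL\t0.6666666666666666\n"
    "Transcript_0.p1_CLASS_I_LANTIPEPTIDE_134_180_nlpprecursor\t"
    "HLRTHTGECPYKCDHCDSSFFEKGNLKQHPCTHTGERPYKCDHCDS\t0.3333333333333333\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
# Flag a peptide as a predicted antimicrobial (AB) hit when the score
# clears an assumed 0.5 threshold.
ab_hits = [row["ID"] for row in rows if float(row["AB"]) >= 0.5]
```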

Testing

The changes run successfully on the demo data set and I confirmed that pytorch can find the GPU in the snakemake-built conda environment.
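The GPU check mentioned above can be reproduced inside the environment with PyTorch's own query (guarded here so the snippet also runs where torch isn't installed):

```python
# Confirm that PyTorch can see a CUDA device in the conda environment.
try:
    import torch
    gpu_available = torch.cuda.is_available()
    device = "cuda" if gpu_available else "cpu"
except ImportError:  # torch not installed in this environment
    gpu_available, device = None, "cpu"

print("GPU available:", gpu_available, "-> using device:", device)
```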

Documentation

punt again...but getting very close to actually doing this!

next PR

My next PR will clean up some of the issues with peptide header names and collect all of the annotation information produced since the peptipedia PR.

Update

I'm working on a summary script to put together all of the annotation data, which I'm in part hoping to use to determine whether a peptide is real or not. As part of this, I was looking at the autopeptideml predictions, and they look something like this:

```
ID sequence AB ACE ACP AF AMAP AMP AOX APP AV BBP DPPIV MRSA Neuro QS TOX TTCA total
Transcript_1000463.p1_start95_end131 HNLIAESTIGAALAVMEAMQTTYAVRGKLVVLGTPA 0.33 0.33 0.66 0 0 0.66 1 0.33 1 1 0.33 0 1 0.33 1 0.66 8.67
Transcript_100028.p1_start77_end112 LRGQSLGSVAFLDTASAYPLVDSTAGLHVSAIAPV 0 0.33 0.33 0 0 0.33 1 0 1 1 0.33 0 1 0.33 0.66 1 7.33
Transcript_1001336.p1_start33_end79 GEVGETEDLEVLASFRVSSYLVSPVIAEDSFHVTSQATSLGAAATR 0 0.66 0 0 0.33 0.33 1 0 1 1 0.33 0 1 0 1 0.66 7.33
Transcript_1000535.p1_start68_end92 MFSSNRGTVPVSLDMPFQVVRQVD 0 0.66 0 0 0 0.33 1 0 0.66 0.66 0.66 0 1 0.33 0.66 1 7
Transcript_1000655.p1_start55_end108 SYVRKLCFPEGNPVLDVEDLKHGGHYVALLPHESFKKPSSKIPNNYMRTYETL 0 0.66 0 0 0 0 1 0.33 0.66 1 0.66 0 1 0.66 0.66 0.33 7
Transcript_1.p1_start84_end120 DHIRIHTGEKPYHCHLCPMAFAQNSGLYHHLRRHKN 0.33 0 0 1 0 1 1 1 0.33 1 0 0 0 0 1 0 6
```

This feels somewhat concerning: there are so many positive predictions that the models are certainly overpredicting. Since this isn't a labelled dataset (it's just the first 200 rows of transcripts from the Amblyomma transcriptome), we don't know the ground truth here. However, imagine being presented with this information...what do you do with it?? I was hoping there wouldn't be quite so much overprediction, so that we could use this information as a filter for peptides that are more likely to be real. I don't think we can do that now, but I do think this is still worth including.
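One possible stricter reading of these scores, sketched below on two rows from the table above: only count a model's call when the ensemble is unanimous (score == 1.0), then rank peptides by how many models agree. The unanimity criterion is my own assumption, not something validated in this PR:

```python
import csv
import io

MODELS = ["AB", "ACE", "ACP", "AF", "AMAP", "AMP", "AOX", "APP",
          "AV", "BBP", "DPPIV", "MRSA", "Neuro", "QS", "TOX", "TTCA"]

# Two rows from the combined prediction table, tab-separated for parsing.
sample = (
    "ID\tsequence\t" + "\t".join(MODELS) + "\ttotal\n"
    "Transcript_1000463.p1_start95_end131\tHNLIAESTIGAALAVMEAMQTTYAVRGKLVVLGTPA\t"
    "0.33\t0.33\t0.66\t0\t0\t0.66\t1\t0.33\t1\t1\t0.33\t0\t1\t0.33\t1\t0.66\t8.67\n"
    "Transcript_1.p1_start84_end120\tDHIRIHTGEKPYHCHLCPMAFAQNSGLYHHLRRHKN\t"
    "0.33\t0\t0\t1\t0\t1\t1\t1\t0.33\t1\t0\t0\t0\t0\t1\t0\t6\n"
)

unanimous = {}
for row in csv.DictReader(io.StringIO(sample), delimiter="\t"):
    # Keep only bioactivity calls where every ensemble member voted yes.
    unanimous[row["ID"]] = [m for m in MODELS if float(row[m]) == 1.0]
```

Even under unanimity these two peptides still get five and six positive calls, which supports the point that the raw scores alone can't serve as a "is this peptide real" filter.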

I'm going to start an issue on thinking through how to filter down to peptides that are potentially real.

taylorreiter commented 8 months ago
  • should the models that you received from the autopeptideml authors be included in the repo (assuming they're not too large)? alternatively, we could host the models somewhere else (S3 or another github repo) and then download them in the snakefile.

The model folders are ~400 MB, so I didn't upload them here. My hope is that the person who shared them with me will make them available for download soon, and I'll incorporate a download link into the pipeline then. My plan is to punt on putting them anywhere until this happens, but I put some comments into the snakefile as reminders to do that. If the authors don't make them available for download soon, I'll put them on OSF for download.

  • the use of autopeptideml here is a bit inefficient because it re-generates the ESM embeddings for each of the 12 named models. For now this is probably okay, but it may be worth optimizing if the dataset of combined peptide predictions that are input to autopeptideml becomes large (I would guess larger than ~10,000 sequences).

This is a really good point. I'll make an issue for this. It might be worth running all twelve models in the same script (I think it probably is, but I'll make an issue and think on it more!)
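The embed-once idea can be illustrated with stand-in functions (the real ESM embedding step and the AutoPeptideML model interfaces differ; every name here is hypothetical, this only shows the caching pattern):

```python
# Cache embeddings so the expensive step runs once per sequence, no matter
# how many bioactivity models consume them.
embed_calls = {"count": 0}
_cache = {}

def fake_esm_embed(sequence):
    """Stand-in for the GPU-bound ESM embedding forward pass."""
    embed_calls["count"] += 1
    return [float(ord(aa)) for aa in sequence]  # toy embedding vector

def embed_cached(sequence):
    if sequence not in _cache:
        _cache[sequence] = fake_esm_embed(sequence)
    return _cache[sequence]

def make_fake_model(name):
    """Stand-in for one trained AutoPeptideML classifier head."""
    def predict(embedding):
        return round(sum(embedding) % 3 / 3, 2)  # placeholder score
    return predict

models = {name: make_fake_model(name) for name in ["AB", "ACE", "ACP"]}
sequences = ["MFSSNRGTVPVSLDMPFQVVRQVD", "DHIRIHTGEKPYHC"]
scores = {
    seq: {name: model(embed_cached(seq)) for name, model in models.items()}
    for seq in sequences
}
```

Despite three models scoring two sequences (six predictions), the embedding function runs only twice, once per sequence; with the real 12 models that would cut the ESM work roughly twelvefold.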

  • we should look into the implications of snakemake parallelizing processes that use the GPU (in this case, all of the autopeptideml models). I assume that this is handled in a sensible way at the level of CUDA or the GPU itself, but I'm not sure.

Also a great point, I'll add it to the issue.
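For the GPU-contention question, one belt-and-suspenders option (independent of whatever CUDA does on its own) is Snakemake's user-defined resources, which serialize rules that claim the same resource. The rule, paths, and script name below are placeholders, not the ones in this PR:

```
rule autopeptideml:
    input:
        "outputs/peptides.faa",
    output:
        "outputs/autopeptideml/{model}.tsv",
    resources:
        gpu=1,  # arbitrary resource name; any rule claiming it shares the pool
    shell:
        "python scripts/run_autopeptideml.py {input} {wildcards.model} {output}"
```

Invoking Snakemake with `--resources gpu=1` then caps concurrently running GPU-claiming rules at one, while CPU-only rules still parallelize normally.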