Closed: taylorreiter closed this 8 months ago
- should the models that you received from the autopeptideml authors be included in the repo (assuming they're not too large)? alternatively, we could host the models somewhere else (S3 or another github repo) and then download them in the snakefile.
The model folders are ~400 MB, so I didn't upload them here. My hope is that the person who shared them with me will make them available for download soon, and I'll incorporate them into the pipeline with a download link then. My plan is to punt on putting them anywhere until this happens, but I put some comments into the snakefile as reminders to do that. If the authors don't make them available for download soon, I'll put them on OSF for download.
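As a placeholder, the download rule could look something like this; the URL, paths, and archive layout here are all hypothetical until the models are actually hosted somewhere:

```snakemake
rule download_autopeptideml_models:
    """Fetch a pretrained AutoPeptideML model folder.
    The URL below is a placeholder; swap in the real link once the
    authors (or an OSF project) host the models."""
    output: directory("inputs/models/autopeptideml/{model_name}")
    params:
        url = "https://osf.io/PLACEHOLDER/{model_name}.tar.gz"
    shell: """
        curl -JL -o {output}.tar.gz {params.url}
        mkdir -p {output}
        tar -xzf {output}.tar.gz -C {output}
    """
```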
- the use of autopeptideml here is a bit inefficient because it re-generates the ESM embeddings for each of the 12 named models. For now this is probably okay, but it may be worth optimizing if the dataset of combined peptide predictions that are input to autopeptideml becomes large (I would guess larger than ~10,000 sequences).
This is a really good point. I'll make an issue for this. It might be worth just running all twelve models in the same script (I think it probably is, but I'll make an issue and think on it more!)
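The shape of the fix would be to compute the ESM embeddings once and reuse them for all twelve models. A minimal sketch of that pattern; `embed_fn` and the model objects are stand-ins here, not the real AutoPeptideML API:

```python
def embed_once_predict_many(sequences, embed_fn, models):
    """Run the expensive embedding step once, then score every model.

    `embed_fn` stands in for the ESM embedder and `models` (a dict of
    name -> classifier) for the twelve AutoPeptideML ensembles; neither
    name comes from the real API.
    """
    X = embed_fn(sequences)  # expensive GPU step, computed a single time
    return {name: model.predict(X) for name, model in models.items()}
```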
- we should look into the implications of snakemake parallelizing processes that use the GPU (in this case, all of the autopeptideml models). I assume that this is handled in a sensible way at the level of CUDA or the GPU itself, but I'm not sure.
Also a great point, I'll add it to the issue.
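For the issue: one snakemake-level way to keep GPU jobs from piling up is a user-defined resource, so at most one GPU rule runs at a time. A sketch (the input/output paths are placeholders):

```snakemake
rule autopeptideml:
    input: "outputs/peptides/combined.faa"
    output: "outputs/autopeptideml/{model_name}.csv"
    resources:
        gpu = 1  # each job claims the (single) GPU
    shell: "..."
```

Invoked with `snakemake --resources gpu=1`, the scheduler will never run two `gpu = 1` jobs concurrently, regardless of how many cores are available.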
PR checklist

- conda environments.

PR Description
This PR adds a rule to run the binary classifier AutoPeptideML. I chose to use the models that the authors trained in their preprint; however, as noted in a docstring, we could instead use the labels in the peptipedia database, train new models in a separate snakefile (like the nrps one), and then make them available for download. I prefer using the models they built in their preprint because they and other experts put thought into the labels and use cases.
The models were supplied to me by the author of the paper via email. They said they are working on a solution to make them available/downloadable, so I added a TODO item to a rule to download them when I can.
The output of the script looks like this (first few lines), where the `AB` column is the name of the model and the value is the prediction of that bioactivity.

Testing
The changes run successfully on the demo data set and I confirmed that pytorch can find the GPU in the snakemake-built conda environment.
Documentation
punt again...but getting very close to actually doing this!
next PR
My next PR will clean up some of the issues with peptide header names and collect all of the annotation information produced since the peptipedia PR.
Update
I'm working on a summary script to put together all of the annotation data, which I'm in part hoping to use to determine whether a peptide is real or not. As part of this, I was looking at the autopeptideml predictions, and they look something like this:
This feels somewhat concerning: there are so many predictions that it's certainly overpredicting. Since this isn't a labelled dataset (it's just the first 200 rows of transcripts from the Amblyomma transcriptome), we don't know the ground truth here. However, imagine being presented with this information...what do you do with it?? I was sort of hoping that there wouldn't be quite so much overprediction, and that we could use it as a filter for peptides that are more likely to be real. I don't think we can do that now, but I do think this is still worth including.
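One way to put a number on the overprediction is the fraction of peptides each model calls positive. A minimal sketch, assuming the predictions can be reshaped into one 0/1 vector per model (column names like `AB` are taken from the example output above):

```python
def positive_rate_per_model(predictions):
    """Fraction of peptides each bioactivity model calls positive.

    `predictions` maps a model name (e.g. "AB") to a list of 0/1 calls;
    this shape is an assumption based on the example output, not the
    script's actual data structure.
    """
    return {
        name: sum(calls) / len(calls)
        for name, calls in predictions.items()
    }
```

A model whose rate is near 1.0 on unlabelled transcriptome fragments is almost certainly overcalling, which would make it useless as a "is this peptide real?" filter on its own.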
I'm going to start an issue on thinking through how to filter down to peptides that are potentially real.