PR checklist
[x] Tag the issue(s) or milestones this PR fixes (e.g. Fixes #123, Resolves #456).
[x] Describe the changes you've made.
[x] Describe any tests you have conducted to confirm that your changes behave as expected.
[x] If you've added new software dependencies, make sure that those dependencies are included in the appropriate conda environments.
PR Description
Addresses #8. Note that I had to add the start and end bp to the FASTA headers so that they stay unique across outputs from different tools, as detailed in the issue.
Addresses #11 by filtering out NONRIPP predictions.
Refactors the run_nlpprecursor.py script so that it's broken up into functions instead of one block of code that lives under main.
Tweaks scripts and names so they are consistent across the pipeline (e.g. the ID of the peptide is now always referred to as peptide_id).
Adds an output TSV for deeppeptide to rescue the peptide class information.
Outputs the peptipedia top BLAST match peptide sequence (which would be hard to recover otherwise).
Combines all of the annotation information into a single output TSV file with informative headers. Right now it only includes the cleavage peptides, but once we include sORFs and NRPS peptides, these will be added as well.
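For reviewers who want a picture of the first two items, here is a minimal Python sketch of the idea, not the code in this PR. The `Prediction` record, the function names, and the exact header layout (`<seq_id>_start<start>_end<end>_<tool>`) are all hypothetical; the only details taken from the description are that the start and end bp go into the header and that NONRIPP predictions are dropped.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    """Hypothetical record for one predicted peptide."""
    seq_id: str         # ID of the source sequence
    start: int          # start coordinate (bp)
    end: int            # end coordinate (bp)
    peptide_class: str  # class label reported by the tool
    sequence: str       # peptide sequence


def make_peptide_id(pred: Prediction, tool: str) -> str:
    # Including the coordinates keeps IDs unique when one source sequence
    # yields several peptides; the layout itself is an assumption.
    return f"{pred.seq_id}_start{pred.start}_end{pred.end}_{tool}"


def drop_nonripp(preds: list[Prediction]) -> list[Prediction]:
    # Keep everything except predictions labelled NONRIPP.
    return [p for p in preds if p.peptide_class != "NONRIPP"]


def to_fasta(preds: list[Prediction], tool: str) -> str:
    # One header line and one sequence line per prediction.
    return "".join(
        f">{make_peptide_id(p, tool)}\n{p.sequence}\n" for p in preds
    )
```

Calling `to_fasta(drop_nonripp(preds), "nlpprecursor")` would then write only the retained predictions, each under a coordinate-qualified header.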
Note that I chose to write the script that combines the different annotation files in R. I think tidyverse is easier to read than pandas for dataframe manipulation. I did try using ChatGPT to convert my R script into a Python script, but the conversion was not correct, so I stuck with the R script.
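To make the shape of the combination step concrete without touching the R script, here is a rough standard-library Python sketch of an outer join keyed on peptide_id. It is not a translation of the R script: the function names, the `NA` fill value, and the example column names in the usage note are assumptions made for illustration.

```python
import csv
import io


def read_tsv(text: str) -> list[dict[str, str]]:
    # Parse TSV text with a header row into a list of row dictionaries.
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def combine_on_peptide_id(*tables: list[dict[str, str]]):
    """Outer-join any number of tables on their 'peptide_id' column.

    Returns (columns, rows); cells with no value are filled with "NA".
    """
    combined: dict[str, dict[str, str]] = {}
    columns = ["peptide_id"]
    for rows in tables:
        for row in rows:
            pid = row["peptide_id"]
            merged = combined.setdefault(pid, {"peptide_id": pid})
            for key, value in row.items():
                if key not in columns:
                    columns.append(key)  # remember first-seen column order
                merged[key] = value
    filled = [{c: r.get(c, "NA") for c in columns} for r in combined.values()]
    return columns, filled
```

For example, joining a table with a `peptide_class` column to one with a `blast_match_sequence` column (both names hypothetical) gives one row per peptide_id, with `NA` wherever a tool reported nothing for that peptide.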
Testing
I tested the updates on the demo data, and everything runs as expected.
Documentation
Punting on this again :)