Arcadia-Science / peptigate

Peptigate ("peptide" + "investigate") predicts bioactive peptides from transcriptome assemblies or sets of proteins.
MIT License

Output nucleotide sequences for peptide predictions #31

Closed taylorreiter closed 7 months ago

taylorreiter commented 7 months ago

PR checklist

PR Description

This PR addresses

Outputting nucleotide sequences

This is the bulk of the PR. I wanted the nucleotide sequences so we can calculate things like dN/dS and compare our predicted sequences in other ways. I had to make a lot of changes to do this, but I think the code base is now more consistent, the output files are better named and organized, and the variable pointers in the Snakemake workflow are clearer.

Put all outputs in their own predictions folder

I think it will be easier for users to find everything they need if it's all in one place instead of buried in annotations. I also added more output files and left notes in places where we could add others if the need arises.

Variable and file naming updates

Note that the file naming scheme is somewhat consistent with convention (e.g., see here).

In summary --

Tests

I ran the following code and visually inspected the results to confirm they are correct:

conda activate sandbox # only requires biopython
cat demo/contigs_* > demo/tmp.fa
python scripts/extract_plmutils_nucleotide_sequences.py -n demo/tmp.fa -p outputs/demo/sORF/plmutils/peptides.faa -o tmp.fna

# only requires biopython
conda activate sandbox
python scripts/extract_deeppeptide_sequences.py outputs/demo/cleavage/deeppeptide/peptide_predictions.json demo/orfs_amino_acids.faa demo/orfs_nucleotides.fa tmp.faa tmp.fna tmp_pep.faa tmp_pep.fna tmp_predictions.tsv

# requires nlpprecursor & biopython etc.
conda activate .snakemake/conda/c111fa2b8a9cfe0d0c6028d0ebe9b492_
python scripts/run_nlpprecursor.py inputs/models/nlpprecursor/models/ outputs/demo/cleavage/preprocessing/noasterisk_nononstandardaa.faa demo/orfs_nucleotides.fa tmp.faa tmp.fna tmp_pep.faa tmp_pep.fna tmp_predictions.tsv

It would be good to have some automated testing set up, but that's another day's project. It will be worth the time sink if our initial results for peptigate are promising.
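As a starting point for that automated testing, here is a minimal sketch of one check that could replace part of the visual inspection. It assumes (this is not confirmed anywhere in the PR) that each peptide's nucleotide record is the in-frame coding sequence of the amino acid record with the same ID, and that the standard genetic code applies. The names `translate` and `mismatched_ids` are hypothetical and do not exist in the repository.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product("TCAG", repeat=3), _AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate an in-frame DNA/RNA sequence; unrecognized codons become 'X'."""
    seq = nucleotides.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def mismatched_ids(peptides, nucleotides):
    """Return IDs whose nucleotide sequence is missing or does not translate
    to the peptide (a trailing stop codon is ignored on both sides).

    Both arguments are hypothetical dicts mapping record ID -> sequence string.
    """
    bad = []
    for seq_id, peptide in peptides.items():
        nt = nucleotides.get(seq_id)
        if nt is None or translate(nt).rstrip("*") != peptide.rstrip("*"):
            bad.append(seq_id)
    return bad
```

The two dicts could be filled from the `tmp_pep.faa` and `tmp_pep.fna` files produced above, for example with Biopython's `SeqIO.parse`. An empty list from `mismatched_ids` would mean every peptide has a matching nucleotide sequence under the assumptions stated here.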

In testing this, I noticed that I had taken the same number of lines from each of my big input files to make the demo files. That truncated the nucleotide ORFs, so there were fewer nucleotide sequences than protein sequences and the sequences didn't all match. It's fixed now!
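A mismatch like this one could be caught with a quick comparison of the record IDs in the two demo files. The sketch below uses only the standard library and assumes that a protein ORF and its nucleotide ORF share the same ID (the first whitespace-delimited token of the header line). That is an assumption about the demo files, not something this PR states, and the function names `fasta_ids` and `compare_ids` are hypothetical.

```python
def fasta_ids(path):
    """Return the set of record IDs in a FASTA file.

    The ID is taken as the first whitespace-delimited token after '>'.
    """
    with open(path) as handle:
        return {
            line[1:].split()[0]
            for line in handle
            if line.startswith(">") and line[1:].strip()
        }


def compare_ids(protein_path, nucleotide_path):
    """Return (IDs only in the protein file, IDs only in the nucleotide file)."""
    proteins = fasta_ids(protein_path)
    nucleotides = fasta_ids(nucleotide_path)
    return sorted(proteins - nucleotides), sorted(nucleotides - proteins)
```

For the demo data the call would look like `compare_ids("demo/orfs_amino_acids.faa", "demo/orfs_nucleotides.fa")`. Two empty lists would indicate that both files contain the same set of ORFs.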

Software dependencies

Nothing new, all in conda envs.

Documentation

punt 👀

Related issues/things I won't address right now