Output nucleotide sequences for peptide predictions

PR checklist

[x] Tag the issue(s) or milestones this PR fixes (e.g. Fixes #123, Resolves #456).
[x] Describe the changes you've made.
[x] Describe any tests you have conducted to confirm that your changes behave as expected.
[x] If you've added new software dependencies, make sure that those dependencies are included in the appropriate conda environments.
[x] If you've added new functionality, make sure that the documentation is updated accordingly.
[x] If you encountered bugs or features that you won't address, but should be addressed eventually, create new issues for them.

PR Description

This PR addresses

Outputting nucleotide sequences

This is the bulk of the PR. I wanted the nucleotide sequences so we can calculate metrics like dN/dS and compare our predicted sequences in other ways. This required a lot of changes, but I think the code base is now more consistent, the output files are better named and organized, and the variable pointers in the Snakemake workflow are clearer.

Put all outputs in their own `predictions` folder

I think it will be easier for users to find everything they need if it's all in one place instead of buried in annotations. I also added more output files and left notes in places where we could add others if the need arises.

Variable and file naming updates

Note that the file naming scheme is somewhat consistent with convention (e.g., see here).

In summary --

fna: nucleotide FASTA file of input contig sequences. These would be transcripts from a transcriptome.
faa: protein FASTA file of translated CDS sequences.
- faa: protein FASTA file of translated CDS sequences for all predicted input proteins.
- parent_faa: protein FASTA file of translated CDS sequences for parent proteins of cleavage peptides.
- peptide_faa: protein FASTA file of translated predicted peptide sequences.
ffn: nucleotide FASTA file of CDS sequences.
- ffn: nucleotide FASTA file of CDS sequences for all predicted input CDSs.
- parent_ffn: nucleotide FASTA file of CDS sequences for parent CDSs of cleavage peptides.
- peptide_ffn: nucleotide FASTA file of predicted peptide sequences.
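
To illustrate how a peptide_ffn record relates to its parent_ffn record, here is a minimal sketch. It is not the code in this PR. It assumes the peptide is located by 0-based, end-exclusive amino acid coordinates on the parent protein and that the parent CDS is in frame, so each residue maps to three nucleotides. The function name `peptide_nucleotides` is hypothetical.

```python
def peptide_nucleotides(parent_cds: str, aa_start: int, aa_end: int) -> str:
    """Return the nucleotides encoding residues [aa_start, aa_end) of the parent protein.

    Assumes 0-based, end-exclusive amino acid coordinates and an in-frame CDS
    (both are assumptions of this sketch, not confirmed behavior of the scripts).
    """
    if not 0 <= aa_start <= aa_end:
        raise ValueError("peptide coordinates must satisfy 0 <= start <= end")
    # Each amino acid corresponds to one codon, i.e. three nucleotides.
    nt_start, nt_end = 3 * aa_start, 3 * aa_end
    if nt_end > len(parent_cds):
        raise ValueError("peptide extends past the end of the parent CDS")
    return parent_cds[nt_start:nt_end]


# Toy parent CDS: ATG AAA CGT TGG CCC TAA (M K R W P stop).
parent_cds = "ATGAAACGTTGGCCCTAA"
print(peptide_nucleotides(parent_cds, 1, 4))  # AAACGTTGG (codons for K, R, W)
```

If a tool reports 1-based or end-inclusive coordinates, the offsets would need adjusting before slicing.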

Tests

I ran the following commands and visually inspected the results to confirm they are correct.

conda activate sandbox # only requires biopython
cat demo/contigs_* > demo/tmp.fa
python scripts/extract_plmutils_nucleotide_sequences.py -n demo/tmp.fa -p outputs/demo/sORF/plmutils/peptides.faa -o tmp.fna

# only requires biopython
conda activate sandbox
python scripts/extract_deeppeptide_sequences.py outputs/demo/cleavage/deeppeptide/peptide_predictions.json demo/orfs_amino_acids.faa demo/orfs_nucleotides.fa tmp.faa tmp.fna tmp_pep.faa tmp_pep.fna tmp_predictions.tsv

# requires nlpprecursor & biopython etc.
conda activate .snakemake/conda/c111fa2b8a9cfe0d0c6028d0ebe9b492_
python scripts/run_nlpprecursor.py inputs/models/nlpprecursor/models/ outputs/demo/cleavage/preprocessing/noasterisk_nononstandardaa.faa demo/orfs_nucleotides.fa tmp.faa tmp.fna tmp_pep.faa tmp_pep.fna tmp_predictions.tsv

It would be good to have some automated testing set up, but that's another day's project. It will be worth the time sink if our initial results are promising for peptigate.

While testing, I noticed that I had taken the same number of lines from each of my big input files to make the demo files. This truncated the nucleotide ORFs, so there were fewer nucleotide sequences than proteins and the sequences didn't all match. It's fixed now!
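
A mismatch like this could be caught with a quick ID comparison between the protein and nucleotide FASTA files. The sketch below is not part of this PR. It uses only the standard library, takes the record ID to be the first whitespace-delimited token after `>`, and the function names `fasta_ids` and `compare_fasta_ids` are hypothetical.

```python
def fasta_ids(path):
    """Return the record IDs in a FASTA file, in file order.

    The ID is taken to be the first whitespace-delimited token after '>'
    (an assumption of this sketch).
    """
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                ids.append(fields[0] if fields else "")
    return ids


def compare_fasta_ids(protein_path, nucleotide_path):
    """Return (IDs only in the protein file, IDs only in the nucleotide file)."""
    protein_ids = set(fasta_ids(protein_path))
    nucleotide_ids = set(fasta_ids(nucleotide_path))
    only_protein = sorted(protein_ids - nucleotide_ids)
    only_nucleotide = sorted(nucleotide_ids - protein_ids)
    return only_protein, only_nucleotide
```

For example, calling `compare_fasta_ids("demo/orfs_amino_acids.faa", "demo/orfs_nucleotides.fa")` should return two empty lists if every ORF appears in both demo files.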

Arcadia-Science / peptigate