[X] Tag the issue(s) or milestones this PR fixes (e.g. Fixes #123, Resolves #456).
[x] Describe the changes you've made.
[x] Describe any tests you have conducted to confirm that your changes behave as expected.
[x] If you've added new software dependencies, make sure that those dependencies are included in the appropriate conda environments.
[x] If you've added new functionality, make sure that the documentation is updated accordingly.
[x] If you encountered bugs or features that you won't address, but should be addressed eventually, create new issues for them.
PR Description
This PR addresses #21, #28, and #9.
Outputting nucleotide sequences
This is the bulk of the PR. I wanted the nucleotide sequences so we can calculate things like dN/dS and compare our predicted sequences in more ways. This required a lot of changes, but I think the code base is now more consistent, the output files are better named and organized, and the variable pointers in the Snakemake workflow are clearer.
Put all outputs in their own predictions folder
I think it will be easier for users to find everything they need if it's all here instead of buried in annotations. I also added more output files and left notes in places where we could add others if the need arises.
Variable and file naming updates
Note that the file naming scheme is somewhat consistent with convention (e.g., see here).
In summary:
fna: nucleotide FASTA file of input contig sequences. These would be transcripts from a transcriptome.
faa: protein FASTA file of translated CDS sequences.
  faa: protein FASTA file of translated CDS sequences for all predicted input proteins.
  parent_faa: protein FASTA file of translated CDS sequences for parent proteins of cleavage peptides.
  peptide_faa: protein FASTA file of translated predicted peptide sequences.
ffn: nucleotide FASTA file of CDS sequences.
  ffn: nucleotide FASTA file of CDS sequences for all predicted input CDSs.
  parent_ffn: nucleotide FASTA file of CDS sequences for parent CDSs of cleavage peptides.
  peptide_ffn: nucleotide FASTA file of predicted peptide sequences.
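Because each ffn file is meant to hold the CDS behind the matching faa file, the pair can be cross-checked by length. The following is a minimal sketch, assuming paired records share an ID and that a CDS may or may not include its stop codon. `inconsistent_lengths` is a hypothetical helper for illustration, not code from this PR.

```python
def inconsistent_lengths(proteins, cds):
    """Return IDs whose CDS length does not fit the protein length.

    proteins and cds each map a record ID to its sequence string.
    """
    bad = []
    for record_id, protein in proteins.items():
        nucleotides = cds.get(record_id)
        if nucleotides is None:
            bad.append(record_id)  # no matching nucleotide record
            continue
        # Assumption: a trailing "*" marks a translated stop codon.
        n_codons = len(protein.rstrip("*"))
        # Accept the CDS with or without a terminal stop codon.
        if len(nucleotides) not in (3 * n_codons, 3 * n_codons + 3):
            bad.append(record_id)
    return bad
```

An empty return value means every protein has a nucleotide record of a plausible length; it does not confirm that the translation itself is correct.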
Tests
I ran the following code and visually inspected the results to confirm they are correct.
It would be good to have some automated testing set up, but that's another day's project, one that will be worth the time if our initial results for peptigate are promising.
While testing, I noticed that I had taken the same number of lines from each of my big input files to make the demo files. This truncated the nucleotide ORF file, so it had fewer sequences than the protein file and the sequences didn't all match. It's fixed now!
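One way to guard against this kind of mismatch is to subset demo files by record rather than by line, then confirm that the protein and nucleotide files contain the same IDs. The sketch below is only an illustration, not the code used in this PR. The function names are hypothetical, and it assumes matching records share the same first token in their header lines.

```python
from itertools import islice


def read_fasta(path):
    """Yield (record_id, sequence) pairs from a FASTA file."""
    record_id, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if record_id is not None:
                    yield record_id, "".join(chunks)
                # Assumption: the ID is the first whitespace-delimited token.
                fields = line[1:].split()
                record_id, chunks = (fields[0] if fields else ""), []
            elif line and record_id is not None:
                chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def write_first_records(in_path, out_path, n_records):
    """Write the first n_records complete records, whatever their line count.

    Only the record ID is kept in the header; any description is dropped.
    """
    with open(out_path, "w") as out:
        for record_id, seq in islice(read_fasta(in_path), n_records):
            out.write(f">{record_id}\n{seq}\n")


def mismatched_ids(faa_path, ffn_path):
    """Return the IDs that appear in only one of the two files."""
    faa_ids = {rid for rid, _ in read_fasta(faa_path)}
    ffn_ids = {rid for rid, _ in read_fasta(ffn_path)}
    return faa_ids ^ ffn_ids
```

Calling `write_first_records` with the same `n_records` on both input files keeps the demo files in step as long as the inputs list their records in the same order, and `mismatched_ids` returning an empty set is a quick confirmation.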
Software dependencies
Nothing new, all in conda envs.
Documentation
punt 👀
Related issues/things I won't address right now
#29
#30