czbiohub-sf / orpheum

Orpheum (Previously called and published under sencha) is a Python package for directly translating RNA-seq reads into coding protein sequence.
MIT License
18 stars 4 forks source link

Output only noncoding reads in --noncoding-nucleotide-fasta #104

Open taylorreiter opened 3 years ago

taylorreiter commented 3 years ago

Hello! I have a request/suggestion for the behavior of --noncoding-nucleotide-fasta. I'm currently trying to assess whether orpheum works on bacterial sequences, and one metric I'm using to assess accuracy is the % of noncoding nucleotide reads that still map to a reference CDS set (e.g., all genes in a genome). I noticed that the reads output in --noncoding-nucleotide-fasta have two problems for this use case. 1) reads are repeated because all coding frames are output. Only the read name changes however, the nucleotides remain the same. It would be more helpful to have each read occur in the file only once. 2) reads that are coding, but in a different translation frame, are still output. This isn't really helpful, as the read contains coding nucleotides, just in a different frame than is indicated by the read name. It would be more helpful if only reads that contain no coding sequences in any ORF were output to this file.

I can see where the way that --noncoding-nucleotide-fasta currently behaves could be useful for other use cases, would it maybe be possible to keep the current implementation, as well as support another, possibly by a flag like --output-all-translation-frames or something like that.

Phrasing this in a different way, I am requesting that --noncoding-nucleotide-fasta (or some other flag) contain only the number of reads indicated as "Non-coding" by the "categorization_counts" section of the summary json file output by orpheum

  "categorization_counts": {
    "Translation is shorter than peptide k-mer size + 1": 18,
    "Translation frame has stop codon(s)": 9486,
    "Coding": 109672,
    "Non-coding": 48419,
    "Low complexity nucleotide": 0,
    "Read length was shorter than 3 * peptide k-mer size": 0,
    "Low complexity peptide in protein20 alphabet": 0
  },

Thinking about this more, it would be really cool from a benchmarking perspective if there was a flag that would enable each read set in the summary json file to be output into it's own file. So, each of the categories below would end up as a file. That would be incredibly useful.

  "categorization_counts": {
    "Translation is shorter than peptide k-mer size + 1": 18,
    "Translation frame has stop codon(s)": 9486,
    "Coding": 109672,
    "Non-coding": 48419,
    "Low complexity nucleotide": 0,
    "Read length was shorter than 3 * peptide k-mer size": 0,
    "Low complexity peptide in protein20 alphabet": 0
  },
  "histogram_n_coding_frames_per_read": {
    "Number of reads with 1 putative protein-coding translations": 77100,
    "Number of reads with 2 putative protein-coding translations": 29707,
    "Number of reads with 4 putative protein-coding translations": 103,
    "Number of reads with 3 putative protein-coding translations": 2762
  },