czbiohub-sf / orpheum

Orpheum (Previously called and published under sencha) is a Python package for directly translating RNA-seq reads into coding protein sequence.
MIT License
18 stars 4 forks source link

orpheum translate assigns all reads to non-coding with a jaccard index of 0 #103

Open taylorreiter opened 3 years ago

taylorreiter commented 3 years ago

Environment (orpheum version 1.0.5.dev22+ga76c3f3, sourmash version 4.2.1, khmer version 2.1.1)

channels:
   - conda-forge
   - bioconda
   - defaults
dependencies:
   - sourmash
   - pip
   - pip:
       - git+https://github.com/czbiohub/orpheum@master

Code:

orpheum index --molecule protein --peptide-ksize 10 --save-as tmp_pan_genome_reference_k10.nodegraph pan_genome_reference.faa

orpheum translate --peptide-ksize 10  --peptides-are-bloom-filter --noncoding-nucleotide-fasta tmp.nuc_noncoding --coding-nucleotide-fasta tmp.nuc --csv tmp.csv --json-summary tmp.json tmp_pan_genome_reference_k10.nodegraph ../outputs/rgnv_sgc_original_results/4000_GCF_900036035.1_RGNV35913_genomic.fna.gz.cdbg_ids.reads.fa.gz > tmp.pep

Also tried it removing --peptides-are-bloom-filter and feeding it the protein sequences directly.

I ran these commands on ~600 fastq files, and they all produced empty *pep files and full *nuc_noncoding files. All of the reads in nuc noncoding have a jaccard of 0

>SRR2145359.148924__HWI-ST431:257:C1DT9ACXX:6:1101:8013:97867/1__translation_frame:2__jaccard:0.0
AATATTCTCACAACTTACAGACCAGTTCTTGGTCATCAGGAACTATGCGAAAATAACGAATGGTTTGTTTGCCAGGATCATTGGCAAAATTAGTGCACTTA
>SRR2145359.174446__HWI-ST431:257:C1DT9ACXX:6:1102:18950:21974/2__translation_frame:1__jaccard:0.0
ACAATATTCTCACAACTTACAGACCAGTTCTTGGTCATCAGGAACTATGCGAAAATAACGAATGGTTTGTTTGCCAGGATCATTGGCAAAATTAGTGCAC
>SRR2145359.175662__HWI-ST431:257:C1DT9ACXX:6:1102:17565:22992/1__translation_frame:3__jaccard:0.0
TATTGCAGAATGGTAAGTGCACTAATTTTGCCAATGATCCTGGCAAACAAACCATTCGTTATTTTCGCATAGTTCCTGATGACCAAGAACTGGTCTGTAAG
>SRR2145359.175662__HWI-ST431:257:C1DT9ACXX:6:1102:17565:22992/1__translation_frame:-1__jaccard:0.0
TATTGCAGAATGGTAAGTGCACTAATTTTGCCAATGATCCTGGCAAACAAACCATTCGTTATTTTCGCATAGTTCCTGATGACCAAGAACTGGTCTGTAAG
>SRR2145359.237474__HWI-ST431:257:C1DT9ACXX:6:1102:10897:63195/2__translation_frame:-3__jaccard:0.0
AGTGCACTAATTTTGCCAATGATCCTGGCAAACAAACCATTCGTTATTTTCGCATAGTTCCTGATGACCAAGAACTGGTCTGTAAGTTGTGAGAATATTG
>SRR2145359.253930__HWI-ST431:257:C1DT9ACXX:6:1102:8178:73159/1__translation_frame:-3__jaccard:0.0
ACTAATTTTGCCAATGATCCTGGCAAACAAACCATTCGTTATTTTCGCATAGTTCCTGATGACCAAGAACTGGTCTGTAAGTTGTGAGAATATTGTCTCAA
>SRR2145359.256097__HWI-ST431:257:C1DT9ACXX:6:1102:14263:74274/1__translation_frame:3__jaccard:0.0
GAAAGAGGATTGAGACAATATTCTCACAACTTACAGACCAGTTCTTGGTCATCAGGAACTATGCGAAAATAACGAATGGTTTGTTTGCCAGGATCATTGGC
>SRR2145359.257134__HWI-ST431:257:C1DT9ACXX:6:1102:7530:75048/1__translation_frame:2__jaccard:0.0
TAATGAAGTTTACGTATTGCAGAATGGTAAGTGCACTAATTTTGCCAATGATCCTGGCAAACAAACCATTCGTTATTTTCGCATAGTTCCTGATGACCAAG

The protein db I'm working with has 37K protein sequences in it, half of which came directly from fastq files I ran orpheum against (e.g., I megahit assembled the fastq files, prokka predicted protein seqs, and add them to a final fasta file of all of my protein sequences). So both with a k of 7 and 10 I expect many matches. I'm not sure what I'm doing wrong here. Any help would be greatly appreciated!

I'm attaching my db of protein sequences as well as one of my read files. pan_genome_reference.faa.gz 4000_GCF_900036035.1_RGNV35913_genomic.fna.gz.cdbg_ids.reads.fa.gz

taylorreiter commented 3 years ago

I solved this issue! The problem stemmed from * that encoded stop codons. Removing them via this script solved the issue.

wget https://raw.githubusercontent.com/spacegraphcats/2018-paper-spacegraphcats/master/pipeline-base/scripts/remove-stop-plass.py
python remove-stop-plass.py pan_genome_reference.faa