Open taylorreiter opened 3 years ago
I solved this issue! The problem stemmed from "*" characters in the protein sequences, which encode stop codons. Removing them with this script solved the issue:
wget https://raw.githubusercontent.com/spacegraphcats/2018-paper-spacegraphcats/master/pipeline-base/scripts/remove-stop-plass.py
python remove-stop-plass.py pan_genome_reference.faa
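For anyone who wants to check their own protein database before running orpheum, here is a minimal Python sketch of the same idea. It is not the contents of remove-stop-plass.py (I have not reproduced that script here); the function names `count_records_with_stops` and `strip_stop_codons` are made up for illustration, and it assumes a plain FASTA layout where header lines start with ">".

```python
def count_records_with_stops(fasta_text: str) -> int:
    """Count FASTA records whose sequence contains at least one '*'."""
    count = 0
    current_has_stop = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # New record: reset the flag for this record.
            current_has_stop = False
        elif "*" in line and not current_has_stop:
            # First '*' seen in this record: count the record once.
            current_has_stop = True
            count += 1
    return count


def strip_stop_codons(fasta_text: str) -> str:
    """Return FASTA text with '*' removed from sequence lines.

    Header lines (starting with '>') are left untouched. Lines that are
    empty after stripping are dropped.
    """
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            cleaned.append(line)
            continue
        stripped = line.replace("*", "")
        if stripped:
            cleaned.append(stripped)
    return "\n".join(cleaned) + "\n" if cleaned else ""


example = ">a\nMKV*\n>b\nMKL\nAA*\n*\n>c\nMMM\n"
n_with_stops = count_records_with_stops(example)
cleaned_example = strip_stop_codons(example)
```

To apply this to a real file, read its text, pass it through `strip_stop_codons`, and write the result to a new path so the original stays intact.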
Environment (orpheum version 1.0.5.dev22+ga76c3f3, sourmash version 4.2.1, khmer version 2.1.1)
Code:
I also tried removing --peptides-are-bloom-filter and feeding it the protein sequences directly.

I ran these commands on ~600 fastq files, and they all produced empty *pep files and full *nuc_noncoding files. All of the reads in the nuc_noncoding files have a Jaccard of 0.

The protein db I'm working with has 37K protein sequences in it, half of which came directly from fastq files I ran orpheum against (e.g., I assembled the fastq files with megahit, predicted protein sequences with prokka, and added them to a final fasta file of all of my protein sequences). So with a k of both 7 and 10 I expect many matches. I'm not sure what I'm doing wrong here. Any help would be greatly appreciated!
I'm attaching my db of protein sequences as well as one of my read files:
pan_genome_reference.faa.gz
4000_GCF_900036035.1_RGNV35913_genomic.fna.gz.cdbg_ids.reads.fa.gz