dib-lab / syrah

Output trusted regions from raw sequencing data.

khmer fp rate #8

Open taylorreiter opened 7 years ago

taylorreiter commented 7 years ago

For this command:

~/sratoolkit.2.8.1-3-ubuntu64/bin/fastq-dump.2.8.1-3 -A SRR1929297 -Z | syrah -k 31 | sourmash compute - -o SRR1929297_syrah.sig
syrah!

   reading sequences and outputting solid regions until
   we have seen ~1000000 31-mers.

creating counttable using 4 x 1561744 bytes
reading sequences from stdin
# running sourmash subcommand: compute
computing signatures for files: -
Computing signature for ksizes: [31]
Computing only DNA (and not protein) signatures.
Computing a total of 1 signatures.
... reading sequences from -
... consumed 1e+06 bases, ~553 kmers; output 6127 n bases
... consumed 2e+06 bases, ~2616 kmers; output 58892 n bases
... consumed 3e+06 bases, ~4649 kmers; output 151121 n bases
... consumed 4e+06 bases, ~6279 kmers; output 264041 n bases
... consumed 5e+06 bases, ~7644 kmers; output 388804 n bases

...

... - 1150000
... consumed 1.28e+08 bases, ~972248 kmers; output 4.56456e+07 n bases
... - 1160000
... consumed 1.29e+08 bases, ~989913 kmers; output 4.62401e+07 n bases
... - 1170000
... - 1180000
... consumed 1.3e+08 bases, ~1010360 kmers; output 4.68441e+07 n bases

reached 1000000 kmers; success! ending now.
**
** ERROR: the graph structure is too small for 
** this data set.  Increase data structure size
** with --max_memory_usage/-M.
**
** Do not use these results!!
**
** (estimated false positive rate of 0.996; max recommended 0.800)
**
khmer counting fp rate unexpectedly high - 0.996.
there is something unusual about your data.
don't trust these results.
calculated 1 signatures for 1184155 sequences in -

But sourmash and syrah both complain if I add -M to their commands. I also looked through the syrah code for a way to change the parameter explicitly and couldn't see how to do it.
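For a rough sense of scale, here is a minimal sketch of the standard occupancy approximation for a multi-table counting structure, applied to the numbers in the log above. It is not khmer's actual code. The function names are hypothetical, and it assumes that "4 x 1561744 bytes" means four tables of 1561744 one-byte bins and that the reported rate behaves like this approximation.

```python
import math

def expected_fp_rate(n_distinct_kmers, table_size, n_tables):
    """Approximate false-positive rate for n_tables tables of table_size bins.

    Assumes uniform hashing: a bin is occupied with probability
    1 - exp(-n / m), and a false positive needs a hit in every table.
    """
    occupancy = 1.0 - math.exp(-n_distinct_kmers / table_size)
    return occupancy ** n_tables

def distinct_kmers_for_fp(fp_rate, table_size, n_tables):
    """Invert the approximation: distinct k-mers that would give fp_rate."""
    occupancy = fp_rate ** (1.0 / n_tables)
    return -table_size * math.log(1.0 - occupancy)

# Values taken from the log; their interpretation is an assumption.
TABLE_SIZE = 1561744
N_TABLES = 4

fp_at_1m = expected_fp_rate(1_000_000, TABLE_SIZE, N_TABLES)
implied_kmers = distinct_kmers_for_fp(0.996, TABLE_SIZE, N_TABLES)

print(f"fp rate with 1M distinct k-mers: {fp_at_1m:.3f}")
print(f"distinct k-mers implied by fp=0.996: {implied_kmers:.3g}")
```

Under these assumptions, 1M distinct k-mers would give a false positive rate of roughly 0.05, while 0.996 would correspond to something on the order of ten million distinct k-mers in the table. If the approximation applies, that suggests the table is absorbing far more distinct k-mers than the ~1M counted in the progress lines.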

ctb commented 7 years ago

What kind of data?

taylorreiter commented 7 years ago

Litchi RNA-seq. Samples ranging from 6M paired-end reads to 44M single-end reads all produced the same error.