ddunlap4 / ea-utils

Automatically exported from code.google.com/p/ea-utils

Giant FASTQ support in stats #33

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Some of the stats programs offer kmer reports (with -K) and probe-id 
counting (with -D).

These programs can consume a lot of RAM (>10GB), even with the highly efficient 
sparsehash library on very large files (> 200 mil reads).

A disk-backed key-value store, like LevelDB, could see decent performance, 
much like a hash, while also allowing growth past available RAM. I'm thinking 
that the code should switch to a DB-backed store at the 200 mil record level. 
This would slow things down by about 3x (from 1 mil writes/sec to 300k 
writes/sec), but would also allow infinite growth. Enabling a large LRU cache 
could make it perform so similarly that the sparse hash can be abandoned, 
especially if the DB remains an insignificant fraction of the stats 
collection process.
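
A rough sketch of the idea, not the ea-utils code: counts accumulate in an in-memory hash and are merged into a disk-backed table once the hash reaches a record limit. Python's standard-library `sqlite3` stands in for LevelDB here so the example is self-contained, and the class name `SpillingCounter`, the table layout, and the `limit` parameter are all hypothetical. It flushes in batches rather than switching stores permanently, which is one possible variant of the proposal.

```python
import sqlite3


class SpillingCounter:
    """Count keys in a dict; merge into a disk-backed table when the dict fills."""

    def __init__(self, path=":memory:", limit=200_000_000):
        self.limit = limit  # max distinct keys held in RAM (hypothetical knob)
        self.mem = {}
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS counts "
            "(k TEXT PRIMARY KEY, n INTEGER NOT NULL)"
        )

    def add(self, key):
        self.mem[key] = self.mem.get(key, 0) + 1
        if len(self.mem) >= self.limit:
            self.flush()

    def flush(self):
        # Merge the in-memory counts into the on-disk table, then free the RAM.
        for k, n in self.mem.items():
            self.db.execute(
                "INSERT OR IGNORE INTO counts (k, n) VALUES (?, 0)", (k,)
            )
            self.db.execute("UPDATE counts SET n = n + ? WHERE k = ?", (n, k))
        self.db.commit()
        self.mem.clear()

    def count(self, key):
        # Total is whatever is still buffered plus whatever was spilled.
        row = self.db.execute(
            "SELECT n FROM counts WHERE k = ?", (key,)
        ).fetchone()
        return self.mem.get(key, 0) + (row[0] if row else 0)
```

With a tiny limit such as `SpillingCounter(limit=2)`, every second distinct key triggers a flush, which makes the spill path easy to exercise on a handful of records.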

Original issue reported on code.google.com by earone...@gmail.com on 9 Jul 2014 at 2:26

GoogleCodeExporter commented 8 years ago

Original comment by earone...@gmail.com on 9 Jul 2014 at 2:27

GoogleCodeExporter commented 8 years ago
LevelDB was 50x slower. So sad. Some optimizations were done to reduce 
memory use. Need to look at more options.

Original comment by earone...@gmail.com on 20 Aug 2014 at 5:09

GoogleCodeExporter commented 8 years ago
Going to do this by a) detecting input that is pre-sorted by probe-id when run 
with -D; if detected, RAM is freed and duplicate detection proceeds 
without the need for a hash. b) Other hashes (like kmers) can be switched to 
some sort of counting Bloom filter.
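
Both ideas can be sketched briefly, assuming string ids and kmers; none of these names come from ea-utils. `count_duplicates_if_sorted` streams the ids, comparing each with the previous one, and gives up (returns `None`) as soon as the order breaks so the caller can fall back to a hash. `CountingBloom` is a minimal counting Bloom filter with 8-bit saturating counters; its estimates can be too high but, until a counter saturates at 255, never too low. The sizes and hash count are arbitrary illustration values.

```python
import hashlib


def count_duplicates_if_sorted(ids):
    """Return (total, duplicates) if ids arrive in sorted order, else None."""
    prev = None
    total = dups = 0
    for cur in ids:
        if prev is not None and cur < prev:
            return None  # order broken: caller must fall back to a hash
        if cur == prev:
            dups += 1  # same id as the previous record, so a duplicate
        prev = cur
        total += 1
    return total, dups


class CountingBloom:
    """Approximate counts in fixed memory, using 8-bit saturating counters."""

    def __init__(self, size=1 << 20, hashes=4):
        self.size = size
        self.hashes = hashes
        self.slots = bytearray(size)

    def _positions(self, key):
        # Derive several slot positions from one digest (double hashing).
        digest = hashlib.blake2b(key.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hashes)]

    def add(self, key):
        for p in self._positions(key):
            if self.slots[p] < 255:  # saturate instead of wrapping around
                self.slots[p] += 1

    def estimate(self, key):
        # The smallest counter is the tightest upper bound on the true count.
        return min(self.slots[p] for p in self._positions(key))
```

The sorted path needs only the previous id in memory, which is why the hash can be freed once sorted input is detected; the Bloom filter trades exact counts for a fixed memory footprint.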

Original comment by earone...@gmail.com on 8 Sep 2014 at 8:16