dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/

Feature request: compatibility wrapper for PBcR-MHAP #724

Open mr-c opened 9 years ago

mr-c commented 9 years ago

User story: PBcR-MHAP uses Jellyfish configured to request 1 TB of memory. Luiz wants to use less memory.

http://sourceforge.net/p/wgs-assembler/svn/HEAD/tree/trunk/src/AS_PBR/PBcR.pl#l1621

A single script could build a counting hash from the input sequences, calculate the cutoff value (see http://sourceforge.net/p/wgs-assembler/svn/HEAD/tree/trunk/src/AS_PBR/PBcR.pl#l1634 ), and then re-read the sequences to output the count for each k-mer (possibly using a presence table to avoid over-reporting).
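A minimal sketch of that two-pass idea, in plain Python rather than khmer itself. Everything named here is an assumption for illustration: `collections.Counter` stands in for the counting hash, a `set` stands in for the presence table, `count_then_report` and `kmers` are hypothetical helpers, and `cutoff` is simply a caller-supplied minimum count (the rule PBcR.pl uses to derive it is not reproduced). Reverse-complement handling is also left out.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq, uppercased."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_then_report(reads, k, cutoff):
    """Two-pass sketch: count all k-mers, then re-read and report each once.

    Counter is a stand-in for a counting hash and `seen` for a presence
    table. `cutoff` is a caller-supplied minimum count, NOT PBcR's rule.
    `reads` must be re-iterable (e.g. a list), since it is read twice.
    """
    counts = Counter()
    for read in reads:               # pass 1: build the counts
        counts.update(kmers(read, k))

    seen = set()
    report = []
    for read in reads:               # pass 2: re-read and emit counts
        for kmer in kmers(read, k):
            if kmer in seen:         # already handled: avoid over-reporting
                continue
            seen.add(kmer)
            if counts[kmer] >= cutoff:
                report.append((kmer, counts[kmer]))
    return report
```

With real data the exact structures would be replaced by khmer's probabilistic ones, which is where the memory saving over a 1 TB Jellyfish request would come from.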

Bonus: format the output to match http://sourceforge.net/p/wgs-assembler/svn/HEAD/tree/trunk/src/AS_PBR/PBcR.pl#l1638

Since we have a different workflow, this doesn't have to recreate the same command-line options, but it should produce compatible output.

jellyfish count [-> jellyfish merge] -> jellyfish histo -> jellyfish dump

Usage: jellyfish histo [options] db:path

Create an histogram of k-mer occurrences

Create an histogram with the number of k-mers having a given
count. In bucket 'i' are tallied the k-mers which have a count 'c'
satisfying 'low+i*inc <= c < low+(i+1)*inc'. Buckets in the output are
labeled by the low end point (low+i*inc).

The last bucket in the output behaves as a catchall: it tallies all
k-mers with a count greater or equal to the low end point of this
bucket.

Options (default value in (), *required):
 -l, --low=uint64                         Low count value of histogram (1)
 -h, --high=uint64                        High count value of histogram (10000)
 -i, --increment=uint64                   Increment value for buckets (1)
 -t, --threads=uint32                     Number of threads (1)
 -f, --full                               Full histo. Don't skip count 0. (false)
 -o, --output=string                      Output file
 -v, --verbose                            Output information (false)
 -U, --usage                              Usage
     --help                               This message
     --full-help                          Detailed help
 -V, --version                            Version

Jellyfish histogram files are space delimited.
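To make the bucket rule in the help text concrete, here is a rough sketch of a space-delimited histogram writer. `histo_lines` is a hypothetical helper, not part of Jellyfish or khmer. Three choices are assumptions that have not been checked against Jellyfish's real output: the last bucket is the largest label not exceeding `high`, counts below `low` are ignored, and empty buckets are omitted.

```python
from collections import Counter


def histo_lines(kmer_counts, low=1, high=10000, inc=1):
    """Return 'label count' lines following low+i*inc <= c < low+(i+1)*inc."""
    n_buckets = (high - low) // inc + 1      # assumption: last label <= high
    buckets = Counter()
    for c in kmer_counts:
        if c < low:
            continue                         # assumption: skip counts below low
        # The last bucket is a catchall for every count at or above its label.
        i = min((c - low) // inc, n_buckets - 1)
        buckets[i] += 1
    # Space-delimited, labeled by the low end point of each non-empty bucket.
    return ["%d %d" % (low + i * inc, buckets[i]) for i in sorted(buckets)]
```

For example, `histo_lines([1, 1, 2, 5, 7, 100], low=1, high=5, inc=2)` puts counts 1, 1 and 2 in the bucket labeled 1, and 5, 7 and 100 in the catchall bucket labeled 5.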

Usage: jellyfish dump [options] db:path

Dump k-mer counts

By default, dump in a fasta format where the header is the count and
the sequence is the sequence of the k-mer. The column format is a 2
column output: k-mer count.

Options (default value in (), *required):
 -c, --column                             Column format (false)
 -t, --tab                                Tab separator (false)
 -L, --lower-count=uint64                 Don't output k-mer with count < lower-count
 -U, --upper-count=uint64                 Don't output k-mer with count > upper-count
 -o, --output=string                      Output file
     --usage                              Usage
 -h, --help                               This message
 -V, --version                            Version
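
A compatibility wrapper would need to mimic the two output layouts described in the help text above. The following is only a sketch: `dump_lines` is a hypothetical helper, and details such as using a bare `>` followed by the count as the FASTA header and a single space as the default column separator are assumptions read off the help text, not verified against Jellyfish output.

```python
def dump_lines(kmer_counts, column=False, tab=False, lower=None, upper=None):
    """Format (kmer, count) pairs in a dump-like FASTA or column layout."""
    sep = "\t" if tab else " "       # assumption: space unless --tab is given
    lines = []
    for kmer, count in kmer_counts:
        if lower is not None and count < lower:
            continue                 # mirrors -L / --lower-count
        if upper is not None and count > upper:
            continue                 # mirrors -U / --upper-count
        if column:
            lines.append("%s%s%d" % (kmer, sep, count))
        else:
            lines.append(">%d" % count)   # header is the count
            lines.append(kmer)            # sequence is the k-mer
    return lines
```

Feeding it the `(kmer, count)` pairs produced by any counting step and writing the lines out with `"\n".join(...)` would give a file in either layout.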

macmanes commented 9 years ago

I'll +1 the idea of being able to spit out a FASTA file of k-mers like jellyfish dump does. I assume this functionality does not currently exist (?).

ctb commented 9 years ago

@macmanes - see sandbox/count-kmers.py and sandbox/count-kmers-single.py.