alexpreynolds / kmer-counter

Count kmers with a more efficient (faster) hash table
MIT License
bioinformatics cpp14 genomics hash-map kmer kmer-counting

kmer-counter

Compilation

Run make to build the kmer-counter binary.

This has been compiled under Ubuntu 18.04.4, Cygwin 3.1.4, and Mac OS X 10.15.3, using the GCC/glibc and Clang toolchains current for each platform.

Usage

Run kmer-counter --help for a list of options.

There are two ways to use kmer-counter.

  1. You can provide a single-line FASTA input and write counts to standard output, e.g.:
$ ./kmer-counter --fasta --k=6 sequences.fa
>foo    CGTTAA:1 TTAACG:1
>bar    TTCTTA:1 TAGGGC:1 AAATTC:1 GTGGAA:1 AACTTC:1 ...
...
  2. For a more complex use case, you can provide a four-column BED file with the interval's genomic sequence in the fourth column (i.e., the ID field), along with the value of k for the k-mers you want to count, an offset value for mer-keys (explained below), and a directory in which to write results, e.g.:
$ ./kmer-counter --bed --k=6 --offset=12195 --results-dir="6mers" intervals.bed4

The above example generates 6-mers of the sequences from the file intervals.bed4.
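As a rough illustration of what counting k-mers involves, here is a minimal sketch that slides a window of width k along a sequence and tallies each substring. The function name `count_kmers` is hypothetical, and `std::unordered_map` stands in for the hash table this project actually uses. The sketch counts the forward strand only. The sample FASTA output above looks consistent with reverse complements also being counted (TTAACG is the reverse complement of CGTTAA), but that is an inference from the example, not documented behavior.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper, not taken from kmer-counter's source.
// Counts every k-mer on the forward strand of `seq` using a sliding window.
std::unordered_map<std::string, std::size_t>
count_kmers(const std::string& seq, std::size_t k) {
    std::unordered_map<std::string, std::size_t> counts;
    if (k == 0 || seq.size() < k) {
        return counts;  // no complete window fits in the sequence
    }
    // A sequence of length n contains n - k + 1 windows of width k.
    for (std::size_t i = 0; i + k <= seq.size(); ++i) {
        ++counts[seq.substr(i, k)];  // operator[] default-initializes new entries to 0
    }
    return counts;
}
```

Calling `count_kmers("AAAAAAA", 6)` would return a single entry, AAAAAA with a count of 2, because a 7-base sequence holds two overlapping 6-base windows.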

The results are stored in a folder called 6mers, which contains two files: count.bed and map.txt.

The first file, count.bed, is a BED file of the intervals from intervals.bed4, where the fourth column contains space-delimited pairs of a mer-key and the number of times that key is seen. Mer-keys are numbers that begin at the offset value provided on the command line.

The second file, map.txt, contains tab-delimited pairings of each mer and its mer-key, as used in count.bed.
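One way to picture the relationship between mers and mer-keys is a small table that hands out consecutive integers starting at the offset. The sketch below is an assumption for illustration only: the class name `MerKeyMap` is hypothetical, and assigning keys in order of first appearance is a guess, since the README does not say how kmer-counter orders its keys.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper, not taken from kmer-counter's source.
// Gives each distinct mer an integer key, starting at `offset`,
// in order of first appearance (an assumed ordering).
class MerKeyMap {
public:
    explicit MerKeyMap(std::size_t offset) : next_key_(offset) {}

    // Return the key for `mer`, assigning the next free key if it is new.
    std::size_t key_for(const std::string& mer) {
        auto it = keys_.find(mer);
        if (it != keys_.end()) {
            return it->second;  // already seen: reuse its key
        }
        keys_.emplace(mer, next_key_);
        return next_key_++;
    }

    // Number of distinct mers seen so far.
    std::size_t size() const { return keys_.size(); }

private:
    std::size_t next_key_;
    std::unordered_map<std::string, std::size_t> keys_;
};
```

With an offset of 12195, as in the example command, the first new mer passed to `key_for` would receive key 12195 and the next new mer 12196. Repeated mers get back the key they were first given.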

Notes

I am using a hash table implementation from Emil Ernerfeldt. A discussion about performance characteristics compared with the C++ STL std::unordered_map is available from the author.