Can I get the detailed K-mer abundance information?

ZhuangZK commented 5 years ago

The result of abundance-dist.py looks like this:

abundance,count,cumulative,cumulative_fraction
0,0,0,0.0
1,6694694,6694694,0.48
2,906389,7601083,0.545
3,592628,8193711,0.588
4,524304,8718015,0.626
5,488859,9206874,0.661

I wonder if I can get the exact k-mer sequence instead of the number in the first column (the abundance column).

standage commented 5 years ago

I don't know if khmer provides any way to do this out-of-the-box. The problem is that the CountMin sketch (Countgraph or Counttable objects in khmer) doesn't store the k-mer sequence, only counts indexed by the k-mer's hash value. If you know the k-mer sequence, you can query for its abundance, but you can't determine the k-mer from the CountMin sketch alone.
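To illustrate the point above, here is a minimal, generic CountMin sketch in plain Python. It is not khmer's implementation: the class name `TinyCountMin`, the table sizes, and the SHA-256 based hashing are all assumptions chosen for the example. It only shows that the tables hold counters and nothing from which a sequence could be read back.

```python
import hashlib


class TinyCountMin:
    """Toy CountMin sketch (hypothetical; not how khmer implements it)."""

    def __init__(self, width=1009, depth=4):
        self.width = width
        # One row of counters per hash function; no k-mer strings are kept.
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # Derive one column per row by salting the hash with the row index.
        for row in range(len(self.tables)):
            digest = hashlib.sha256(f"{row}:{kmer}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, kmer):
        for row, col in self._slots(kmer):
            self.tables[row][col] += 1

    def get(self, kmer):
        # The minimum over rows never undercounts; collisions can overcount.
        return min(self.tables[row][col] for row, col in self._slots(kmer))


sketch = TinyCountMin()
for kmer in ["ACGT", "ACGT", "CGTA"]:
    sketch.add(kmer)

print(sketch.get("ACGT"))  # needs the sequence as input to look up a count
```

Looking up a count requires supplying the k-mer, and the stored state is just rows of integers, which is why the sequences have to come from somewhere else (the reads themselves).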

One way to do this would be to count k-mers with load-into-counting.py, and then iterate over the reads again and query the count of each k-mer. If you weren't careful, you'd end up reporting many of the k-mers multiple times, which is probably not what you want. Storing the k-mers so that each is only reported once will take A LOT of memory, depending on the number and size of your sample(s).
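The second pass described above could be sketched roughly as follows. This is plain Python with no khmer calls: `get_count` is a hypothetical callable standing in for whatever abundance lookup you have (with khmer it would presumably be a query against the counts saved by load-into-counting.py, which is not verified here), and an exact `Counter` is swapped in so the example runs on its own. The `seen` set is the memory cost mentioned above.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer in seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmer_abundances(reads, k, get_count):
    """Second pass: report each distinct k-mer once with its count.

    get_count is a hypothetical lookup function (k-mer -> abundance).
    """
    seen = set()  # grows with the number of distinct k-mers
    for read in reads:
        for kmer in kmers(read, k):
            if kmer not in seen:
                seen.add(kmer)
                yield kmer, get_count(kmer)


# Stand-in for the first pass: exact counts instead of a CountMin sketch.
reads = ["ACGTAC", "CGTACG"]
k = 4
exact = Counter(km for r in reads for km in kmers(r, k))

table = list(kmer_abundances(reads, k, exact.__getitem__))
for kmer, count in table:
    print(f"{kmer},{count}")
```

Dropping the `seen` set removes the memory overhead but reports a k-mer every time it occurs in the reads, so the output would then need to be deduplicated some other way (for example by sorting on disk).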

Is there a specific reason you need every k-mer sequence? If you let us know what you're trying to do, perhaps we can help you find an alternative path to your goal.

ctb commented 5 years ago

see abundance-dist.py and abundance-dist-single.py in scripts/ for code that does this :)

dib-lab / khmer

Can I get the detailed K-mer abundance information? #1890