Open ZhuangZK opened 5 years ago
I don't know if khmer provides any way to do this out of the box. The problem is that the CountMin sketch (the Countgraph or Counttable objects in khmer) doesn't store the k-mer sequence, only counters indexed by the k-mer's hash value. If you know the k-mer sequence, you can query for its abundance, but you can't determine the k-mer from the CountMin sketch alone.
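To make that concrete, here is a toy CountMin sketch in plain Python. It is not khmer's implementation; the `TinyCountMin` class, its hash function, and its sizes are all made up for illustration. The point is only that the table holds counters and nothing from which a sequence could be read back.

```python
import hashlib


class TinyCountMin:
    """Toy CountMin sketch (hypothetical, not khmer's code)."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        # Only integer counters are stored; no k-mer strings anywhere.
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, kmer, row):
        # A different salt per row gives one independent-looking hash per row.
        digest = hashlib.blake2b(
            kmer.encode(), salt=bytes([row]), digest_size=8
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for row in range(self.depth):
            self.rows[row][self._index(kmer, row)] += 1

    def get(self, kmer):
        # The minimum over rows can overestimate, but never underestimates.
        return min(
            self.rows[row][self._index(kmer, row)] for row in range(self.depth)
        )


def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


sketch = TinyCountMin()
for read in ["ACGTACGT", "CGTACGTA"]:
    for kmer in kmers(read, 4):
        sketch.add(kmer)

# Querying works when you already know the sequence...
acgt_count = sketch.get("ACGT")  # at least 3, the true count
# ...but sketch.rows is just lists of integers, so there is nothing to enumerate.
```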
One way to do this would be to count k-mers with load-into-counting.py, and then iterate over the reads again and query the count of each k-mer. If you weren't careful, you'd end up reporting many of the k-mers multiple times, which is probably not what you want. Storing the k-mers so that each is only reported once can take A LOT of memory, depending on the number and size of your sample(s).
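A rough sketch of that second pass, in plain Python rather than khmer calls. The names `iter_kmers` and `report_kmer_counts` are hypothetical, and `get_count` stands in for whatever abundance query you have (with khmer, a lookup on the loaded count table). The demo uses an exact `Counter` in place of a real sketch, and it ignores reverse complements.

```python
from collections import Counter


def iter_kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def report_kmer_counts(reads, k, get_count):
    """Second pass: yield (kmer, count) once per distinct k-mer.

    The `seen` set is what prevents duplicate reporting, and it is
    also the part that grows with the number of distinct k-mers.
    """
    seen = set()
    for read in reads:
        for kmer in iter_kmers(read, k):
            if kmer in seen:
                continue
            seen.add(kmer)
            yield kmer, get_count(kmer)


reads = ["ACGTACGT", "CGTACGTA"]
k = 4

# Stand-in for the first (counting) pass; an exact counter, not a sketch.
exact = Counter(kmer for read in reads for kmer in iter_kmers(read, k))

table = list(report_kmer_counts(reads, k, lambda kmer: exact[kmer]))
for kmer, count in table:
    print(f"{kmer},{count}")
```

For large data sets the `seen` set is the memory cost mentioned above; writing every (k-mer, count) pair to disk and deduplicating afterwards with an external sort is one way to trade memory for I/O.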
Is there a specific reason you need every k-mer sequence? If you let us know what you're trying to do, perhaps we can help you find an alternative path to your goal.
See abundance-dist.py and abundance-dist-single.py in scripts/ for code that does this :)
The result of abundance-dist.py looks like this:
abundance,count,cumulative,cumulative_fraction
0,0,0,0.0
1,6694694,6694694,0.48
2,906389,7601083,0.545
3,592628,8193711,0.588
4,524304,8718015,0.626
5,488859,9206874,0.661
I wonder if I can get the exact k-mer sequence instead of the number in the first column (abundance).