dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/

How to export kmer abundances? #1862

Open olgabot opened 6 years ago

olgabot commented 6 years ago

Hello there! I'm interested in using khmer to count and filter k-mers from transcriptome datasets, and then use the raw k-mer counts across multiple (~10) samples as the "X" matrix for a classification algorithm, e.g. in scikit-learn or TensorFlow. How can one combine and extract the k-mer counts into, say, an HDF5 or sparse-matrix file? Warmest, Olga

ctb commented 6 years ago

Hi Olga, I think this is better done with sourmash, actually; sourmash applies a random subsampling algorithm (MinHash) to extract a subset of k-mers, which is much more manageable than using all of them! We have a decent-ish Python API (well, the API is fine, but the docs are a bit underpolished).
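To make the subsampling idea concrete, here is a minimal pure-Python sketch of the "scaled" flavour of MinHash: hash every k-mer and keep only those whose hash lands in the lowest 1/scaled fraction of the hash space, tracking abundance for the ones kept. This is an illustration under stated assumptions, not sourmash's implementation: the function names (`kmer_hash`, `canonical`, `subsampled_counts`) are hypothetical, and blake2b is used only as a stand-in hash function.

```python
import hashlib
from collections import Counter

MAX_HASH = 2**64  # size of the 64-bit hash space assumed in this sketch

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmer_hash(kmer: str) -> int:
    """Hash a k-mer to a 64-bit integer (blake2b is a stand-in hash here)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


def subsampled_counts(seq: str, k: int, scaled: int) -> Counter:
    """Count only k-mers whose hash falls in the lowest 1/scaled of hash space."""
    threshold = MAX_HASH // scaled
    counts = Counter()
    for i in range(len(seq) - k + 1):
        h = kmer_hash(canonical(seq[i:i + k]))
        if h < threshold:
            counts[h] += 1  # keyed by hash value, not by the k-mer string
    return counts
```

Because the keep-or-drop decision depends only on the hash value, the same k-mers are retained in every sample, so the subsampled counts stay comparable across samples. With `scaled=1` nothing is dropped.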

If you really want all k-mers, then I can point you towards another set of code, bbhash, which will let you construct a minimal perfect hash function for tracking (and counting) k-mers. We just wrote a Python wrapper for it: https://github.com/dib-lab/pybbhash (see also the references therein) :)

The latter route involves a bit more alpha code but I can give you instructions for getting started. Thoughts on which approach seems more interesting?

olgabot commented 6 years ago

Thanks for the response, I'll try out sourmash for now! I'm not deep enough into the project to want to deal with alpha code :)

Will keep you posted!

olgabot commented 6 years ago

though if I have time, is there a way to use pybbhash on existing kmer graphs created by khmer? I'd like to put these 60GB files to good use

standage commented 6 years ago

> Is there a way to use pybbhash on existing kmer graphs created by khmer?

I don't think so. The hashing strategy is quite different between khmer and the MPHF used by pybbhash. :(

phiweger commented 3 years ago

@luizirber is there a sourmash-native rust fn we can use for kmer counting with the recent version of sourmash?

luizirber commented 3 years ago

> @luizirber is there a sourmash-native rust fn we can use for kmer counting with the recent version of sourmash?

Hmm, I guess you could use scaled=1 and the BTree-backed MinHash from https://github.com/dib-lab/sourmash/pull/1045, but this is only exposed through the sourmash compute CLI, not in the Python API*. And that would still use a lot of memory for large datasets...
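For context, keeping every hash (as scaled=1 would) amounts to exact k-mer counting. Below is a rough pure-Python sketch of exact counting over several samples, assembled into the sparse sample-by-k-mer layout asked about at the top of the thread. It does not use sourmash or khmer; `count_kmers` and `to_coo` are hypothetical helper names, and holding every k-mer in a dictionary like this is exactly what becomes memory-hungry on large datasets.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Exact count of every k-mer in one sequence (no subsampling)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def to_coo(samples: dict, k: int):
    """Build COO triplets (rows, cols, vals) plus the sorted k-mer vocabulary.

    `samples` maps a sample name to its sequence; row i corresponds to the
    i-th sample in insertion order, column j to vocab[j].
    """
    per_sample = {name: count_kmers(seq, k) for name, seq in samples.items()}
    # Union of all k-mers seen in any sample defines the matrix columns.
    vocab = sorted(set().union(*per_sample.values()))
    col_of = {kmer: j for j, kmer in enumerate(vocab)}

    rows, cols, vals = [], [], []
    for i, counts in enumerate(per_sample.values()):
        for kmer, n in counts.items():
            rows.append(i)
            cols.append(col_of[kmer])
            vals.append(n)
    return rows, cols, vals, vocab
```

If SciPy is available, the triplets can be handed to `scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(len(samples), len(vocab)))` to get a matrix usable as "X" in scikit-learn.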

ctb commented 3 years ago

a few things --