Open olgabot opened 6 years ago
Hi Olga, I think this is better done with sourmash, actually; sourmash applies a random subsampling algorithm (MinHash) to extract a subset of k-mers, which is much more manageable than using all of them! We have a decent-ish Python API (well, the API is fine, but the docs are a bit underpolished).
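To make the subsampling idea concrete, here is a minimal pure-Python sketch of a bottom-n MinHash: hash every k-mer and keep only the smallest few hashes. This is not sourmash's code or API. The function names are made up for illustration, `hashlib.blake2b` stands in for the hash function sourmash actually uses, and the sketch skips reverse-complement handling.

```python
import hashlib
import heapq


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (blake2b is an illustrative stand-in)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(seq, k=21, num=10):
    """Keep the `num` smallest distinct k-mer hashes (a bottom-n sketch)."""
    hashes = {hash_kmer(km) for km in kmers(seq, k)}
    return sorted(heapq.nsmallest(num, hashes))


# Hypothetical toy sequence, 55 bases long.
seq = "ATGGCGTACGTTAGCCTAGGATCCGATTACAGGCTTAACGGATCGTACCGATGCA"
sketch = minhash_sketch(seq, k=21, num=10)
print(len(sketch), "hashes kept out of", len(seq) - 21 + 1, "k-mers")
```

Because the hash is effectively random, the retained hashes behave like a random sample of the k-mers, and the sketch size stays fixed no matter how large the input is.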
If you really want all k-mers, then I can point you towards another set of code, bbhash (https://github.com/dib-lab/pybbhash and references therein; we just wrote a Python wrapper for it :)), that will let you construct a minimal perfect hash function for tracking (and counting) k-mers.
The latter route involves a bit more alpha code but I can give you instructions for getting started. Thoughts on which approach seems more interesting?
Thanks for the response, I'll try out sourmash for now! I'm not deep enough into the project to want to deal with alpha code :)
Will keep you posted!
Though if I have time, is there a way to use pybbhash on existing kmer graphs created by khmer? I'd like to put these 60GB files to good use.
I don't think so. The hashing strategy is quite different between khmer and the MPHF used by pybbhash. :(
@luizirber is there a sourmash-native rust fn we can use for kmer counting with the recent version of sourmash?
Hmm, I guess you could use scaled=1 and the BTree-backed MinHash from https://github.com/dib-lab/sourmash/pull/1045, but this is only exposed to the sourmash compute CLI, not in the Python API*. But that would still use a lot of memory for large datasets...
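A rough illustration of why scaled=1 is memory-hungry, assuming the usual "keep hashes below a threshold" reading of the scaled parameter: with scaled=1 the threshold covers the whole hash space, so every distinct k-mer is retained. The code below is a stdlib-only sketch with made-up function names and a stand-in hash; it is not sourmash's implementation.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space assumed in this sketch


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (blake2b is an illustrative stand-in)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_hashes(seq, k, scaled):
    """Keep hashes below MAX_HASH / scaled, i.e. roughly 1 in `scaled` k-mers."""
    threshold = MAX_HASH // scaled
    kept = set()
    for i in range(len(seq) - k + 1):
        h = hash_kmer(seq[i:i + k])
        if h < threshold:
            kept.add(h)
    return kept


# Hypothetical random sequence, only to compare the two settings.
random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(5000))

all_hashes = scaled_hashes(seq, k=21, scaled=1)    # nothing is filtered out
subsampled = scaled_hashes(seq, k=21, scaled=100)  # about 1% of k-mers survive
print(len(all_hashes), len(subsampled))
```

With scaled=1 the retained set grows with the number of distinct k-mers in the data, which is the memory concern raised above.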
a few things --
Hello there! I'm interested in using khmer to count and filter kmers from transcriptome datasets and then use the raw kmer counts across multiple (~10) samples as the "X" matrix for a classification algorithm, e.g. in scikit-learn or tensorflow. How can one combine and extract the kmer counts to, say, an hdf5 or sparse matrix file? Warmest, Olga
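For reference, the shape of the combined output being asked about can be sketched in plain Python: count k-mers per sample, build a shared k-mer vocabulary, and emit (row, column, value) triplets for a samples-by-k-mers matrix. This does not use khmer. The sample names, sequences, and function names are hypothetical, and exact counting with `collections.Counter` is only practical for small inputs.

```python
from collections import Counter


def count_kmers(seq, k):
    """Exact k-mer counts for one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def build_sparse_triplets(samples, k, min_count=1):
    """Return (rows, cols, values, vocab) for a samples x k-mers count matrix.

    K-mers seen fewer than `min_count` times in a sample are dropped
    from that sample (a simple stand-in for abundance filtering).
    """
    counts = []
    for seq in samples.values():
        c = count_kmers(seq, k)
        counts.append({km: n for km, n in c.items() if n >= min_count})

    # Shared, sorted vocabulary so column j means the same k-mer in every row.
    vocab = sorted({km for c in counts for km in c})
    col_index = {km: j for j, km in enumerate(vocab)}

    rows, cols, values = [], [], []
    for i, c in enumerate(counts):
        for km, n in c.items():
            rows.append(i)
            cols.append(col_index[km])
            values.append(n)
    return rows, cols, values, vocab


# Hypothetical toy samples; real input would be reads from each transcriptome.
samples = {"sample_a": "ACGTACGTAC", "sample_b": "ACGTTTTTAC"}
rows, cols, values, vocab = build_sparse_triplets(samples, k=4)
print(len(samples), "samples x", len(vocab), "k-mers,", len(values), "nonzero entries")
```

These triplets are the input form that `scipy.sparse.coo_matrix((values, (rows, cols)), shape=(n_samples, n_kmers))` accepts, so the result can be handed to scikit-learn as a sparse "X" matrix.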