kProcessor/kSpider/sourmash thoughts for downsampling/containment matrix

mr-eyes commented 2 years ago

Extending https://github.com/sourmash-bio/sourmash/issues/1750, I am copying a conversation between @ctb and @drtamermansour to be detailed later into tasks.

Slack Conversation

Tamer Mansour 8:38 AM **@titus @taylorreiter We can solve this problem VERY efficiently by using kProcessor/kSpider.** Titus Brown:speech_balloon: 8:39 AM **it’s not hard to do in sourmash, either, although I suspect kProcessor/kSpider has specialized lookup tables that make it even faster.** 8:39 **the issue is integrating it into the CLI.** Tamer Mansour 8:41 AM **kProcessor/kSpider can perform pairwise calculation of shared kmers for 20k genes in a couple GB space and few minutes** 8:42 **We can prepare the output in a format that sourmash can read** 8:44 **All what we need is to implement a simple parser for sourmash signature files** 8:45 **In the CLI, i think sourmash can call kProcessor/kSpider under the hood** Titus Brown:speech_balloon: 8:46 AM **so, two issues here -** **Rob is talking about (potentially) very large metagenomes, and the sourmash downsampling approach is probably quite important for his application, since metagenomes can be so much larger than transcriptomes. e.g. “a couple GB space” rapidly becomes 100s of GB.** **As Taylor experienced, sourmash isn’t doing a good job with sparse matrices, either, so if you have 20k x 20k queries you run out of memory regardless :)** **In this case, kProcessor/kSpider would be completely replacing sourmash, not producing output for it to read.** **I think what we want is something similar to what you suggested - a parser for kProcessor/kSpider to read sourmash sig files. (edited)** 8:46 **Integrating kProcessor into sourmash is not simple or easy, especially since we moved sourmash over to rust.** Tamer Mansour 8:50 AM **Sourmash does not need to integrate kProcessor. Just use it as a third party tool. Sourmash has to do the downsampling, prep the signature files, call kProcessor/kSpider module, and finally present the output through the sourmash visualization scripts (edited)** Titus Brown:speech_balloon: 8:50 AM **call kProcessor/kSpider module** 8:51 **if we did it that way, sourmash would now include kProcessor/kSpider as a dependency :slightly_smiling_face:** Tamer Mansour 8:51 AM yes 8:51 **it is a python package** 8:52 **This is what kProcessor is made for :slightly_smiling_face:** Titus Brown:speech_balloon: 8:54 AM **I don’t think that’s a good idea; sourmash is pretty strict about versioning and dependencies.** **It should be pretty straightforward to have kProcessor read sourmash sig files (it’s just k-mers and hashes!), do the comparison, and output a numpy matrix that can be read by sourmash plot. In this case I’m pretty sure Rob (and Taylor) don’t want to use the viz tools, anyway, which won’t scale to that number of samples.** 8:55 **If you put together a demo of the 20kx transcript query somewhere, we can suggest it, even without the downsampling.** 8:56 **If we had an extensions framework in sourmash, could do it that way, too.** 8:56 **But I don’t think we want sourmash to have kProcessor as a required dependency.** Tamer Mansour 8:57 AM **Sure we can prep a demo. But our current indexing can not make 20k whole datasets.** 8:57 **Mostafa is working on the new indexing algorithm to do so in the soon future** Titus Brown:speech_balloon: 8:58 AM **So this seems like a good use case for future development, but it’s not something we should suggest to Rob right now ;)** Tamer Mansour 8:59 AM **We can implement downsampling in kProcessor. That should be even easier than developing a parser for sourmash signatures** 9:00 **I can make something to try next week** Titus Brown:speech_balloon: 9:00 AM **yes, it’s easy, but now you have distinct code bases and you don’t get the advantage of all the sourmash signature manipulation utilities. If it’s simple to parse sourmash signatures (which it should be - they’re “just” JSON, plus we have a sourmash API to load it) then might as well add that too.** 9:01 **If nothing else, it’s a good way to test your downsampling code in kProcessor by running the same operations in kProcessor directly vs loading with sourmash.** 9:02 **(the downsampling code is now pretty simple in sourmash, but it took a while to get there, and we have a LOT of tests for it. so it’s pretty robustly tested. No reason to discard that.)** Tamer Mansour 9:03 AM **This is also a good solution** Titus Brown:speech_balloon: 9:03 AM **also check out the sourmash sketch documentation. Soooo maaaaany oppppppptions to implement. Ugh.** Tamer Mansour 9:05 AM **I was thinking of a very simple approach. Incorporating one kmer every 1000 while reading the input sequences. That is it :smile:** :+1: 1 9:05 **But using sourmash is a better idea** Titus Brown:speech_balloon: 9:06 AM **I see value in both, TBH. No reason to force people to jump through hoops either way, and as you say, the code is simple for scaled downsampling.** :+1: 1 Titus Brown:speech_balloon: 9:28 AM **One other thought - the seq-to-hashes stuff that @mr-eyes implemented for sourmash could be used directly by kProcessor to build scaled=1 dataframes in DNA space (which you already have working) as well as translated and protein queries. Again, ultimately you probably want this in kProcessor directly, but it’s a pretty simple call to sourmash to get the functionality working right now.**

ctb commented 2 years ago

note that sourmash is lower case 😁

ctb commented 2 years ago

Some sourmash code - haven't tested it, but it should mostly work :)

Load signatures from ...anything - a .sig file, a .zip file, a directory:

>>> loaded_sigs = sourmash.load_file_as_signatures(fpath)

You might want to select out only the scaled signatures, since num signatures are a different beast and can't really be used the way we would like:

>>> loaded_sigs = loaded_sigs.select(scaled=True)

Retrieve sketches:

>>> for ss in loaded_sigs:
...   mh = ss.minhash

Retrieve ksize and moltype and scaled/num from the sketches:

>>> ksize = mh.ksize
>>> moltype = mh.moltype # 'DNA', 'protein', 'dayhoff', 'hp'
>>> scaled = mh.scaled # if 0, this is a 'num' sketch

Get actual hashes:

>>> for hashval in mh.hashes:
...    print(hashval)

Retrieve abundances:

>>> for hashval, abund in mh.hashes.items()
...    print(hashval, abund)

🎉

drtamermansour commented 2 years ago

@ctb The same sourmash signature might have kmers of different sizes, is not it?

ctb commented 2 years ago

it's complicated but tl;dr the code above will work, because each signature will be made to have exactly one sketch.

(the signature creation and save code does things slightly differently; see https://github.com/sourmash-bio/sourmash/issues/1647 and https://github.com/sourmash-bio/sourmash/issues/616 esp), but the load code splits it out so it's one signature for one sketch, and each sketch has one ksize.)

you might also be getting confused because a single .sig file can contain many different signatures, as well as signatures with multiple sketches.

so, as I said, confusing. but the code above will work.

dib-lab / kProcessor

kProcessor/kSpider/sourmash thoughts for downsampling/containment matrix #90