Open olgabot opened 4 years ago
Here's a schematic of what I'm thinking of doing:
I want to be able to query with a hash, and get all reads containing that hash, then use those read IDs to query the bam. I think this is possible given the make_bgzf.py and overall spacegraphcats/utils/bgzf/ folder of tools.
But then will all the querying need to happen with SQLite as in label_cdbg.py? I'm afraid of SQL...
cc @ctb
This should be lightweight and straightforward if you are using downsampled hashes (either regular MinHash or scaled hashes, as in sourmash). Indexing all k-mers is hard; you might look at BLight (https://www.biorxiv.org/content/10.1101/546309v2). Happy to put you in touch with people in that group!
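To make the "scaled hash" idea concrete, here is a rough sketch of downsampling k-mers by keeping only hashes below a threshold of 2^64 / scaled. The function names (`kmer_hash`, `scaled_hashes`) are made up for illustration, and the hash function is a stand-in from Python's standard library, not the one sourmash uses. It also skips canonicalizing k-mers against their reverse complement, which real tools typically do.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space


def kmer_hash(kmer):
    """Stand-in 64-bit hash (assumption: NOT the hash sourmash uses)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_hashes(sequence, ksize, scaled):
    """Keep only k-mer hashes that fall below MAX_HASH / scaled.

    On average this retains about 1 in `scaled` distinct k-mers, and the
    same k-mer is always either kept or dropped, no matter which read
    it appears in.
    """
    threshold = MAX_HASH // scaled
    kept = set()
    for i in range(len(sequence) - ksize + 1):
        h = kmer_hash(sequence[i:i + ksize])
        if h < threshold:
            kept.add(h)
    return kept


# Hypothetical usage: hashes to index for one read.
read_hashes = scaled_hashes("ACGTACGTTTGACCA", ksize=4, scaled=2)
```

Because the keep/drop decision depends only on the hash value, a query hash that passed the threshold will be found in every read that contains that k-mer.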
I have been using sqlite for ages, because it's so blindingly fast that there's no hope of competing. See http://ivory.idyll.org/blog/storing-and-retrieving-sequences.html.
sqlite is also ridiculously robust and well tested, and very widely used, with interfaces in most languages. Well worth the time investment in my experience.
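For anyone else nervous about SQL, the hash-to-read lookup described at the top of this issue needs very little of it. Below is a minimal sketch using Python's built-in `sqlite3` module. The table name, column names, and helper functions are all hypothetical and are not taken from label_cdbg.py or spacegraphcats.

```python
import sqlite3


def build_index(conn, reads):
    """Store (hash, read_id) pairs; `reads` yields (read_id, hashes)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hash_to_read ("
        "hashval INTEGER NOT NULL, read_id TEXT NOT NULL)"
    )
    rows = ((h, read_id) for read_id, hashes in reads for h in set(hashes))
    conn.executemany(
        "INSERT INTO hash_to_read (hashval, read_id) VALUES (?, ?)", rows
    )
    # The index on hashval is what makes per-hash lookups fast.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_hashval ON hash_to_read (hashval)"
    )
    conn.commit()


def reads_with_hash(conn, hashval):
    """Return the sorted IDs of all reads containing `hashval`."""
    cur = conn.execute(
        "SELECT DISTINCT read_id FROM hash_to_read "
        "WHERE hashval = ? ORDER BY read_id",
        (hashval,),
    )
    return [row[0] for row in cur]


# Toy data with small made-up hash values.
conn = sqlite3.connect(":memory:")
build_index(conn, [("read1", [11, 42]), ("read2", [42, 99]), ("read3", [7])])
hits = reads_with_hash(conn, 42)
```

The returned read IDs could then be used to pull the matching records out of the bam. One caveat: SQLite integers are signed 64-bit, so unsigned 64-bit hash values above 2^63 - 1 would need to be converted (or stored as text or blobs) before inserting.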
Use tools from spacegraphcats to do the indexing https://github.com/spacegraphcats/spacegraphcats/issues/273