czbiohub-sf / nf-predictorthologs

*de novo* orthologous gene predictions from bam + bed or fasta/fastq data
MIT License

Index fasta files by containing hashes, and bam files by containing read ids #50

Open olgabot opened 4 years ago

olgabot commented 4 years ago

Use tools from spacegraphcats to do the indexing: https://github.com/spacegraphcats/spacegraphcats/issues/273

olgabot commented 4 years ago

Here's a schematic of what I'm thinking of doing:

[Image: Screen Shot 2020-05-14 at 11 50 37 AM]

I want to be able to query with a hash and get all reads containing that hash, then use those read IDs to query the bam. I think this is possible given make_bgzf.py and the rest of the tools in the spacegraphcats/utils/bgzf/ folder.
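A rough sketch of that lookup flow, in plain Python. Everything here is an assumption for illustration: the `kmer_hash` helper uses `hashlib.blake2b` as a stand-in hash function (not whatever hash the pipeline actually uses), the reads are toy strings, and the "bam" is just a list of `(read_id, location)` tuples rather than a real BAM file.

```python
import hashlib
from collections import defaultdict


def kmer_hash(kmer: str) -> int:
    """Hash a k-mer to a 64-bit integer (stand-in hash, chosen for illustration)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_hash_index(reads, ksize):
    """Map each k-mer hash to the set of read IDs whose sequence contains it."""
    index = defaultdict(set)
    for read_id, seq in reads:
        for i in range(len(seq) - ksize + 1):
            index[kmer_hash(seq[i:i + ksize])].add(read_id)
    return index


def reads_for_hash(index, hashval):
    """Return the read IDs containing this hash (empty set if none)."""
    return index.get(hashval, set())


# Toy reads: (read_id, sequence)
reads = [("read1", "ACGTACGT"), ("read2", "TTTTACGT"), ("read3", "GGGGGGGG")]
index = build_hash_index(reads, ksize=4)

# Step 1: hash -> read IDs
read_ids = reads_for_hash(index, kmer_hash("ACGT"))

# Step 2: read IDs -> alignment records (toy stand-in for querying a bam)
alignments = [("read1", "chr1:100"), ("read2", "chr1:250"), ("read3", "chr2:10")]
hits = [record for record in alignments if record[0] in read_ids]
```

With a real bam, the second step would need an index keyed by read ID, since bam files are normally indexed by coordinate rather than by read name.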

But then will all the querying need to happen with SQLite as in label_cdbg.py? I'm afraid of SQL...

cc @ctb

ctb commented 4 years ago

This should be lightweight and straightforward if you are using downsampled hashes (either regular MinHash or scaled hashes, as in sourmash). "All k-mers" is hard; you might look at BLight (https://www.biorxiv.org/content/10.1101/546309v2). Happy to put you in touch with people in that group!
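To make the "scaled hash" idea concrete, here is a minimal sketch: keep only the k-mer hashes that fall below a threshold of `2**64 / scaled`, so roughly one in `scaled` hashes survives. This is an illustration of the downsampling idea only, not sourmash's code; `kmer_hash` uses `hashlib.blake2b` as a stand-in hash, and the function names are made up for this example.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space


def kmer_hash(kmer: str) -> int:
    """Hash a k-mer to a 64-bit integer (stand-in hash, chosen for illustration)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def all_hashes(seq: str, ksize: int) -> set:
    """Hashes of every k-mer in the sequence."""
    return {kmer_hash(seq[i:i + ksize]) for i in range(len(seq) - ksize + 1)}


def scaled_hashes(seq: str, ksize: int, scaled: int) -> set:
    """Keep only hashes below MAX_HASH // scaled (about 1/scaled of all hashes)."""
    threshold = MAX_HASH // scaled
    return {h for h in all_hashes(seq, ksize) if h < threshold}


seq = "ACGTTGCATGCCGATAGCTAGGCTAACGT"
everything = scaled_hashes(seq, ksize=4, scaled=1)   # threshold covers the whole space
subsample = scaled_hashes(seq, ksize=4, scaled=10)   # keeps roughly 1 in 10 hashes
```

Because the threshold depends only on the hash value, the same k-mer is kept or dropped consistently across every read and every sample, which is what keeps a hash-to-read index small while still being comparable between datasets.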

I have been using sqlite for ages, because it's so blindingly fast that there's no hope of competing. See http://ivory.idyll.org/blog/storing-and-retrieving-sequences.html.

sqlite is also ridiculously robust and well tested, and very widely used, with interfaces in most languages. Well worth the time investment in my experience.
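For a sense of how little SQL a hash-to-read lookup needs, here is a minimal sketch using Python's built-in `sqlite3` module. The table name, column names, and hash values are all made up for this example and are not taken from spacegraphcats or label_cdbg.py.

```python
import sqlite3

# In-memory database for illustration; pass a file path to persist it on disk.
conn = sqlite3.connect(":memory:")

# One row per (hash, read) pair. Note: SQLite integers are signed 64-bit, so
# full unsigned 64-bit hash values would need converting before storage.
conn.execute(
    "CREATE TABLE hash_to_read (hashval INTEGER NOT NULL, read_id TEXT NOT NULL)"
)

# Toy data: hypothetical hash values and read IDs.
rows = [(101, "read1"), (101, "read2"), (202, "read2"), (303, "read3")]
conn.executemany("INSERT INTO hash_to_read VALUES (?, ?)", rows)

# The index on hashval is what makes lookups by hash fast.
conn.execute("CREATE INDEX idx_hashval ON hash_to_read (hashval)")
conn.commit()


def read_ids_for_hash(conn, hashval):
    """Return the sorted read IDs that contain the given hash."""
    cursor = conn.execute(
        "SELECT read_id FROM hash_to_read WHERE hashval = ? ORDER BY read_id",
        (hashval,),
    )
    return [row[0] for row in cursor]


matching = read_ids_for_hash(conn, 101)
```

The whole query side is one `SELECT` with a `?` placeholder, so most of the work is in deciding the table layout rather than in writing SQL.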