DecodeGenetics / Ratatosk

Hybrid error correction of long reads using colored de Bruijn graphs
BSD 2-Clause "Simplified" License

Add explicit sequence anchoring subcommand #3

Closed phiweger closed 4 years ago

phiweger commented 4 years ago

It would be really cool if you could add a subcommand that simply returns the path a genome or long read takes through the graph (as a sort of graph alignment heuristic). Is that something you could integrate?

GuillaumeHolley commented 4 years ago

Hi @phiweger

It is possible to integrate, but it would take some time and we have a couple of ideas that we would like to prioritize for future versions. Do you have a specific use case in mind? A simple thing to do is to use Bifrost to build the graph from the short reads and query the graph with the corrected long reads. That would already tell you which k-mers of the corrected long reads are in the graph and which are not. Using the Bifrost API to pinpoint exactly in which unitigs these k-mers occur wouldn't be too much additional trouble.
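
To make the idea of anchoring k-mers to unitigs concrete, here is a minimal sketch in plain Python. It does not use the Bifrost API: a dictionary stands in for the graph, and the names `build_index` and `anchor_read` are hypothetical helpers invented for this illustration. It assumes sequences contain only A, C, G and T, and that each canonical k-mer occurs in a single unitig, as in a compacted de Bruijn graph.

```python
# Toy illustration only: a dict stands in for the de Bruijn graph.
# None of these names come from Bifrost; they are hypothetical helpers.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmers(seq, k):
    """Yield (position, canonical k-mer) for every k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield i, canonical(seq[i:i + k])


def build_index(unitigs, k):
    """Map each canonical k-mer to the (unitig id, offset) where it first occurs."""
    index = {}
    for uid, unitig in enumerate(unitigs):
        for pos, km in kmers(unitig, k):
            index.setdefault(km, (uid, pos))
    return index


def anchor_read(read, index, k):
    """Return (read position, unitig id, unitig offset) for each k-mer found in the index."""
    return [(i, *index[km]) for i, km in kmers(read, k) if km in index]
```

Reading the unitig ids of the returned anchors in order of read position gives a rough picture of the path the read takes through the graph; k-mers absent from the index simply produce no anchor.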

phiweger commented 4 years ago

I want to implement k-mer indexing using Bifrost. The idea (Dilthey et al.) is to sequence barcoded isolates using Illumina, build the assembly graph for each isolate, and then bin unbarcoded Nanopore reads from the same isolates to their respective barcode by "aligning" them to one of those graphs (this saves ligase, which is an important cost factor in the Nanopore protocol).

How would I use Bifrost to query the k-mers? I.e., if I use "bifrost query", it tries to find all --ratio kmers. So should I decompose each read into its k-mers (put them in a fasta file) and then search for them?

Thanks for your help!

GuillaumeHolley commented 4 years ago

Very interesting! It seems to me you would need to build the colored de Bruijn graph of the barcoded Illumina isolates using Bifrost, with one color for each barcode/isolate. By using bifrost query with a reasonable k-mer ratio, the tool will already tell you, for each ONT read, which color (barcode/isolate) your read "aligns" to. Since you would run this on uncorrected ONT reads, I would even advise using the inexact k-mer querying mode of Bifrost. It would increase the number of anchors between your ONT reads and the graph (at the cost of a longer running time) for a more accurate result. There is no need to decompose your ONT reads into k-mers: just submit your fasta/fastq file as a query file and Bifrost will take care of the decomposition, unless you want to know specifically which k-mer maps to the graph and where. If that is the case, I would advise digging into the Bifrost API a little to make your own tool; it wouldn't take much code. Even with pre-decomposed k-mer queries as input, bifrost query will not tell you where your k-mers map in the graph, while the API can do that rather easily.
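
The binning idea can be sketched in a few lines of plain Python. This is only a toy model of the principle, not Bifrost itself: a dictionary from k-mer to a set of barcodes stands in for the colored graph, `build_colored_index` and `bin_read` are hypothetical names, the `min_ratio` value is an assumption chosen for the example, and only exact k-mer matches are counted (the inexact mode is not modeled). Picking the single best barcode is a choice made here for binning.

```python
from collections import Counter

# Toy illustration only: a dict of k-mer -> set of barcodes stands in for
# a colored de Bruijn graph. These helpers are hypothetical, not Bifrost calls.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Return the canonical form of every k-mer of seq, in order."""
    out = []
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        rc = km.translate(COMPLEMENT)[::-1]
        out.append(min(km, rc))
    return out


def build_colored_index(isolates, k):
    """isolates maps barcode -> list of sequences; returns k-mer -> set of barcodes."""
    index = {}
    for barcode, seqs in isolates.items():
        for seq in seqs:
            for km in canonical_kmers(seq, k):
                index.setdefault(km, set()).add(barcode)
    return index


def bin_read(read, index, k, min_ratio=0.8):
    """Return the barcode sharing the most k-mers with the read, or None
    if the read is too short or the best ratio is below min_ratio (assumed value)."""
    kms = canonical_kmers(read, k)
    if not kms:
        return None
    hits = Counter()
    for km in kms:
        for barcode in index.get(km, ()):
            hits[barcode] += 1
    if not hits:
        return None
    barcode, count = hits.most_common(1)[0]
    return barcode if count / len(kms) >= min_ratio else None
```

With noisy reads, few k-mers survive intact, so a lower `min_ratio` (or inexact matching, as suggested above) would be needed in practice.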

phiweger commented 4 years ago

thanks a lot, that gets me started. w/ API you mean the C++ one? I read in some issue that there are (unofficial) Python bindings?

GuillaumeHolley commented 4 years ago

By API, I do mean the C++ one. Lucas van Dijk has been doing fantastic work on the Python bindings for Bifrost (https://github.com/broadinstitute/pyfrost), but I think the project is still in its early stages. I'm not sure how ready it is or how many of the Bifrost functionalities are available, so if Python is essential here, better ask Lucas himself before getting started. Otherwise, I would be happy to answer any questions or provide tips regarding the C++ API.

phiweger commented 4 years ago

ok thanks again! I tried pyfrost but there are still some issues concerning installation and basic operations so for now I'll try the C++ API :)