abi-pangenomics / FindFRs

GNU General Public License v3.0

Would it be possible to make this compatible with recent tools? #2

Open rickbeeloo opened 3 years ago

rickbeeloo commented 3 years ago

First of all, nice tool!

Little background

We are interested in extracting shared sequences from a huge collection of genomes, so older tools such as cDBG and SplitMEM are no longer fast enough. We currently use the latest version of TwoPaco. The same authors developed algorithms similar to FindFRs (Sibelia and the newer SibeliaZ), but, as you mentioned in your paper, these do not allow for insertions. This is troublesome because the output then contains patterns such as block > 200nt > block > 200nt > block. Ideally these blocks would be merged based on a user-provided parameter, as FindFRs does with kappa! Moreover, FindFRs seems way faster, which would be a huge plus too!
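To make the block > 200nt > block case concrete, here is a minimal sketch of the kind of merging meant above. It works on plain coordinates of one genome and is only a hypothetical post-processing step; it is not how FindFRs applies kappa internally. The names `merge_blocks` and `Block` are made up for this illustration.

```python
from typing import List, Tuple

# (start, end) of a block on a single genome, end exclusive (assumed convention)
Block = Tuple[int, int]


def merge_blocks(blocks: List[Block], kappa: int) -> List[Block]:
    """Merge consecutive blocks whose gap is at most `kappa` nucleotides."""
    merged: List[Block] = []
    for start, end in sorted(blocks):
        if merged and start - merged[-1][1] <= kappa:
            # The gap to the previous block is small enough: extend that block.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # First block, or the gap is too large: start a new block.
            merged.append((start, end))
    return merged


blocks = [(0, 1000), (1200, 2000), (2200, 3000), (9000, 9500)]
print(merge_blocks(blocks, kappa=250))  # [(0, 3000), (9000, 9500)]
```

With kappa=250 the two 200nt gaps are bridged, while the distant last block stays separate.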

Why it does not work now

The dot file produced by TwoPaco is substantially different from that of cDBG. For example, this is what it looks like:

{
    rankdir = LR
    23436394 -> 8202625[color="blue", label="chr=0 pos=0"]
    -8202625 -> -23436394[color="red", label="chr=0 pos=0"]
    8202625 -> -12004346[color="blue", label="chr=0 pos=27"]
    12004346 -> -8202625[color="red", label="chr=0 pos=27"]
    -12004346 -> -4353802[color="blue", label="chr=0 pos=128"]
    4353802 -> 12004346[color="red", label="chr=0 pos=128"]
    -4353802 -> 8202625[color="blue", label="chr=0 pos=137"]
.....

Here:

Note that the graph is a union of graphs built from both strands, with blue edges coming from the main strand and red ones from reverse one. The labels of the edges will indicate its position on a chromosome.

This has the advantage of encoding both strands in the same .dot file, making --rc obsolete. However, I wonder whether the current algorithm can be altered to handle this. Alternatively, do you have an idea for converting the above .dot file to one suitable for FindFRs?
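As a starting point for such a conversion, here is a rough sketch that reads the edge lines shown above and recovers, per chromosome, the walk of signed node ids along the main strand (blue edges only). It assumes every edge line has exactly the layout of the excerpt and that a negative id denotes the reverse orientation of a node. The helper names `forward_edges` and `node_walk` are hypothetical, and writing the result out in the format FindFRs expects is not covered.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Matches lines such as:  8202625 -> -12004346[color="blue", label="chr=0 pos=27"]
EDGE_RE = re.compile(
    r'^\s*(-?\d+)\s*->\s*(-?\d+)\s*\[color="(\w+)",\s*label="chr=(\d+) pos=(\d+)"\]'
)

Edge = Tuple[int, int, int]  # (pos, source node, target node)


def forward_edges(dot_text: str) -> Dict[int, List[Edge]]:
    """Collect the blue (main strand) edges per chromosome, sorted by position."""
    edges: Dict[int, List[Edge]] = defaultdict(list)
    for line in dot_text.splitlines():
        match = EDGE_RE.match(line)
        if match is None:
            continue  # skip "{", "rankdir = LR" and anything else that is not an edge
        src, dst, color, chrom, pos = match.groups()
        if color == "blue":
            edges[int(chrom)].append((int(pos), int(src), int(dst)))
    for chrom_edges in edges.values():
        chrom_edges.sort()
    return dict(edges)


def node_walk(chrom_edges: List[Edge]) -> List[int]:
    """Signed node walk: the first source node followed by every target node."""
    if not chrom_edges:
        return []
    return [chrom_edges[0][1]] + [dst for _, _, dst in chrom_edges]


sample = '''
    23436394 -> 8202625[color="blue", label="chr=0 pos=0"]
    -8202625 -> -23436394[color="red", label="chr=0 pos=0"]
    8202625 -> -12004346[color="blue", label="chr=0 pos=27"]
    12004346 -> -8202625[color="red", label="chr=0 pos=27"]
    -12004346 -> -4353802[color="blue", label="chr=0 pos=128"]
    4353802 -> 12004346[color="red", label="chr=0 pos=128"]
    -4353802 -> 8202625[color="blue", label="chr=0 pos=137"]
'''
for chrom, chrom_edges in forward_edges(sample).items():
    print(chrom, node_walk(chrom_edges))
```

Only the blue edges are used because, per the note quoted above, the red edges come from the reverse strand, so skipping them avoids counting each step twice.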

Thanks!

rickbeeloo commented 3 years ago

Any maintenance here?

bmumey commented 3 years ago

Hi Rick,

Sorry for the slow response. We are planning to make some updates to FindFRs; I think we could make it work with the TwoPaco output. I am going to send this now as I am not sure if this email will go through.

Brendan Mumey

On Sunday, October 11, 2020 at 6:39 AM, rickbeeloo wrote: [original post, quoted in full]

rickbeeloo commented 3 years ago

It came through :)

I saw that in the SibeliaZ paper they cited your paper saying:

"SibeliaZ-LCB is based on the analysis of the compacted de Bruijn graph and uses a graph model of collinear blocks similar to the “most frequent paths” introduced by Cleary et al. (2017). "

However, SibeliaZ does not have an insertion parameter, nor does its output seem to account for insertions. So I think it would be awesome to update this tool!

Depending on what you want, it may in fact be easier and more useful to also accept GFA files (instead of the dot and FASTA files), as all "recent" tools, such as BiFrost, TwoPaco, MiniGraph, and SeqWish, adhere to GFA2.

Did you already start working on the update, or do you have any idea when you will?

bmumey commented 3 years ago

We haven't started yet. We are gathering ideas and plan to include this in a new NSF proposal. It's possible we will try to get some changes done fairly soon (in the next month or two), as we'd like some preliminary results for the proposal. I think accepting GFA makes sense; that should not be hard to add, although it may also affect the output, since we need to map the FR paths back to FASTA coordinates.

rickbeeloo commented 3 years ago

The GFA mapping is not necessary in all cases. For example, TwoPaco encodes this as segments (S) in the GFA, and for VGtools it can be obtained as JSON from the GFA (via xg). In those cases you already know the location of each node in all the input genomes directly from the GFA. While this is a plus, it is certainly troublesome to parse, as not all tools record the positional information the same way and some do not record it at all (e.g. Bifrost). Hence, a .dot file may indeed be more suitable.
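To illustrate what "knowing the location of each node from the GFA" could look like, here is a minimal sketch for a GFA1 file that carries both S (segment) and P (path) records. Which records a given tool actually writes is exactly the problem described above, so treat the input layout as an assumption. The function names, the toy input, and the fixed `overlap` between consecutive segments are all hypothetical, and segments stored as `*` (sequence omitted) are not handled.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str]  # (segment name, orientation "+" or "-")


def read_gfa1(text: str) -> Tuple[Dict[str, str], Dict[str, List[Step]]]:
    """Read S and P records of a GFA1 file; all other record types are ignored."""
    segments: Dict[str, str] = {}
    paths: Dict[str, List[Step]] = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments[fields[1]] = fields[2]  # S <name> <sequence>
        elif fields[0] == "P" and len(fields) >= 3:
            # P <name> <seg1+,seg2-,...> <overlaps>
            paths[fields[1]] = [(s[:-1], s[-1]) for s in fields[2].split(",") if s]
    return segments, paths


def path_offsets(segments: Dict[str, str], steps: List[Step],
                 overlap: int = 0) -> List[Tuple[str, str, int]]:
    """Start offset of every step on its path, assuming a fixed segment overlap."""
    offsets = []
    pos = 0
    for name, orient in steps:
        offsets.append((name, orient, pos))
        pos += len(segments[name]) - overlap
    return offsets


# Toy input, invented for this example.
gfa = "S\t1\tACGTA\nS\t2\tTACCG\nS\t3\tCGGTT\nP\tgenomeA\t1+,2+,3-\t*\n"
segments, paths = read_gfa1(gfa)
print(path_offsets(segments, paths["genomeA"], overlap=2))
```

For graphs that only provide S and L records, positions would have to come from somewhere else, which is why a .dot file with explicit chr/pos labels can be easier to work with.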

Divergence

SibeliaZ specifically mentions that it only works well for closely related genomes. We tried SibeliaZ on genomes from different genera and noticed that, even though some genomes shared identical sequences, these were split into different blocks in the output. The significant decrease in performance over evolutionary distance is clear from Figure 3 of their paper.

[image: Figure 3 of the SibeliaZ paper]

In your paper you mentioned finding more FRs than Sibelia. Based on this figure Sibelia performs well; however, as you also wrote, it is quite slow. So if FindFRs can match (or exceed) Sibelia's accuracy at the same (or better) speed as SibeliaZ, this would be an awesome preliminary result to show you are ahead of the state of the art.

Will try some things

I will check whether I can tinker with the ReadInput class so that it reads the TwoPaco dot or GFA file, which would let me directly compare SibeliaZ and FindFRs on the same input graph. Since I'm not a dedicated Java programmer this may not go that well, so I'm looking forward to your updates :)

rickbeeloo commented 3 years ago

Any update on this?

bmumey commented 3 years ago

Progress should be happening relatively soon - I have a student who is working on it now.

rickbeeloo commented 3 years ago

Hey @bmumey! Any news on how this is going? :)