fawaz-dabbaghieh / bubble_gun

MIT License
34 stars 2 forks source link

Missing rows in the vcf file obtained using the BubbleGun gfa output file #3

Closed diaspj closed 6 months ago

diaspj commented 1 year ago

Dear BubbleGun team,

I have used your algorithm to determine the number of bubbles and superbubbles in a pangenome graph of a LAB bacteria (the first approach mentioned bellow) and the classic algorithm combining odgi depth / odgi extract with vg deconstruct (the second approach mentioned bellow).

FIRST APPROACH

The command I have used to calculate the bubles and obtained an output file is the following:

BubbleGun -g Lactococcus_lactis_PANGENOME_Anchorwave.gfa bchains --bubble_json output.json

After the conclusion of BubbleGun computation, the output that I have obtained at the terminal was the following:

The number of Simple Bubbles is 3927 The number of Superbubbles is 72 The number of insertions is 534 Sequence coverage of the bubble chains is 3.1945646478383996% Node coverage of the bubble chains is 2.037636626829558% The longest chain seq-wise has 6220 bp The longest chain bubble_wise has 937 bubbles

Hence, the total number of bubbles detected by BubbleGun is:

3927 + 72 + 534 = 4533

Because the software does not make available a VCF file corresponding to the information comprised in the .json file, I have used the gfa output file provided by BubbleGun (under the assumption this file held the same bubble information that the json file held).

With the purpose of using the vg deconstruct command, I have “copy paste” all “nodes” and “paths” comprised in the original input gfa file (including those that were missing from the BubbleGun gfa output file) to the output gfa file, keeping only the “links” determined by BubbleGun software suite. Finally, I used the aforementioned command “vg deconstruct” to obtain the VCF file.

However, the vg command originated a VCF file comprising only 4218 rows…

Considering that each VCF row comprises one genomic variation, each corresponding to one bubble in the pangenome graph, it seems that 315 bubbles (4533 – 4218) are missing from the VCF file.

SECOND APPROACH

Using a second approach based on the odgi software suite (described at https://github.com/pangenome/odgi/issues/381), I have combined the “odgi depth” and “odgi extract” commands to remove the high complexity regions of the graph, as described in the following lines:

odgi depth -t 20 -i Lactococcus_lactis_PANGENOME_Anchorwave.og -w 1000:0:100:0 > low_to_medium_depth_MAX=100.bed

odgi extract -t 20 -d 0 -i Lactococcus_lactis_PANGENOME_Anchorwave.og -b low_to_medium_depth_MAX=100.bed -o Lactococcus_lactis_PANGENOMEAnchorwave low_to_medium_depth_MAX=100.pruned.og

odgi view -t 20 -g -i Lactococcus_lactis_PANGENOMEAnchorwave low_to_medium_depth_MAX=100.pruned.og > Lactococcus_lactis_PANGENOMEAnchorwave low_to_medium_depth_MAX=100.pruned.gfa

Then, using the command “vg index”, I have obtained the .xg index file, using the command “vg gbwt”, I have constructed the .gbwt index file.

Finally, I have used the command “vg deconstruct” combined with the .xg and the .gbwt index files to obtain the corresponding .vcf file.

At this point in time, the resulting .vcf file “pruned” using the .bed file obtained at a depth of 100 comprised 4531 rows, only missing two bubbles that were predicted by BubbleGun to be the total number of variations comprised in the pangenome graph (total = 4533).

These two bubbles predicted by BubbleGun that were missing in the approach combining “odgi depth” / “odgi extract” with “vg deconstruct” correspond to the following two rows in the VCF output file:

Lactococcus_lactis_subsp._lactis_275_CP016702.1 49453 <254259<254257 TA AC 60 . AC=0;AF=0;AN=0;AT=>254257>254258>254259,>254257<637681>254259;NS=0;LV=0 GT

Lactococcus_lactis_subsp._lactis_275_CP016702.1 49465 <254268<254266 TTA AAC 60 . AC=0;AF=0;AN=0;AT=>254266>254267>254268,>254266<426258>254268;NS=0;LV=0 GT

OVERALL CONCLUSION:

Taken into consideration the results obtained using the first approach (based on BubbleGun) and the second approach (based on odgi) and combining them into a single vcf file, I was able to recover a .vcf file comprising all variations predicted by BubbleGun software suite.

The question I want to pose to this forum is the following:

Excluding the obvious answer of using the command “vg deconstruct” directly over the original pangenome graph (I have tried this approach, and after more than 3 weeks, this command was still running), is there a more straightforward set of commands based on BubbleGun and complementary vg and odgi toolkits that can be used to obtain a complete .vcf file without having to combine two .vcf files and in a reasonable time frame?

Any help on this issue is welcome,

Best regards,

Paulo Dias

fawaz-dabbaghieh commented 1 year ago

Dear Paulo,

I am not sure if I completely understand your question/problem as I am not very familiar with all the commands used with the other tools, or the graph you're working with.

First of all, BubbleGun only detects bubbles and superbubbles as topological structures in graphs, outputting a VCF means there's some kind of coordinate system in the graph (e.g. a reference path with coordinates to decide which branch is REF and which branch is ALT), or some kind of criteria that decides which branch in the bubble is REF and which is ALT. Depending on what you're trying to achieve, some information can be easily extracted from BubbleGun output. For example, you can use BubbleGun internal functions in your own Python script to output certain information you want, or you can decompose each bubble into two sequences using --fasta from bchains subcommand.

I am not sure what you had to "copy paste" to make things work, when you ask BubbleGun to output the bubble chains as a separate GFA file, then it only outputs those bubble chains, naturally, the rest of the graph will be gone, but then you said you had to copy and paste back the missing nodes, so I am not sure I understood why.

Second, for the two bubbles that you were not sure about, you can also use BubbleGun's bfs subcommand to extract a small part of the graph and visualize with Bandage, for example, use bfs subcommand and with --start to give two node IDs corresponding to nodes for those two bubbles and choose a neighborhood size of 10, for example, to extract 10 nodes around those starting nodes, visualize the result in Bandage to see if these two missing VCF entries are bubbles or not, of course, BubbleGun might have a bug, but with all my testings, I haven't had on occasion where BubbleGun reported a false positive.

If the VCF file you want to get at the end doesn't care about a REF and ALT information, i.e. you only want to decompose simple bubbles into some VCF entry where it says one branch has one sequence and another branch has another, then the easiest way is to use both the JSON output and BubbleGun Graph class to output this information, we can definitely discuss this further and i can help you in writing a such script or we can talk personally and I can show you how to use BubbleGun's internal functions :)

diaspj commented 1 year ago

Dear Fawaz,

The VCF file that is outputted using the command "vg deconstruct" (a command made available by the vg framework that is able to convert bubbles and superbubbles into a VCF file) is the result of an all-against-all pairwise comparison between the 33 genome sequences of a LAB bacteria species that I am presently analyzing.

So, in the resulting VCF file, there will be present REF and ALT information for all pairwise genome comparisons, where at iteration i=1 one genome is selected as REF and the other 32 genome sequences will be the ALT, then then the following genome i=2 is selected as REF and the other 32 genome sequences will be the ALT and so and so until all genomes have been REF and i reaches 33.

My main question is whether the gfa file generated by BubbleGun comprises an equivalent information to the one comprised in the json file generated by BubbleGun?

In addition, I would like to know if you agree that the link "L" information generated by BubbleGun comprises the central information necessary to detect all bubbles and superbubbles in the graph topological structure.

In case this assumption is correct and the above question has a positive answer, I should be able to replace the segment "S" and paths "P" information in the BubbleGun gfa file by the one comprised in the original gfa file, that the number of bubbles and superbubbles should not change in the final output of BubbleGun upon the use of the command "vg deconstruct"

The total number of bubbles detected by BubbleGun should be 3927 + 72 + 534 = 4533

However, when I perform the manipulation mentioned above of the BubbleGun gfa file (keeping only the link "L" information in the BubbleGun file and adding the segment "S" and paths "P" information comprised in the original gfa file) and run the command "vg deconstruct", the resulting VCF file contains 4218 rows.

Conclusion: 315 bubbles (4533 – 4218) are missing.

Paulo Dias

PS: I was assuming that you knew and used the vg, odgi, seqwith, etc graph frameworks, maybe we should talk personally (maybe through a zoom meeting) so I can explain you in detail the problems I have encountered when using your software and the manipulations I have introduced to the BubbleGun gfa file.

fawaz-dabbaghieh commented 1 year ago

Dear Paulo,

To answer the first two questions, it's yes and yes, i.e. the GFA output of BubbleGun should only contain the bubbles and superbubbles, or just simple bubbles if you use --only_simple or only superbubbles if you use --only_super. This GFA output should correspond to the JSON output as well. Of course, bugs are always a possibility, but I cannot determine this without having looked at the data.

I still have some unanswered questions, but I'd be happy to meet and assist if I can, please email me at fawaz@hhu.de and we can set up a time to talk online.

Best, Fawaz