DecodeGenetics / svimmer

Structural variant merging tool

Support for .csi indexed vcfs #2

Closed MaximilianStammnitz closed 4 years ago

MaximilianStammnitz commented 4 years ago

Hi Hannes,

I am trying to merge SV calls from marsupial chromosomes. Some of these are >> 512 Mb in length, and hence need to be indexed via tabix -C -p vcf. This doesn't create a .tbi index, but a .csi one instead (a bit more explanation here).

However, svimmer currently relies on .tbi inputs only - could you possibly add .csi support? Happy to provide you with log files & tests if this helps.

Many thanks, Max
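The 512 Mb limit mentioned above comes from the .tbi binning scheme, which addresses coordinates up to 2^29 bp. As a minimal sketch, the check below flags contigs that are too long for a .tbi index and therefore need .csi. The function name and the contig lengths are hypothetical and for illustration only; they are not taken from svimmer or from the reporter's data.

```python
# Largest coordinate range a .tbi (tabix) index can address: 2**29 bp = 512 Mbp.
TBI_MAX_POS = 2 ** 29


def contigs_needing_csi(contig_lengths):
    """Return the names of contigs that are too long for a .tbi index.

    contig_lengths maps contig name -> length in bp (hypothetical input).
    """
    return [name for name, length in contig_lengths.items() if length > TBI_MAX_POS]


# Made-up contig lengths, for illustration only.
example = {"chr1": 716_000_000, "chr2": 480_000_000, "chrX": 83_000_000}
print(contigs_needing_csi(example))
```

Any contig reported by such a check would have to be indexed with tabix -C rather than the default .tbi mode.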

hannespetur commented 4 years ago

Hello and thanks Max,

svimmer parses VCFs using a library called pysam, which has had csi support since version 0.14: https://github.com/pysam-developers/pysam/commit/a8304363b61723b8067df5e2d460c0db96dbb326

So I think I only need to add a few lines of code in svimmer to detect the presence of a csi index. I will make a pull request for it soon.

Best, Hannes
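Detecting which index accompanies a VCF, as described above, could look roughly like the sketch below. This is not svimmer's actual code: the helper name find_vcf_index and the preference for .tbi over .csi are assumptions made for illustration.

```python
import os


def find_vcf_index(vcf_path):
    """Return the path of the first index found next to vcf_path, or None.

    Hypothetical helper: checks for <vcf>.tbi first, then <vcf>.csi.
    """
    for suffix in (".tbi", ".csi"):
        candidate = vcf_path + suffix
        if os.path.exists(candidate):
            return candidate
    return None
```

The returned path could then be handed to the VCF reader; pysam.TabixFile, for instance, accepts an index argument for this purpose.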

MaximilianStammnitz commented 4 years ago

Many thanks for your quick reply, @hannespetur - looking forward to testing this.

Later, I'd also be keen to genotype these SVs on >>512 Mb chromosomes via Graphtyper. Wondering if Graphtyper also strictly relies on .tbi? Guess the input fix to both would be quite a similar one, but I'm happy to open a separate issue down the road.

Best wishes, Max

hannespetur commented 4 years ago

You are welcome. csi indices work for me on the feature_csi branch on a very small test. It would be great if you could check out that branch and test it on your file.

Unfortunately, there is no csi index support in graphtyper. I will look into adding it, but I think it will be a bit tougher since the library I am using for VCF reading doesn't support it. There is also a limitation in graphtyper that the total genome size cannot exceed 4 billion bp (genome positions need to fit in 32-bit integers), which is perhaps also a problem for your case. It is good to know there is interest in these features.

Best, Hannes
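The 32-bit constraint described above can be checked ahead of time by summing the contig lengths. The sketch below assumes positions are stored as unsigned 32-bit integers, which matches the "4 billion bp" figure but is not confirmed in the thread; the function name and the example lengths are hypothetical.

```python
# Largest value an unsigned 32-bit integer can hold (assumed position type).
UINT32_MAX = 2 ** 32 - 1  # 4,294,967,295


def genome_fits_in_uint32(contig_lengths):
    """Return True if the summed contig lengths fit in an unsigned 32-bit integer."""
    return sum(contig_lengths) <= UINT32_MAX


# Made-up lengths: a few very large chromosomes whose total stays under 4 Gb.
lengths = [716_000_000, 650_000_000, 600_000_000, 480_000_000, 300_000_000, 83_000_000]
print(genome_fits_in_uint32(lengths))
```

A genome can pass this check and still need .csi, since the per-contig 512 Mb limit of .tbi is independent of the total size.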

MaximilianStammnitz commented 4 years ago

Hi Hannes,

Just checked your dev branch: svimmer's .csi support now also works smoothly for my examples, well done and thanks for the quick processing! 👍

... a bit unlucky with regard to the incompatible VCF library in graphtyper. I've just tested this: indeed, SVs are genotyped well for our chromosome sets - as long as none of the breakpoints lie beyond the 512 Mb mark. The overall size of most marsupial genomes is still comparable to human and < 4 Gb; however, they only have ~ 6-10 (very large) chromosome pairs. So .csi support would still be very helpful in this case.

While .csi can't be supported yet, do you have a best practice recommendation for SV genotyping besides graphtyper? (with original calls made by Manta)

Many thanks, Max

hannespetur commented 4 years ago

Okay, thanks for testing it and for the info. No, sorry, I don't have any particular recommendation.

Best, Hannes

MaximilianStammnitz commented 4 years ago

No worries, hoping to get the full VCFs into Graphtyper via .csi support soon - genotyping results on SVs in the < 512Mb ranges look very promising on our end; I will open a separate issue for this.