DecodeGenetics / graphtyper

Population-scale genotyping using pangenome graphs
http://dx.doi.org/10.1038/ng.3964
MIT License

Using a SV vcf from other callers than Manta #63

Open clairemerot opened 4 years ago

clairemerot commented 4 years ago

Hello, I'm trying to use Graphtyper because it seems fast and scales well to many samples. It has been very easy to install and run so far, but I'm unsure whether I'll be able to make the most of it.

My understanding from this issue https://github.com/DecodeGenetics/graphtyper/issues/42 is that Graphtyper mostly works with the output of Manta as its catalogue of variants (which can be filtered using discovery with other tools, as suggested in issue 42). Is this still up to date, or is it now possible to use VCFs from other tools? I'm thinking, of course, of SV detection with long reads (perhaps the output of Sniffles?).

I am also currently trying to use an SV database built with Smoove, but I'm afraid I'm running into the problem of missing information (SVINSSEQ?), because all SVs genotyped by Graphtyper were flagged "LowQUAL".

Does anyone have suggestions or advice on making the most of an existing SV database when genotyping SVs with Graphtyper?

Bonus question: How is the quality of an SV call determined? Is it based on coverage, and is it a parameter that we could or should tune depending on the dataset?

Thanks a lot for your help. Best regards, Claire

hannespetur commented 4 years ago

Hello,

Yes, this is still up to date: genotyping Manta variants has been yielding the best results in our evaluations, mostly because Manta provides more information on the breakpoint sequence, as you mentioned. I don't know whether Sniffles outputs inserted bases or whether its breakpoint coordinates are typically accurate.

Graphtyper assumes the input breakpoint information is accurate and realigns the reads to the SV breakpoint sequences. If the input is inaccurate, the realignment will be worse and the results will be suboptimal. This is a limitation of the method, and I am not aware of any method or tool that error-corrects the input to fix it. If the SV database wasn't created with Manta, then many breakpoint sequences will likely be incorrect. Perhaps the best option for an SV database created with smoove is to also use smoove to call the SVs. I find it very surprising that all of the SVs were LowQUAL, though. I would expect some of the SVs to have exact coordinates and no inserted bases. I'd love to see some examples to better understand what's going on, if that is possible.

The quality (QUAL) is essentially the log likelihood that the ALT is not a variant. It is determined from coverage as well as the allele balance and the quality of the graph alignments supporting the alternative allele. There is also QD (quality by depth), which is normalized by the alternative allele's sequencing depth and is often more useful than the raw QUAL value.
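As a rough illustration of filtering on QD rather than raw QUAL, here is a minimal Python sketch. It assumes QD appears as a `QD=` key in the VCF INFO column. The helper names and the `min_qd` cutoff are hypothetical, not values recommended by Graphtyper, and would need tuning per dataset.

```python
def parse_info(info_field):
    """Parse a VCF INFO column into a dict; flag-only keys map to True."""
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def passes_qd(vcf_line, min_qd=12.0):
    """Return True if the record's INFO/QD is at least min_qd.

    min_qd is a hypothetical cutoff chosen only for illustration.
    Records without a numeric QD value are rejected.
    """
    columns = vcf_line.rstrip("\n").split("\t")
    info = parse_info(columns[7])  # INFO is the 8th VCF column
    qd = info.get("QD")
    if qd is None or qd is True:
        return False
    return float(qd) >= min_qd
```

Applying `passes_qd` to each non-header line of a genotyped VCF keeps only records whose depth-normalized quality clears the chosen cutoff.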

Best, Hannes

jjfarrell commented 4 years ago

As input to Graphtyper, I have converted the non-Manta discovery calls into explicit REF and ALT sequences, without the symbolic <INS>, <DEL>, and <DUP> alleles. That seems to work. I also only use the precise calls from the SV callers. For example, Delly labels its calls as PRECISE or IMPRECISE, so I filter out the imprecise variants. Scalpel also does a good job of finding precise breakpoints but is computationally slow.
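The conversion described above could be sketched roughly as follows in Python. This is not the script actually used in this thread. It assumes the reference chromosome is available as an in-memory string, that POS follows the VCF convention of pointing at the base just before the event, and that the inserted sequence is known. All function names are hypothetical.

```python
def is_precise(info_field):
    """True if the INFO column carries the PRECISE flag and not IMPRECISE."""
    keys = {entry.split("=", 1)[0] for entry in info_field.split(";")}
    return "PRECISE" in keys and "IMPRECISE" not in keys


def deletion_to_explicit(ref_seq, pos, end):
    """Turn a symbolic <DEL> into explicit (REF, ALT) alleles.

    ref_seq is the chromosome sequence; pos and end are 1-based, with pos
    the anchor base before the deletion and end the last deleted base.
    """
    ref_allele = ref_seq[pos - 1:end]  # anchor base plus deleted bases
    alt_allele = ref_seq[pos - 1]      # anchor base only
    return ref_allele, alt_allele


def insertion_to_explicit(ref_seq, pos, inserted):
    """Turn a symbolic <INS> with a known inserted sequence into (REF, ALT)."""
    anchor = ref_seq[pos - 1]
    return anchor, anchor + inserted
```

For example, with `ref_seq = "ACGTACGT"`, a deletion at `pos=2, end=5` becomes REF `CGTA` and ALT `C`. Records for which `is_precise` returns False would simply be dropped before conversion.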

clairemerot commented 4 years ago

Thanks to both of you for your answers. Hannes, you are right: a few of them did indeed have "PASS", although that was very rare. I'll give it a try with smoove directly. So regarding quality, would you suggest trusting QD instead? (I also have SVs found by Manta on another project, which should run well.)

Thank you for the suggestion to do the conversion, @jjfarrell. Was it from Delly?

Has anyone used SVs discovered from long reads for subsequent genotyping? I'm trying with vg too, but Graphtyper seems easier to run!

Thanks a lot, Claire

jjfarrell commented 4 years ago

@clairemerot I have downloaded PacBio Hi-Fidelity SVs and other PacBio calls and run them through Graphtyper. The Hi-Fidelity PacBio calls seem more precise and work better for genotyping than those from the earlier PacBio technology.

I have converted calls from Delly (filtering on PRECISE), Scalpel, and various PacBio call sets.