bioinformatics-centre / BayesTyper

A method for variant graph genotyping based on exact alignment of k-mers

SV Typing only #18

Closed jjfarrell closed 4 years ago

jjfarrell commented 4 years ago

If one would like to genotype only SVs (to minimize CPU time), what would you recommend for the candidate variants file? For example, would limiting the SNVs to those with a high minor allele frequency speed things up at all? Or limiting the SNVs to those found in the samples rather than the larger dbSNP list? Or is it best to include a comprehensive set of SNVs for the algorithm to work optimally? We have VCFs from Lumpy, Delly, Manta, Strelka2, Scalpel and GATK HaplotypeCaller.

I would like to try BayesTyper on 5000 30x WGS samples for genotyping SVs. Also, are there any benefits to running in batches versus 5000 single-sample runs for these CRAMs? What would you recommend?

jonassibbesen commented 4 years ago

Hi,

Using the SNVs found in your 5000 samples should be sufficient. The advantage of using an annotation is that it might contain SNVs close to an SV breakpoint that, e.g., GATK HaplotypeCaller did not find. However, since you have many samples, I would not worry too much about that being an issue. Also, including all dbSNP variants would probably give worse results due to the increased complexity of the graphs. I would therefore recommend that you use only the predicted SNVs. For indels and especially SVs, I would recommend using an annotation.
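As a rough illustration of the selection described above, here is a minimal Python sketch. It is not part of BayesTyper. It assumes variants are already sequence-resolved and represented as hypothetical `(chrom, pos, ref, alt)` tuples, it assumes the called indels and SVs are kept alongside the annotation ones, and it uses a 50 bp length cutoff to separate indels from SVs, which is a common convention rather than anything stated in this thread.

```python
from typing import Iterable, List, Tuple

# Hypothetical in-memory record: (chrom, pos, ref, alt), sequence-resolved.
Variant = Tuple[str, int, str, str]

SV_MIN_LEN = 50  # assumed cutoff between indels and SVs


def variant_class(ref: str, alt: str) -> str:
    """Classify an allele by the lengths of its REF and ALT sequences."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if abs(len(ref) - len(alt)) >= SV_MIN_LEN:
        return "SV"
    # Everything else (short indels, MNPs, small complex alleles) is
    # lumped together as "indel" to keep the sketch short.
    return "indel"


def build_candidates(
    cohort_calls: Iterable[Variant], annotation: Iterable[Variant]
) -> List[Variant]:
    """Keep every variant predicted in the samples, and add only the
    non-SNV alleles (indels and SVs) from the annotation."""
    selected = set(cohort_calls)
    for variant in annotation:
        _, _, ref, alt = variant
        if variant_class(ref, alt) != "SNV":
            selected.add(variant)
    return sorted(selected)
```

With this selection, an SNV that appears only in the annotation (for example a dbSNP-only site) is left out, while annotation indels and SVs are added to the called variants.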

Regarding running in batches: given your relatively high coverage, in my experience you do not gain much by running in batches compared to single samples. I would therefore recommend running 5000 single samples on the same set of variants. It should also speed up the computations significantly.

Please let me know if you have any other questions.

Cheers,

Jonas

biozzq commented 4 years ago

Dear @jonassibbesen

Could I use only the candidate SVs as candidates, since I am focused only on CNVs? I found that most users use both SNPs and SVs as candidates to improve power, so I hesitated and am asking for help here.

Another question: I want to genotype SVs at the population level, but the breakpoints of an SV may differ among individuals. So my plan is to first genotype the candidate variants one by one and then merge the variants.

Sincerely, Zheng Zhuqing

jonassibbesen commented 4 years ago

Yes, it is possible to run on only SVs with the latest release (https://github.com/bioinformatics-centre/BayesTyper/releases/tag/v1.5). However, in my experience the results are generally better when SNVs and indels are also included.

I would recommend merging all SVs across all individuals and running each sample on the combined set.
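To make the merge step concrete, here is a minimal Python sketch of building one combined candidate set from per-individual SV calls. It is an illustration only, not BayesTyper's own tooling: the `(chrom, pos, ref, alt)` tuples and the `merge_candidates` helper are hypothetical, and alleles are merged only when they match exactly, so calls of the same event with different breakpoints remain separate entries.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical in-memory record: (chrom, pos, ref, alt), sequence-resolved.
Variant = Tuple[str, int, str, str]


def merge_candidates(
    per_sample: Dict[str, List[Variant]]
) -> Dict[Variant, Set[str]]:
    """Union the calls of all individuals into one candidate set.

    Returns a mapping from each distinct allele to the set of samples
    in which it was called, ordered by (chrom, pos, ref, alt).
    """
    support: Dict[Variant, Set[str]] = defaultdict(set)
    for sample, variants in per_sample.items():
        for variant in variants:
            support[variant].add(sample)
    # Sort on the variant key only; dicts preserve insertion order.
    return dict(sorted(support.items(), key=lambda item: item[0]))


calls = {
    "sampleA": [("chr1", 1000, "ACGT", "A")],
    "sampleB": [("chr1", 1000, "ACGT", "A"), ("chr2", 500, "G", "GTTT")],
}
combined = merge_candidates(calls)
# Every sample would then be genotyped against list(combined).
```

Keeping the supporting samples next to each allele is optional, but it makes it easy to check afterwards how many individuals contributed a given candidate.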

biozzq commented 4 years ago

Dear @jonassibbesen

You said that the breakpoints can affect the genotyping of SVs (https://github.com/bioinformatics-centre/BayesTyper/issues/32#issuecomment-633709315). When merging all SVs across all individuals, it will be difficult to keep the precise breakpoints for every individual, because different individuals may have different breakpoints for the same SV. I hope you can give me some suggestions for processing population-level SVs using short sequencing reads. Thank you.

Sincerely, Zhuqing