Genotyping 808 samples

Jordi-V commented 5 years ago

Dear Jonas,

I read the paper and I have to congratulate you, because it is really interesting! I am contacting you because I am interested in using BayesTyper to genotype my samples. I have 808 samples (at 30X) that were run independently through different variant callers, among them CNVnator, Platypus and Manta. First, I merged all the VCFs with a tool named SURVIVOR. Now I want to genotype all my samples together in order to obtain the genotype likelihoods, prior and posterior respectively. But as I read in the documentation, I have to use batches to do that... So could you tell me whether it is possible to use BayesTyper to genotype all my samples per caller? Or do I have to merge all the VCFs and then use BayesTyper?

Thanks for your help and time

Jordi

jonassibbesen commented 5 years ago

Dear Jordi,

Thank you for writing.

It is currently not possible to run BayesTyper on more than 30 samples at a time. Therefore, as you also mention, you will need to run your samples in batches. We have written a small guide about how to do this on the wiki.

My recommendation would be that you combine the variants predicted across all 808 samples into a single VCF file and then run BayesTyper on this in batches of 10-15 samples. The guide says to run in batches of 30 samples; however, I have previously had good experience running with smaller batch sizes (10-15 samples). This should also help with computation time, since the genotyping step can scale almost quadratically with the number of samples for complex graphs. After you have genotyped all the batches, you can combine them into a single VCF using the bcftools merge command described in the guide.
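As a rough illustration of the batching step, here is a minimal Python sketch that splits a list of sample names into batches of at most 15. The sample names (`sample_001` and so on) and the helper names `make_batches` and `relative_cost` are hypothetical, and the quadratic cost model is only an assumption based on the "almost quadratic" remark above, not a measured BayesTyper runtime. The sketch does not call BayesTyper or bcftools; it only plans the batches.

```python
def make_batches(samples, batch_size):
    """Split samples into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


def relative_cost(batches):
    """Toy cost: assumes genotyping time grows with the square of batch size."""
    return sum(len(batch) ** 2 for batch in batches)


# Hypothetical sample identifiers for an 808-sample cohort.
samples = [f"sample_{i:03d}" for i in range(1, 809)]

batches_15 = make_batches(samples, 15)
batches_30 = make_batches(samples, 30)

print(len(batches_15), "batches of up to 15 samples")
print(len(batches_30), "batches of up to 30 samples")
print("cost ratio (30 vs 15):", relative_cost(batches_30) / relative_cost(batches_15))
```

Each batch would then be genotyped against the same combined variant set, and the per-batch outputs merged afterwards as described in the guide.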

Hope it answered your questions.

Cheers,

Jonas

Jordi-V commented 5 years ago

Hi Jonas,

thanks for your reply, but if I use batches of 15 or 30, do you think the genotype likelihood (GL) will be affected because a batch is not representative of my cohort? I want to know the probability of a genotype appearing in my population, and if I use a subsample of my cohort this is not correct, right? I want to use the GLs to do phasing with a program like SHAPEIT2, and I consider the GL important for obtaining my haplotypes... So if I use a subsample of my cohort, won't I bias my results? Is it correct to do that? Because in order to calculate the GL I want to use reference panels or databases plus my cohort for priors and posteriors...

So is the number of samples not important for calculating the GL? Is using all samples to calculate the GL the best option?

Thanks for your reply and help

Jordi

jonassibbesen commented 5 years ago

Thank you for the questions. It is an interesting topic.

First, I want to mention that BayesTyper does not provide genotype likelihoods. It instead calculates and provides a genotype posterior probability. I am not sure whether SHAPEIT2 is able to use posterior probabilities instead of likelihoods.

Regarding the questions about batching: in BayesTyper, the haplotype candidates and the population prior on the haplotype frequencies are the main things shared across samples. While the population prior can help to inform genotyping across samples, its influence on the posterior diminishes as the number of k-mers across a variant and the coverage increase. How important the prior is at 30x coverage is hard to say; however, from experience I would guess that its influence on the posterior is small at that coverage. I would therefore not worry about batching when genotyping your samples.
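To make the point about the prior concrete, here is a toy Python sketch. It is not BayesTyper's model: it assumes a single biallelic site, a simple binomial likelihood over k-mer observations with a fixed error rate, and Hardy-Weinberg genotype priors. All function names and parameter values are hypothetical. It compares the posterior under two very different priors as the number of observations grows.

```python
from math import comb


def hwe_prior(alt_freq):
    """Hardy-Weinberg prior over 0, 1 or 2 copies of the alternative allele."""
    ref_freq = 1.0 - alt_freq
    return [ref_freq ** 2, 2 * ref_freq * alt_freq, alt_freq ** 2]


def genotype_posterior(alt_count, total, prior, error=0.01):
    """Posterior over genotypes given alt_count alt k-mers out of total."""
    unnormalised = []
    for copies, prior_prob in enumerate(prior):
        frac = copies / 2
        # Probability that one observation supports the alt allele.
        p_alt = frac * (1 - error) + (1 - frac) * error
        likelihood = (comb(total, alt_count)
                      * p_alt ** alt_count
                      * (1 - p_alt) ** (total - alt_count))
        unnormalised.append(prior_prob * likelihood)
    norm = sum(unnormalised)
    return [value / norm for value in unnormalised]


def prior_gap(total):
    """Largest posterior difference between a rare-allele and a common-allele prior."""
    alt_count = total // 2  # data that look heterozygous
    rare = genotype_posterior(alt_count, total, hwe_prior(0.01))
    common = genotype_posterior(alt_count, total, hwe_prior(0.5))
    return max(abs(a - b) for a, b in zip(rare, common))


for total in (2, 10, 30):
    print(total, prior_gap(total))
```

In this toy setting the gap between the two posteriors shrinks quickly as the number of observations increases, which matches the intuition that the choice of batch matters less at high coverage.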

In general, I would say that it is more important that all samples are genotyped on the same set of variants, in order to get proper posterior estimates for each variant across samples.

Hope it helped clarify the issue a bit.

bioinformatics-centre / BayesTyper

Genotyping 808 samples #10