Illumina / gvcfgenotyper

A utility for merging and genotyping Illumina-style GVCFs.
Apache License 2.0
32 stars 2 forks source link

Issue with merging gVCFs using a BED file to define regions #16

Closed dalmiaa closed 2 years ago

dalmiaa commented 2 years ago

Hi there!

I have about 9700 genomic VCFs that I want to firstly filter using a BED file that defines about 500K SNPs, and merge into a single multisample VCF but as far as I can see, gvcfgenotyper only accepts the-r option to specify individual locations or chromosomes but not the-R wherein we can input a full BED file with the regions of interest? What is the best way, then, to go about this?

olest commented 2 years ago

Hi @dalmiaa, I'm sorry but gvcf genotyper does not provide this option. Even it had this option, I fear that it would be inefficient for your task.

It sounds to me that you want to perform a force-genotyping of these 500K snps. I also noticed that you are affiliated with Genomics England and I know that their research support teams have done similiar experiments.

It is very likely that they have a pipeline to do this or maybe already the output that you require.

I would suggest contacting Genomics England (by raising a service desk ticket). If they cannot help you, please do post here again and I can try to recommend some options.

Ole

jaredo commented 2 years ago

You could do this directly from cram using bcftools mpileup. Should be fast.

On Fri, Nov 5, 2021, 5:53 AM olest @.***> wrote:

Hi @dalmiaa https://github.com/dalmiaa, I'm sorry but gvcf genotyper does not provide this option. Even it had this option, I fear that it would be inefficient for your task.

It sounds to me that you want to perform a force-genotyping of these 500K snps. I also noticed that you are affiliated with Genomics England and I know that their research support teams have done similiar experiments.

It is very likely that they have a pipeline to do this or maybe already the output that you require.

I would suggest contacting Genomics England (by raising a service desk ticket). If they cannot help you, please do post here again and I can try to recommend some options.

Ole

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Illumina/gvcfgenotyper/issues/16#issuecomment-961871931, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACVIDXYISN4A52HVLPBSE3UKPHURANCNFSM5HHU3Q5A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

dalmiaa commented 2 years ago

Hi @jaredo, I'm attempting to merge the gVCFs using bcftools merge and including the -g tag to specify that they're gVCFs. I've used the -R option to add the bed file with the genomic regions for the 500K SNPs. What do you think of that?

jaredo commented 2 years ago

I'm not sure, best to ask the bcftools developers. Note there are various flavours of GVCF and bcftools might be expecting a different format.

Calling directly from CRAM with bcftools mpileup | bcftools call -C alleles would give you the desired result fairly quickly.

dalmiaa commented 2 years ago

Hi @dalmiaa, I'm sorry but gvcf genotyper does not provide this option. Even it had this option, I fear that it would be inefficient for your task.

It sounds to me that you want to perform a force-genotyping of these 500K snps. I also noticed that you are affiliated with Genomics England and I know that their research support teams have done similiar experiments.

It is very likely that they have a pipeline to do this or maybe already the output that you require.

I would suggest contacting Genomics England (by raising a service desk ticket). If they cannot help you, please do post here again and I can try to recommend some options.

Ole

Thanks, Ole. I have raised a ticket on the GEL service desk. I was wondering why this task would be inefficient using gvcfgenotyper, however? I have used it on these gVCFs successfully before with a single site or for an entire chromosome. So, I was just wondering why that would not be possible using a BED file for multiple regions?

Additionally, do you think generating multiple multi-sample VCFs per chromosome and then extracting the 500K SNPs would be a better way to go about it?

olest commented 2 years ago

Hi @dalmiaa , sorry for the late response. I did not get a notification from this thread.

My concern is with the number and size of the regions. In general, there is no problem with restricting the output of gVCF Genotyper for multiple regions.

Either way, based on your initial question I suspect that what you want is actually a force-genotyping of your 500K SNP sites. This is not what gVCF genotyper can do. If my suspicion is correct, I think a better option is to supply your list of SNPs at variant calling stage or use bcftools mpileup as @jaredo suggests above.

dalmiaa commented 2 years ago

Hi @olest, no worries. I spoke with the developers of bcftools, and told them that I was trying to merge all of the gVCFs specifying the regions of interest, and including the flag specifying that the input is that of genomic VCFs, and they seemed to think that should work ok. The output, so far, looks good to me, too.

I did have one question though, if I were to use gvcfgenotyper on chromosome 4, generate a joint VCF with only chr4 which is my chromosome of interest, and then used bcftools to extract the 500K SNPs, is that something you think would do the job?

olest commented 2 years ago

Hi @dalmiaa , the problem is that gVCF Genotyper writes a msVCF which contains only the variants in the input gVCF. So if you merge chr4 and then extract your SNP positions using, you would get an output only at the positions where one of the input gVCF had a variant overlapping one of your SNP positions. All other queries would come back empty.

But since you are speaking of SNP positions, one approach that might get you what you want is to add your SNPs in VCF format as an input file to the list of gVCFs. gVCF Genotyper can deal with both gVCF and VCF input so if you have your SNPs in VCF format that might work. I am saying might here because you will probably have to invent some VCF fields that gVCF Genotyper expects to be there.

This second option is a bit of a cumbersome approach and I am still not 100% clear what you are trying to achieve, so just pointing out there there is this option.

dalmiaa commented 2 years ago

Hi @olest - I am trying to, as you correctly noted, force genotype these 500K SNPs genome wide (or 30K SNPs in chromosome 4).

Since these are WGS data, I presume all of the variants would be present. Do you mean that gvcfgenotyper will only pick it up if at least one of the samples carries an ALT allele?

Quick edit: to add more context on what I am trying to achieve - I have a total of ±30K samples, of which I have genotyping array VCFs for ±20K, for the remaining ±10K people for whom I have gVCFs, I am trying to force genotype the array SNPs so as to have a full dataset of ±30K patients and these ±500K SNPs so as to be able to run analyses such as kinship analysis, phasing of the data, polygenic risk scores, etc.

jaredo commented 2 years ago

On Wed, Nov 24, 2021 at 6:28 AM Anupriya Dalmia @.***> wrote:

Since these are WGS data, I presume all of the variants would be present. Do you mean that gvcfgenotyper will only pick it up if at least one of the samples carries an ALT allele?

That is correct, the variant has to be polymorphic in the cohort you are genotyping. For a large cohort, all common variants should be present (say anything with expected minor allele count >= 10) but you will miss rare variants.

olest commented 2 years ago

Hi @dalmiaa let us know if Jared answered your question and I will close the ticket.

dalmiaa commented 2 years ago

Yes @olest, please go ahead.

Thank you, @jaredo!