NCI-CGR / GwasQcPipeline

The CGR GWAS QC processing workflow.
https://nci-cgr.github.io/GwasQcPipeline/
MIT License
0 stars 2 forks source link

When importing bcf/vcf to plink, alleles can get flipped. #321

Open rajwanir opened 1 week ago

rajwanir commented 1 week ago
          > > No observable difference with standard GTC input.

I don't observe any difference in the QC_Report.docx. But we do see differences in the corresponding Plink files created (BED/BIM/FAM). For example:

$ diff samples.bim ../../020a/sample_level/samples.bim | head
5c5
< 1     rs3131972       0       817341  G       A
---
> 1     rs3131972       0       817341  A       G

We see the alleles flipped. Can you address this?

This seems a flip on the gtc-to-vcf workflow when it imports to plink rather than on bcf entry point with bcf to plink. In the bcf entry point I have protected the allele order with --keep-allele-order switch so allele order would always be preserved from bcf to plink conversion. The bcf/vcf encodes alleles in the REF/ALT space but plink bim file encodes in the major/minor space. If we want to preserve the VCF allele order and go with the typical notion that REF allele is major allele. So,

vcf/bcf record (CHROM ID REF ALT): chr1 rs3131972 A G should translated into bim record (CHROM ID CM POS ALLELE1(minor) ALLELE2 (major)): 1 rs3131972 0 817341 G A

bim format is described here: https://www.cog-genomics.org/plink/1.9/formats#bim

Another simpler and emperical way to verify this is if you re-export samples_level/samples.bim to vcf, you would see the orignal allele order as in vcf.

You can reexport plink bed/bim/fam to vcf.gz with the following: plink --bfile bcf_out/samples --recode vcf-iid bgz --out bcf_out/samples --allow-extra-chr --keep-allele-order --const-fid 0 And compare the plink rexported with the orignal input like this: diff <(bcftools query -f"%ID %REF %ALT" /scratch/marksja/samples.bcf |sort) <(bcftools query -f"%ID %REF %ALT" bcf_out/samples.vcf.gz |sort)

You should expect to see no difference except for multi-allellics not imported which is not supported by plink/1.9 and would need to be fixed upstream in preparation of bcf.

Overall, I can open a separate issue and we can adress this more holistically throughout the pipeline in a different PR. However, for the BCF entry point, I see the transformation is as expected with no flip.

Feel free to let me know if this seems acceptable.

Originally posted by @rajwanir in https://github.com/NCI-CGR/GwasQcPipeline/issues/314#issuecomment-2349538855

rajwanir commented 1 week ago

We will need to:

  1. Verify that gtc-to-vcf also includes the --keep-allele-orderswitch.
  2. Also may be worth bring up in meeting with all users to ensure which allele order would they prefer to have in the output. REF/ALT manner or MAJOR/MINOR. If --keep-allele-order is not added at the import, plink will switch alleles depending on the dataset cohort.