> > No observable difference with standard GTC input.
I don't observe any difference in the QC_Report.docx. But we do see differences in the corresponding Plink files created (BED/BIM/FAM). For example:
$ diff samples.bim ../../020a/sample_level/samples.bim | head
5c5
< 1 rs3131972 0 817341 G A
---
> 1 rs3131972 0 817341 A G
We see the alleles flipped. Can you address this?
This seems a flip on the gtc-to-vcf workflow when it imports to plink rather than on bcf entry point with bcf to plink. In the bcf entry point I have protected the allele order with --keep-allele-order switch so allele order would always be preserved from bcf to plink conversion. The bcf/vcf encodes alleles in the REF/ALT space but plink bim file encodes in the major/minor space. If we want to preserve the VCF allele order and go with the typical notion that REF allele is major allele. So,
vcf/bcf record (CHROM ID REF ALT):
chr1 rs3131972 A G
should translated into bim record (CHROM ID CM POS ALLELE1(minor) ALLELE2 (major)):
1 rs3131972 0 817341 G A
Another simpler and emperical way to verify this is if you re-export samples_level/samples.bim to vcf, you would see the orignal allele order as in vcf.
You can reexport plink bed/bim/fam to vcf.gz with the following:
plink --bfile bcf_out/samples --recode vcf-iid bgz --out bcf_out/samples --allow-extra-chr --keep-allele-order --const-fid 0
And compare the plink rexported with the orignal input like this:
diff <(bcftools query -f"%ID %REF %ALT" /scratch/marksja/samples.bcf |sort) <(bcftools query -f"%ID %REF %ALT" bcf_out/samples.vcf.gz |sort)
You should expect to see no difference except for multi-allellics not imported which is not supported by plink/1.9 and would need to be fixed upstream in preparation of bcf.
Overall, I can open a separate issue and we can adress this more holistically throughout the pipeline in a different PR. However, for the BCF entry point, I see the transformation is as expected with no flip.
Feel free to let me know if this seems acceptable.
Verify that gtc-to-vcf also includes the --keep-allele-orderswitch.
Also may be worth bring up in meeting with all users to ensure which allele order would they prefer to have in the output. REF/ALT manner or MAJOR/MINOR. If --keep-allele-order is not added at the import, plink will switch alleles depending on the dataset cohort.
This seems a flip on the gtc-to-vcf workflow when it imports to plink rather than on bcf entry point with bcf to plink. In the bcf entry point I have protected the allele order with
--keep-allele-order
switch so allele order would always be preserved from bcf to plink conversion. The bcf/vcf encodes alleles in the REF/ALT space but plink bim file encodes in the major/minor space. If we want to preserve the VCF allele order and go with the typical notion that REF allele is major allele. So,vcf/bcf record (CHROM ID REF ALT):
chr1 rs3131972 A G
should translated into bim record (CHROM ID CM POS ALLELE1(minor) ALLELE2 (major)):1 rs3131972 0 817341 G A
bim format is described here: https://www.cog-genomics.org/plink/1.9/formats#bim
Another simpler and emperical way to verify this is if you re-export samples_level/samples.bim to vcf, you would see the orignal allele order as in vcf.
You can reexport plink bed/bim/fam to vcf.gz with the following:
plink --bfile bcf_out/samples --recode vcf-iid bgz --out bcf_out/samples --allow-extra-chr --keep-allele-order --const-fid 0
And compare the plink rexported with the orignal input like this:diff <(bcftools query -f"%ID %REF %ALT" /scratch/marksja/samples.bcf |sort) <(bcftools query -f"%ID %REF %ALT" bcf_out/samples.vcf.gz |sort)
You should expect to see no difference except for multi-allellics not imported which is not supported by plink/1.9 and would need to be fixed upstream in preparation of bcf.
Overall, I can open a separate issue and we can adress this more holistically throughout the pipeline in a different PR. However, for the BCF entry point, I see the transformation is as expected with no flip.
Feel free to let me know if this seems acceptable.
Originally posted by @rajwanir in https://github.com/NCI-CGR/GwasQcPipeline/issues/314#issuecomment-2349538855