Closed rajwanir closed 1 month ago
Hi @rajwanir,
Thank you for the PR! I’ve completed my review of your commits, and now I need to conduct some testing to ensure everything works as expected. I have a few questions and comments that I’d like to summarize here:
Thank you!
I am encountering the following error with the pull_b_allele_freq_from_1kg
rule. Could you please help address this error message?
[Wed Sep 11 14:21:26 2024]
rule pull_b_allele_freq_from_1kg:
input: /mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/samples.bcf, /DCEG/Projects/DataDelivery/GWAS_Data_Delivery/CONFLUENCE_FB28648_05172021/testing_500samples/manifest/ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf.gz
output: sample_level/GSAv3Confluence.AF.abf.txt
jobid: 0
resources: mem_mb=19787, disk_mb=19787, tmpdir=/tmp
/home/marksja/miniconda3/envs/cgr-dev/bin/python /mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/024/.snakemake/scripts/tmprlcuev93.vcf2abf.py
[E::idx_find_and_load] Could not retrieve index file for '/mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/samples.bcf'
Traceback (most recent call last):
File "/mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/024/.snakemake/scripts/tmprlcuev93.vcf2abf.py", line 144, in <module>
main(**defaults) # type: ignore
File "/mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/024/.snakemake/scripts/tmprlcuev93.vcf2abf.py", line 77, in main
rec.info["GC_SCORE"],
File "pysam/libcbcf.pyx", line 2598, in pysam.libcbcf.VariantRecordInfo.__getitem__
KeyError: 'Unknown INFO field: GC_SCORE'
[Wed Sep 11 14:21:27 2024] Error in rule pull_b_allele_freq_from_1kg:
jobid: 0 output: sample_level/GSAv3Confluence.AF.abf.txt
Thanks!
There are two identical files in the sub_workflows directory now.
sub_workflows$ ls
contamination.smk delivery.smk entry_points.smk idat_intensity.smk intensity_check.smk sample_qc.smk subject_qc.smk
No observable difference with standard GTC input.
I don't observe any difference in the QC_Report.docx. But we do see differences in the corresponding Plink files created (BED/BIM/FAM). For example:
$ diff samples.bim ../../020a/sample_level/samples.bim | head
5c5
< 1 rs3131972 0 817341 G A
---
> 1 rs3131972 0 817341 A G
We see the alleles flipped. Can you address this?
No observable difference with standard GTC input.
I don't observe any difference in the QC_Report.docx. But we do see differences in the corresponding Plink files created (BED/BIM/FAM). For example:
$ diff samples.bim ../../020a/sample_level/samples.bim | head 5c5 < 1 rs3131972 0 817341 G A --- > 1 rs3131972 0 817341 A G
We see the alleles flipped. Can you address this?
This seems a flip on the gtc-to-vcf workflow when it imports to plink rather than on bcf entry point with bcf to plink. In the bcf entry point I have protected the allele order with --keep-allele-order
switch so allele order would always be preserved from bcf to plink conversion. The bcf/vcf encodes alleles in the REF/ALT space but plink bim file encodes in the major/minor space. If we want to preserve the VCF allele order and go with the typical notion that REF allele is major allele. So,
vcf/bcf record (CHROM ID REF ALT):
chr1 rs3131972 A G
should translated into bim record (CHROM ID CM POS ALLELE1(minor) ALLELE2 (major)):
1 rs3131972 0 817341 G A
bim format is described here: https://www.cog-genomics.org/plink/1.9/formats#bim
Another simpler and emperical way to verify this is if you re-export samples_level/samples.bim to vcf, you would see the orignal allele order as in vcf.
You can reexport plink bed/bim/fam to vcf.gz with the following:
plink --bfile bcf_out/samples --recode vcf-iid bgz --out bcf_out/samples --allow-extra-chr --keep-allele-order --const-fid 0
And compare the plink rexported with the orignal input like this:
diff <(bcftools query -f"%ID %REF %ALT" /scratch/marksja/samples.bcf |sort) <(bcftools query -f"%ID %REF %ALT" bcf_out/samples.vcf.gz |sort)
You should expect to see no difference except for multi-allellics not imported which is not supported by plink/1.9 and would need to be fixed upstream in preparation of bcf.
Overall, I can open a separate issue and we can adress this more holistically throughout the pipeline in a different PR. However, for the BCF entry point, I see the transformation is as expected with no flip.
Feel free to let me know if this seems acceptable.
I am encountering the following error with the
pull_b_allele_freq_from_1kg
rule. Could you please help address this error message?[Wed Sep 11 14:21:26 2024] rule pull_b_allele_freq_from_1kg: input: /mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/samples.bcf, /DCEG/Projects/DataDelivery/GWAS_Data_Delivery/CONFLUENCE_FB28648_05172021/testing_500samples/manifest/ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf.gz output: sample_level/GSAv3Confluence.AF.abf.txt jobid: 0 resources: mem_mb=19787, disk_mb=19787, tmpdir=/tmp /home/marksja/miniconda3/envs/cgr-dev/bin/python /mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/024/.snakemake/scripts/tmprlcuev93.vcf2abf.py [E::idx_find_and_load] Could not retrieve index file for '/mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/samples.bcf' Traceback (most recent call last): File "/mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/024/.snakemake/scripts/tmprlcuev93.vcf2abf.py", line 144, in <module> main(**defaults) # type: ignore File "/mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/024/.snakemake/scripts/tmprlcuev93.vcf2abf.py", line 77, in main rec.info["GC_SCORE"], File "pysam/libcbcf.pyx", line 2598, in pysam.libcbcf.VariantRecordInfo.__getitem__ KeyError: 'Unknown INFO field: GC_SCORE' [Wed Sep 11 14:21:27 2024] Error in rule pull_b_allele_freq_from_1kg: jobid: 0 output: sample_level/GSAv3Confluence.AF.abf.txt
Thanks!
270e1055378ac20ada4595a1c3ec8e8aad6969aa should fix this. Shifted from GC_SCORE to IGC field to keep compatability with gtc2vcf workflow.
There are two identical files in the sub_workflows directory now.
- idat_intensity.smk
- intensity_check.smk
sub_workflows$ ls contamination.smk delivery.smk entry_points.smk idat_intensity.smk intensity_check.smk sample_qc.smk subject_qc.smk
Removed in 9cda9ed5d48c0cea3656f3ec5ce1d29587c08da6
You should expect to see no difference except for multi-allellics not imported which is not supported by plink/1.9 and would need to be fixed upstream in preparation of bcf.
Overall, I can open a separate issue and we can adress this more holistically throughout the pipeline in a different PR. However, for the BCF entry point, I see the transformation is as expected with no flip.
Thanks for the explanation @rajwanir. Can you open a separate issue for this explaining what you found?
You should expect to see no difference except for multi-allellics not imported which is not supported by plink/1.9 and would need to be fixed upstream in preparation of bcf. Overall, I can open a separate issue and we can adress this more holistically throughout the pipeline in a different PR. However, for the BCF entry point, I see the transformation is as expected with no flip.
Thanks for the explanation @rajwanir. Can you open a separate issue for this explaining what you found?
Opened #321 to reference this allele flip issue.