BCF entry point with intensity and contamination checks using BCF for data_catalog usage.

rajwanir commented 2 months ago

Creates BCF entry point for data catalog.
Makes BPM manifest optional (relevant for aggregated file inputs e.g. BCF, VCF or BED)
Adds scripts to compute median intensity and contamination score from BCF.
Separates the intensity checks from contamination checks.
No observable difference with standard GTC input.

jaamarks commented 1 month ago

Hi @rajwanir,

Thank you for the PR! I’ve completed my review of your commits, and now I need to conduct some testing to ensure everything works as expected. I have a few questions and comments that I’d like to summarize here:

Could you please fetch the latest changes from the default branch and rebase (instead of merging)? This approach helps maintain a cleaner history.
I noticed a few instances where the variable name vcf_file was used, and I wondered if it should be bcf_file instead. My main concern is whether this might cause confusion for developers, or if it’s common practice to refer to BCF files as VCF since BCF is essentially the binary version of VCF. What are your thoughts on this?
I have some concerns about splitting the contamination subworkflow. Could you address this comment and discuss the pros and cons of this approach?

Thank you!

jaamarks commented 1 month ago

I am encountering the following error with the pull_b_allele_freq_from_1kg rule. Could you please help address this error message?

[Wed Sep 11 14:21:26 2024]
rule pull_b_allele_freq_from_1kg:
    input: /mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/samples.bcf, /DCEG/Projects/DataDelivery/GWAS_Data_Delivery/CONFLUENCE_FB28648_05172021/testing_500samples/manifest/ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf.gz
    output: sample_level/GSAv3Confluence.AF.abf.txt
    jobid: 0
    resources: mem_mb=19787, disk_mb=19787, tmpdir=/tmp                                                                       
/home/marksja/miniconda3/envs/cgr-dev/bin/python /mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/024/.snakemake/scripts/tmprlcuev93.vcf2abf.py
[E::idx_find_and_load] Could not retrieve index file for '/mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/samples.bcf'
Traceback (most recent call last):
  File "/mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/024/.snakemake/scripts/tmprlcuev93.vcf2abf.py", line 144, in <module>
    main(**defaults)  # type: ignore
  File "/mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/024/.snakemake/scripts/tmprlcuev93.vcf2abf.py", line 77, in main
    rec.info["GC_SCORE"],
  File "pysam/libcbcf.pyx", line 2598, in pysam.libcbcf.VariantRecordInfo.__getitem__
KeyError: 'Unknown INFO field: GC_SCORE'
[Wed Sep 11 14:21:27 2024]                                                                                                    Error in rule pull_b_allele_freq_from_1kg:
    jobid: 0                                                                                                                      output: sample_level/GSAv3Confluence.AF.abf.txt

Thanks!

jaamarks commented 1 month ago

There are two identical files in the sub_workflows directory now.

idat_intensity.smk

intensity_check.smk


sub_workflows$ ls
contamination.smk  delivery.smk  entry_points.smk  idat_intensity.smk  intensity_check.smk  sample_qc.smk  subject_qc.smk

jaamarks commented 1 month ago

No observable difference with standard GTC input.

I don't observe any difference in the QC_Report.docx. But we do see differences in the corresponding Plink files created (BED/BIM/FAM). For example:

$ diff samples.bim ../../020a/sample_level/samples.bim | head
5c5
< 1     rs3131972       0       817341  G       A
---
> 1     rs3131972       0       817341  A       G

We see the alleles flipped. Can you address this?

rajwanir commented 1 month ago

No observable difference with standard GTC input.

I don't observe any difference in the QC_Report.docx. But we do see differences in the corresponding Plink files created (BED/BIM/FAM). For example:
$ diff samples.bim ../../020a/sample_level/samples.bim | head
5c5
< 1     rs3131972       0       817341  G       A
---
> 1     rs3131972       0       817341  A       G
We see the alleles flipped. Can you address this?

This seems a flip on the gtc-to-vcf workflow when it imports to plink rather than on bcf entry point with bcf to plink. In the bcf entry point I have protected the allele order with --keep-allele-order switch so allele order would always be preserved from bcf to plink conversion. The bcf/vcf encodes alleles in the REF/ALT space but plink bim file encodes in the major/minor space. If we want to preserve the VCF allele order and go with the typical notion that REF allele is major allele. So,

vcf/bcf record (CHROM ID REF ALT): chr1 rs3131972 A G should translated into bim record (CHROM ID CM POS ALLELE1(minor) ALLELE2 (major)): 1 rs3131972 0 817341 G A

bim format is described here: https://www.cog-genomics.org/plink/1.9/formats#bim

Another simpler and emperical way to verify this is if you re-export samples_level/samples.bim to vcf, you would see the orignal allele order as in vcf.

You can reexport plink bed/bim/fam to vcf.gz with the following: plink --bfile bcf_out/samples --recode vcf-iid bgz --out bcf_out/samples --allow-extra-chr --keep-allele-order --const-fid 0 And compare the plink rexported with the orignal input like this: diff <(bcftools query -f"%ID %REF %ALT" /scratch/marksja/samples.bcf |sort) <(bcftools query -f"%ID %REF %ALT" bcf_out/samples.vcf.gz |sort)

You should expect to see no difference except for multi-allellics not imported which is not supported by plink/1.9 and would need to be fixed upstream in preparation of bcf.

Overall, I can open a separate issue and we can adress this more holistically throughout the pipeline in a different PR. However, for the BCF entry point, I see the transformation is as expected with no flip.

Feel free to let me know if this seems acceptable.

rajwanir commented 1 month ago

I am encountering the following error with the pull_b_allele_freq_from_1kg rule. Could you please help address this error message?

[Wed Sep 11 14:21:26 2024]
rule pull_b_allele_freq_from_1kg:
    input: /mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/samples.bcf, /DCEG/Projects/DataDelivery/GWAS_Data_Delivery/CONFLUENCE_FB28648_05172021/testing_500samples/manifest/ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf.gz
    output: sample_level/GSAv3Confluence.AF.abf.txt
    jobid: 0
    resources: mem_mb=19787, disk_mb=19787, tmpdir=/tmp                                                                       
/home/marksja/miniconda3/envs/cgr-dev/bin/python /mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/024/.snakemake/scripts/tmprlcuev93.vcf2abf.py
[E::idx_find_and_load] Could not retrieve index file for '/mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/samples.bcf'
Traceback (most recent call last):
  File "/mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/024/.snakemake/scripts/tmprlcuev93.vcf2abf.py", line 144, in <module>
    main(**defaults)  # type: ignore
  File "/mnt/nfs/gigantor/ifs/DCEG/Home/marksja/dev-tests/024/.snakemake/scripts/tmprlcuev93.vcf2abf.py", line 77, in main
    rec.info["GC_SCORE"],
  File "pysam/libcbcf.pyx", line 2598, in pysam.libcbcf.VariantRecordInfo.__getitem__
KeyError: 'Unknown INFO field: GC_SCORE'
[Wed Sep 11 14:21:27 2024]                                                                                                    Error in rule pull_b_allele_freq_from_1kg:
    jobid: 0                                                                                                                      output: sample_level/GSAv3Confluence.AF.abf.txt

Thanks!

270e1055378ac20ada4595a1c3ec8e8aad6969aa should fix this. Shifted from GC_SCORE to IGC field to keep compatability with gtc2vcf workflow.

rajwanir commented 1 month ago

There are two identical files in the sub_workflows directory now.

idat_intensity.smk

intensity_check.smk
sub_workflows$ ls
contamination.smk  delivery.smk  entry_points.smk  idat_intensity.smk  intensity_check.smk  sample_qc.smk  subject_qc.smk

Removed in 9cda9ed5d48c0cea3656f3ec5ce1d29587c08da6

jaamarks commented 1 month ago

You should expect to see no difference except for multi-allellics not imported which is not supported by plink/1.9 and would need to be fixed upstream in preparation of bcf.

Overall, I can open a separate issue and we can adress this more holistically throughout the pipeline in a different PR. However, for the BCF entry point, I see the transformation is as expected with no flip.

Thanks for the explanation @rajwanir. Can you open a separate issue for this explaining what you found?

rajwanir commented 1 month ago

You should expect to see no difference except for multi-allellics not imported which is not supported by plink/1.9 and would need to be fixed upstream in preparation of bcf. Overall, I can open a separate issue and we can adress this more holistically throughout the pipeline in a different PR. However, for the BCF entry point, I see the transformation is as expected with no flip.

Thanks for the explanation @rajwanir. Can you open a separate issue for this explaining what you found?

Opened #321 to reference this allele flip issue.

NCI-CGR / GwasQcPipeline

BCF entry point with intensity and contamination checks using BCF for data_catalog usage. #314