WansonChoi / HATK

A collection of modules to process and analyze IMGT-HLA sequences.
26 stars 9 forks source link

Association output only on SNPs #16

Open ciaran-campbell opened 2 years ago

ciaran-campbell commented 2 years ago

Hi there, When I run HATK on one dataset, the association output has only worked for SNPs with rsids. The following error is displayed and no heatmaps are produced? Do you know what could be causing this?

Thanks for the help

image

WansonChoi commented 2 years ago

@ciaran-campbell

Hi, Thank you for your interest in HATK.

It's highly likely that the bMarkerGenerator step failed.

You should check (1) your input dictionaries are well prepared with the IMGT2Seq, (2) your HLA type data goes through the NomenCleaner so that its field resolution is well matched to that of the input dictionaries.

If the problem isn't solved, then could you consider sending your input HLA type and SNP data? (Anonymize samples' label if necessary) I'll try to work on them.

schen2643 commented 2 years ago

Hello!

Thanks a lot for your reply. From our log file, it seems IMGT2Seq and NomenCleaner have run successfully (?) as no error was reported.

[IMGT2Seq.py]: Multiprocessing.

[ProcessIMGT.py]: Generating sequence information dictionary for HLA B.

[ProcessIMGT.py]: Generating sequence information dictionary for HLA DPA1.

[ProcessIMGT.py]: Generating sequence information dictionary for HLA DQA1.

[ProcessIMGT.py]: Generating sequence information dictionary for HLA DQB1.

[ProcessIMGT.py]: Generating sequence information dictionary for HLA A.

[ProcessIMGT.py]: Generating sequence information dictionary for HLA C.

[ProcessIMGT.py]: Generating sequence information dictionary for HLA DPB1.

[ProcessIMGT.py]: Generating sequence information dictionary for HLA DRB1.
Warning: Variants 'HLA_A*01:02' and 'HLA_A*01:01' have the same position.
Warning: Variants 'HLA_A*02:01' and 'HLA_A*01:02' have the same position.
Warning: Variants 'HLA_A*02:05' and 'HLA_A*02:01' have the same position.
2740 more same-position warnings: see log file.

[HLA_Study.py]: IMGT2Seq result : 
< IMGT2Sequence(Newly generated.) >
- HLA Amino Acids : /stanley/genetics/analysis/epi25/sc/ilae_gwas/hla/eur_hla/hg18/HLA_DICTIONARY_AA.hg18.imgt3320
- HLA SNPs : /stanley/genetics/analysis/epi25/sc/ilae_gwas/hla/eur_hla/hg18/HLA_DICTIONARY_SNPS.hg18.imgt3320
- HLA Allele Table : /stanley/genetics/analysis/epi25/sc/ilae_gwas/hla/eur_hla/hg18/HLA_ALLELE_TABLE.imgt3320.hat
- Maptables for heatmap : 
   A   : /stanley/genetics/analysis/epi25/sc/ilae_gwas/hla/eur_hla/hg18/HLA_MAPTABLE_A.hg18.imgt3320.txt
   B   : /stanley/genetics/analysis/epi25/sc/ilae_gwas/hla/eur_hla/hg18/HLA_MAPTABLE_B.hg18.imgt3320.txt
   C   : /stanley/genetics/analysis/epi25/sc/ilae_gwas/hla/eur_hla/hg18/HLA_MAPTABLE_C.hg18.imgt3320.txt
   DPA1: /stanley/genetics/analysis/epi25/sc/ilae_gwas/hla/eur_hla/hg18/HLA_MAPTABLE_DPA1.hg18.imgt3320.txt
   DPB1: /stanley/genetics/analysis/epi25/sc/ilae_gwas/hla/eur_hla/hg18/HLA_MAPTABLE_DPB1.hg18.imgt3320.txt
   DQA1: /stanley/genetics/analysis/epi25/sc/ilae_gwas/hla/eur_hla/hg18/HLA_MAPTABLE_DQA1.hg18.imgt3320.txt
   DQB1: /stanley/genetics/analysis/epi25/sc/ilae_gwas/hla/eur_hla/hg18/HLA_MAPTABLE_DQB1.hg18.imgt3320.txt
   DRB1: /stanley/genetics/analysis/epi25/sc/ilae_gwas/hla/eur_hla/hg18/HLA_MAPTABLE_DRB1.hg18.imgt3320.txt

[HLA_Study.py]: Given HPED file('/stanley/genetics/analysis/epi25/sc/ilae_gwas/hla/eur_hla/eur_epi_cc_unrel_qc_chr1-23-updated-chr6+1000G.MHC.HLA_IMPUTATION_OUT.hped') is to be processed by NomenCleaner.

[NomenCleaner.py]: Generating CHPED with Maximum 2 fields HLA alleles.

[bMarkerGenerator.py]: Making Reference Panel for "/stanley/genetics/analysis/epi25/sc/ilae_gwas/hla/eur_hla/hg18/eur_epi_cc_unrel_qc_chr1-23-updated-chr6+1000G_hatk_hg18_isGGE_rm-ibd"

[1] Generating Amino acid(AA)sequences from HLA types.
[2] Encoding Amino acids positions.
[3] Encoding HLA alleles.
[4] Generating DNA(SNPS) sequences from HLA types.
[5] Encoding SNP positions.
[6] Extracting founders.
[7] Merging SNP, HLA, and amino acid datasets.
[8] Performing quality control.
[9] Making reference panel for HLA-AA,SNPS,HLA and Normal variants(SNPs) is Done!

though we did not run IMGT2Seq and NomenCleaner separately - we ran the pipeline following the example in the documentation (see below), which says 'This command will implement (1) IMGT2Seq, (2) NomenCleaner, (3) bMarkerGenerator, (4) HLA_Analyzer(Association Test - logistic regression), (5) Manhattan Plot and (6) Heatmap Plot, which are the minimal components for HLA fine-mapping analysis.'

python3 HATK.py \
    --variants ${pop_resdir}/${pfx}.COPY.LiftDown_hg18 \
    --hped ${pop_resdir}/${pfx_o2}.MHC.HLA_IMPUTATION_OUT.hped \
    --2field \
    --out ${pop_resdir}/hg18/${pfx_o3} \
    --imgt 3320 \
    --hg 18 \
    --imgt-dir example/IMGTHLA3320 \
    --pheno ${phenodir}/eur_epi_cc${sub}_gwas_pheno_covar_${epi_type}.hatk.tsv \
    --pheno-name ${epi_type} \
    --covar ${phenodir}/eur_epi_cc${sub}_gwas_pheno_covar_${epi_type}.hatk.tsv \
    --covar-name SEX,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
    --multiprocess 2

So is there any difference by running the above script vs running IMGT2Seq and NomenCleaner separately? The complete log file is also attached. Would appreciate your suggestions! And we are happy to send anonymized data for further examination. Thanks!

test_epi.log

WansonChoi commented 2 years ago

@schen2643

Hi, schen2643. Thank you for your interest in HATK.

No. there is no difference.

The 'bMarkerGenerator' is practically the main process for HLA fine-mapping while it requires some preprocessings for data from the IPD-IMGT/HLA database and raw HLA type data. The IMGT2Seq and NomenCleaner are for this. In other words, Once those required preprocessings are done, users don't have to repeat them and can save time. The IMGT2Seq can be skipped with the '--dict-AA' and '--dict-SNPS' arguments and the NomenCleaner can be skipped with the '--chped' argument, where those arguments essentailly take output from those preprocessing steps.

If Logistic regression result and Manhattan plot were generated well, then there would be no significant problem. In your log file, plotting heatmap for 8 HLA genes failed anyway. If you need help related to this, Please consider sending the anonymized genotype and HLA type data.

PoojaMiddha commented 1 year ago

@schen2643 & @ciaran-campbell Were you able to identify the issue. I am running into the same issue. And in the association file I see results for SNPs but NA for HLA markers and AAs. Any help is much appreciated.