WansonChoi / CookHLA

An accurate and efficient HLA imputation method.
25 stars 9 forks source link

advice on using the Han panel #11

Open pkuerten opened 2 years ago

pkuerten commented 2 years ago

I am trying to use the Han reference panel for which the authors have provided .markers and .bgl files. Any advice on how to generate the other reference files needed by cookHLA would be most helpful. Most of the file conversion tools I have tried complain about indels and A/P markers. Thank you for your guidance.

Detao-Zhang commented 2 years ago

Hi,

I've been doing something similar these past two weeks. Other files can be generated with plink and beagle utilities from the phased .bgl and .marker files. You need to convert the .bgl file to .ped format firstly and then to the plink binaries. There are also some small tweaks.

But according to the CookHLA paper, the author applied QC to the panel which left 9773 individuals. I did checked the appearances of the “presence” in the HLA binary markers but only ~200 samples were filtered out. Could the author describe the QC procedure in detail and I can then the update my panel. Thank you!

Detao

Aamiralizai commented 2 years ago

Hi Detao-Zhang Can you please explain more about the conversion of Han reference panel to the CookHLA format, How can the .bgl and marker reference data be use for the CookHLA

WansonChoi commented 2 years ago

@pkuerten @Detao-Zhang @Aamiralizai

A SNP2HLA-format reference panel consists of 4 types of markers. (1) SNP marker(ex. 'rs1143146'), (2) Amino acid marker(ex. 'AAA-22_30018317_V'), (3) HLA marker(ex. 'HLA_A_0101'), and (4) Intragenic SNP marker(ex. 'SNP_A_30019207_A').

[Marker QC]

The raw Han Chinese panel has the SNP markers(the (1)) labeled with their BPs, not dbSNP labels. We first tried to find dbSNP labels for those SNP markers. After converting the raw beagle file('HAN.MHC.reference.panel.bgl.gz') to PLINK format file with the 'beagle2linkage.jar' utility module(https://faculty.washington.edu/browning/beagle_utilities/utilities.html), we matched(i.e. inner-join) the SNP markers to the dbSNP information based on the hg19. Here, we did liftover from hg18 to hg19 to the SNP markers of the Han Chinese panel temporarily because the dbSNP data was given as hg19. Then, we replaced the BP labels with the found dbSNP labels (while the panel is hg18 as it is). # of markers was reduced from 29948 to 27792 after this matching job. I attach the matching result('HAN.MHC.reference.panel.hg19.LabelTable.txt') in the zip file.

After the replacement, we excluded one amino acid marker, 'AA_A_272_30020145_X', which caused an error in the MakeGeneticMap. (27792 -> 27791)

Also, we excluded 5-digit HLA markers such as 'HLA_A_11110', 'HLA_DPB1_10501', ..., etc. (11 markers) The final # of markers is 27780.

[Sample QC]

We kept samples whose 2-field(i.e. 4-digit) HLA types for all 8 HLA genes are completely available. In the raw Han Chinese panel, some samples have only one HLA allele for a HLA gene. For example,

sample1 : (0101, 0103), (1503, 2706), (1504, 1202), (0103, 0201), (2101, 2301), (0105, 0301), (0202, 0319), (0801, 1001) sample2 : (0101, 0103), (1503, 2706), (1504, NA), (0103, 0201), (2101, 2301), (0105, 0301), (0202, 0319), (0801, 1001) sample3 : (0101, 0103), (1503, 2706), (1504, 1202), (0103, 0201), (2101, NA), (0105, 0301), (0202, NA), (0801, 1001) sample4 : (0101, 0103), (1503, 2706), (1504, 1202), (0103, 0201), (NA, NA), (0105, 0301), (0202, NA), (0801, 1001)

We kept only samples like the sample1. We removed samples whose HLA genes have any NA like the sample2, 3, and 4. The final # of samples is 9773.

I attach the final '*.bim' and '*.fam' files in the zip file. ('HAN_China_v6.NoWeird.{bim,fam}')

[File processing]

As mentioned above, we mainly used the PLINK and Beagle utility modules.

(1) PLINK Genotype files. (*.{bed,bim,fam}) We used the '--make-bed' argument to generate new subsetted PLINK genotype files. Here, we specified the sample list to keep or remove by additionally passing the '--keep/remove' arguments, and the marker list to extract or exclude by passing the '--extract/exclude' arguments. We also passed the '--keep-allele-order' to keep ref/alt allele relationship intact.

(2) PLINK Allele frequency file. (*.FRQ.frq) We used the '--freq' argument to generate the allele frequency of the above subsetted PLINK genotype files. Likewise, we also passed the '--keep-allele-order' to keep ref/alt allele relationship intact. We renamed the output file to have the '*.FRQ.frq' suffix.

(3) Beagle files. (*.{bgl.phased,markers}) To keep phased information of the raw Han Chinese panel intact, we used the 'Pandas' python package to subset the beagle file ourselves. After loading the beagle file with the pandas 'read_csv()' function, we manually subsetted samples(columns) and markers(rows) mainly with 'isin()', 'iloc[]', and 'loc[]' functions. We named the subsetted output beagle file to have the '*.bgl.phased' suffix. Likewise, we subsetted the beagle markers with the Pandas package.

[Encoding(Allele character) Error]

Related to the 'P/A' encoding error, we deceived the beagle utility or other software by changing the encoding to 'G/C' or 'G/A' encodings. Making another copy of the original files, we applied that alternative encoding and gave them as input to those software. In output files, we changed back to the original encoding.

You might encounter similar errors with duplicated base positions because some software don't allow the duplicated BPs. We applied a similar strategy by allocating consecutive BPs temporarily.

HAN_China_CookHLA.zip

pyanne2000 commented 2 years ago

@pkuerten @Detao-Zhang @Aamiralizai

A SNP2HLA-format reference panel consists of 4 types of markers. (1) SNP marker(ex. 'rs1143146'), (2) Amino acid marker(ex. 'AAA-22_30018317_V'), (3) HLA marker(ex. 'HLA_A_0101'), and (4) Intragenic SNP marker(ex. 'SNP_A_30019207_A').

[Marker QC]

The raw Han Chinese panel has the SNP markers(the (1)) labeled with their BPs, not dbSNP labels. We first tried to find dbSNP labels for those SNP markers. After converting the raw beagle file('HAN.MHC.reference.panel.bgl.gz') to PLINK format file with the 'beagle2linkage.jar' utility module(https://faculty.washington.edu/browning/beagle_utilities/utilities.html), we matched(i.e. inner-join) the SNP markers to the dbSNP information based on the hg19. Here, we did liftover from hg18 to hg19 to the SNP markers of the Han Chinese panel temporarily because the dbSNP data was given as hg19. Then, we replaced the BP labels with the found dbSNP labels (while the panel is hg18 as it is). # of markers was reduced from 29948 to 27792 after this matching job. I attach the matching result('HAN.MHC.reference.panel.hg19.LabelTable.txt') in the zip file.

After the replacement, we excluded one amino acid marker, 'AA_A_272_30020145_X', which caused an error in the MakeGeneticMap. (27792 -> 27791)

Also, we excluded 5-digit HLA markers such as 'HLA_A_11110', 'HLA_DPB1_10501', ..., etc. (11 markers) The final # of markers is 27780.

[Sample QC]

We kept samples whose 2-field(i.e. 4-digit) HLA types for all 8 HLA genes are completely available. In the raw Han Chinese panel, some samples have only one HLA allele for a HLA gene. For example,

sample1 : (0101, 0103), (1503, 2706), (1504, 1202), (0103, 0201), (2101, 2301), (0105, 0301), (0202, 0319), (0801, 1001) sample2 : (0101, 0103), (1503, 2706), (1504, NA), (0103, 0201), (2101, 2301), (0105, 0301), (0202, 0319), (0801, 1001) sample3 : (0101, 0103), (1503, 2706), (1504, 1202), (0103, 0201), (2101, NA), (0105, 0301), (0202, NA), (0801, 1001) sample4 : (0101, 0103), (1503, 2706), (1504, 1202), (0103, 0201), (NA, NA), (0105, 0301), (0202, NA), (0801, 1001)

We kept only samples like the sample1. We removed samples whose HLA genes have any NA like the sample2, 3, and 4. The final # of samples is 9773.

I attach the final '.bim' and '.fam' files in the zip file. ('HAN_China_v6.NoWeird.{bim,fam}')

[File processing]

As mentioned above, we mainly used the PLINK and Beagle utility modules.

(1) PLINK Genotype files. (*.{bed,bim,fam}) We used the '--make-bed' argument to generate new subsetted PLINK genotype files. Here, we specified the sample list to keep or remove by additionally passing the '--keep/remove' arguments, and the marker list to extract or exclude by passing the '--extract/exclude' arguments. We also passed the '--keep-allele-order' to keep ref/alt allele relationship intact.

(2) PLINK Allele frequency file. (.FRQ.frq) We used the '--freq' argument to generate the allele frequency of the above subsetted PLINK genotype files. Likewise, we also passed the '--keep-allele-order' to keep ref/alt allele relationship intact. We renamed the output file to have the '.FRQ.frq' suffix.

(3) Beagle files. (.{bgl.phased,markers}) To keep phased information of the raw Han Chinese panel intact, we used the 'Pandas' python package to subset the beagle file ourselves. After loading the beagle file with the pandas 'read_csv()' function, we manually subsetted samples(columns) and markers(rows) mainly with 'isin()', 'iloc[]', and 'loc[]' functions. We named the subsetted output beagle file to have the '.bgl.phased' suffix. Likewise, we subsetted the beagle markers with the Pandas package.

[Encoding(Allele character) Error]

Related to the 'P/A' encoding error, we deceived the beagle utility or other software by changing the encoding to 'G/C' or 'G/A' encodings. Making another copy of the original files, we applied that alternative encoding and gave them as input to those software. In output files, we changed back to the original encoding.

You might encounter similar errors with duplicated base positions because some software don't allow the duplicated BPs. We applied a similar strategy by allocating consecutive BPs temporarily.

HAN_China_CookHLA.zip

Hi, I am still quite confused about how did you get this kind of information. I went through the paper as well as the supplementary files and I didn't see this kind of information. '''' [Sample QC] We kept samples whose 2-field(i.e. 4-digit) HLA types for all 8 HLA genes are completely available. In the raw Han Chinese panel, some samples have only one HLA allele for a HLA gene. For example,

sample1 : (0101, 0103), (1503, 2706), (1504, 1202), (0103, 0201), (2101, 2301), (0105, 0301), (0202, 0319), (0801, 1001) sample2 : (0101, 0103), (1503, 2706), (1504, NA), (0103, 0201), (2101, 2301), (0105, 0301), (0202, 0319), (0801, 1001) sample3 : (0101, 0103), (1503, 2706), (1504, 1202), (0103, 0201), (2101, NA), (0105, 0301), (0202, NA), (0801, 1001) sample4 : (0101, 0103), (1503, 2706), (1504, 1202), (0103, 0201), (NA, NA), (0105, 0301), (0202, NA), (0801, 1001)

We kept only samples like the sample1. We removed samples whose HLA genes have any NA like the sample2, 3, and 4. The final # of samples is 9773. '''' @WansonChoi

Aamiralizai commented 2 years ago

Hello @pyanne2000
we requested @WansonChoi to guide us how they used the HAN MHC reference data, he uploaded this tutorial about how to they did the QC and convert HAN MHC reference data to the CookHLA reference format. If you wanna use HAN MHC reference data then follow the above mention tutorial to convert the HAN MHC bgl format to CookHLA reference format. thses are the excluded markeres AA_A_272_30020145_X HLA_C_03100 HLA_B_15220 HLA_DRB1_14141 HLA_DQB1_03100 HLA_DPB1_10401 HLA_DPB1_10501 HLA_DPB1_10601 HLA_DPB1_10701 HLA_DPB1_13501 HLA_DPB1_13801 HLA_A_11110

while to remove the samples as mentioned above. use sample IDs from this "HAN_China_v6.NoWeird.fam" and and SNPs rs ID from the bam file you will get the final sample size and number on SNPs as mentioned in above tutorial

pyanne2000 commented 2 years ago

@Aamiralizai Yeah, I understand what you are talking about. But what I am confused with is why the author filtered those samples out? I mean, we do not have that kind of information. --> the HLA types for the individuals.

Detao-Zhang commented 2 years ago

@WansonChoi Thanks for sharing your QC procedure. I only removed samples like sample4 and didn't label the SNPs with dbSNP dataset, so my results are different from yours. Thanks again for your brilliant work here!

@Aamiralizai Sorry I didn't come back to this issue and missed your question before.

Detao-Zhang commented 2 years ago

@pyanne2000 We are converting the original HAN MHC reference panel (http://gigadb.org/dataset/100156) to CookHLA format here. The author filtered out some samples in the panel because they do not have two alleles for a HLA gene. You can find individual HLA types in the original Beagle panel file.

gitterspec commented 10 months ago

I've been doing something similar these past two weeks. Other files can be generated with plink and beagle utilities from the phased .bgl and .marker files. You need to convert the .bgl file to .ped format firstly and then to the plink binaries. There are also some small tweaks.

I'm trying to do the same here. After running beagle2linkage, I tried make-bed to create binaries, but am missing a map file. Can you advise on how to generate this? Thank you.

gitterspec commented 10 months ago

Adding @WansonChoi in case he can share some advice. Thank you by the way for the incredible work.