Special Variant Calling for HLA Genes Not Performed

DarioS commented 3 years ago

I noticed that the collaborators are using the somatic VCF files to look for somatic mutations in the HLA genes. These genes are known to be problematic to align to the human reference genome because of the large variety of alleles in the human population. A while back, PolySolver was developed to tackle this challenge; see "Comprehensive Analysis of Cancer-associated Somatic Mutations in Class I HLA genes", Nature Biotechnology, 2015. Quoting the paper:

"The human reference genome, however, has a single sequence for each HLA gene and would likely misrepresent the true alleles in the individual, thereby causing suboptimal alignments. Consequently, to accurately detect somatic mutations in the HLA genes, one needs to first accurately align all reads originating from this region in both the tumor and matched normal samples and only then to apply somatic mutation detection tools. To this end, we developed the algorithm Polysolver (polymorphic loci resolver), which enables high-precision HLA typing and a subsequent mutation detection pipeline that uses the inferred alleles as a basis for high-fidelity detection of mutations in HLA genes."

Such a strategy could be useful if implemented as standard and applied to the reads already extracted to ../HLA_fastq.
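To make the "inferred alleles as a basis" idea a bit more concrete, here is a minimal sketch of just one step: picking out the sequences of a person's typed alleles from a FASTA of HLA allele sequences, so that reads could later be realigned against only those. This is not how Polysolver is implemented. The function name `select_alleles`, the toy sequences and the "inferred" allele set are all made up for illustration, and the HLA typing itself is assumed to have been done by some other tool.

```python
from typing import Dict, Iterable, Set


def select_alleles(fasta_lines: Iterable[str], wanted: Set[str]) -> Dict[str, str]:
    """Return {allele_name: sequence} for FASTA records whose name is in `wanted`.

    Hypothetical helper for illustration only; the record name is taken to be
    the first whitespace-delimited token after '>'.
    """
    selected: Dict[str, str] = {}
    current = None  # name of the record being collected, or None if skipped
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            token = parts[0] if parts else ""
            current = token if token in wanted else None
            if current is not None:
                selected[current] = ""
        elif current is not None:
            # Sequence lines of a wanted record are concatenated.
            selected[current] += line
    return selected


# Toy FASTA with made-up sequences (real HLA allele sequences are much longer).
toy_fasta = [
    ">HLA-A*01:01:01:01 toy record",
    "ACGTAC",
    "GGTT",
    ">HLA-A*24:02:01:01 toy record",
    "TTGACA",
    ">HLA-A*23:01:01 toy record",
    "CCCGGG",
]

# Assumed output of a separate HLA typing step: at most two alleles per gene.
inferred = {"HLA-A*01:01:01:01", "HLA-A*24:02:01:01"}

personal_ref = select_alleles(toy_fasta, inferred)
for allele, seq in personal_ref.items():
    print(allele, len(seq))
```

The resulting per-sample reference would then be the target for realigning the tumor and matched normal reads before somatic calling; that part is left out here.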

Also, the pipeline currently uses bwakit, but it seems to treat all of the HLA contigs as independent sequences, whereas they are alternative sequences of the same genomic region. As a result, many false-positive variants are reported:

$ zgrep '^HLA.*PASS' CSCC_0001-M1_CSCC_0001-B1.filtered.vcf.gz | cut -d $'\t' -f 1-7 | head -n 30
HLA-A*01:01:01:01       1554    .       G       GACAC   .       PASS
HLA-A*01:01:01:02N      1471    .       G       GACAC   .       PASS
HLA-A*01:01:38L 1554    .       G       GACAC   .       PASS
HLA-A*01:02     1554    .       G       GACAC   .       PASS
HLA-A*01:03     1554    .       G       GACAC   .       PASS
HLA-A*01:04N    1470    .       G       GACAC   .       PASS
HLA-A*01:09     1254    .       G       GACAC   .       PASS
HLA-A*01:11N    1554    .       G       GACAC   .       PASS
HLA-A*01:14     1254    .       G       GACAC   .       PASS
HLA-A*01:16N    1276    .       G       GACAC   .       PASS
HLA-A*01:20     1254    .       G       GACAC   .       PASS
HLA-A*23:01:01  247     .       G       C       .       PASS
HLA-A*23:01:01  294     .       G       C       .       PASS
HLA-A*24:02:01:01       247     .       G       C       .       PASS
HLA-A*24:02:01:01       294     .       G       C       .       PASS
HLA-A*24:02:01:02L      247     .       G       C       .       PASS
HLA-A*24:02:01:02L      294     .       G       C       .       PASS
HLA-A*24:02:03Q 233     .       G       C       .       PASS
HLA-A*24:02:03Q 280     .       G       C       .       PASS
HLA-A*24:02:10  227     .       G       C       .       PASS
HLA-A*24:02:10  274     .       G       C       .       PASS
HLA-A*24:03:01  247     .       G       C       .       PASS
HLA-A*24:03:01  294     .       G       C       .       PASS
HLA-A*24:07:01  247     .       G       C       .       PASS
HLA-A*24:07:01  294     .       G       C       .       PASS
HLA-A*24:08     247     .       G       C       .       PASS
HLA-A*24:08     294     .       G       C       .       PASS
HLA-A*24:09N    247     .       G       C       .       PASS
HLA-A*24:09N    294     .       G       C       .       PASS
HLA-A*24:10:01  247     .       G       C       .       PASS

Since the human genome is diploid, a person should carry no more than two HLA-A alleles, yet the calls above are spread across many more HLA-A contigs than that.
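The diploid sanity check can be sketched roughly as follows: count, per HLA gene, how many distinct allele contigs carry PASS calls and flag any gene with more than two. This assumes tab-separated VCF records with the contig in column 1 and FILTER in column 7, and contig names of the form `GENE*allele` as in the listing above. The function names are hypothetical, and the sample records are a handful of rows copied from that listing.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def pass_alleles_per_gene(vcf_lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each HLA gene to the set of allele contigs that have PASS calls."""
    alleles: Dict[str, Set[str]] = defaultdict(set)
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip VCF header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # not a complete VCF record
        chrom, filt = fields[0], fields[6]
        if chrom.startswith("HLA-") and filt == "PASS":
            gene = chrom.split("*", 1)[0]  # "HLA-A*01:02" -> "HLA-A"
            alleles[gene].add(chrom)
    return dict(alleles)


def genes_exceeding_ploidy(alleles: Dict[str, Set[str]], ploidy: int = 2) -> List[str]:
    """Return the genes with calls on more allele contigs than the ploidy allows."""
    return sorted(gene for gene, names in alleles.items() if len(names) > ploidy)


# A few records taken from the listing above, with tabs restored.
records = [
    "HLA-A*01:01:01:01\t1554\t.\tG\tGACAC\t.\tPASS",
    "HLA-A*01:02\t1554\t.\tG\tGACAC\t.\tPASS",
    "HLA-A*23:01:01\t247\t.\tG\tC\t.\tPASS",
    "HLA-A*23:01:01\t294\t.\tG\tC\t.\tPASS",
    "HLA-A*24:02:01:01\t247\t.\tG\tC\t.\tPASS",
]

per_gene = pass_alleles_per_gene(records)
for gene, names in sorted(per_gene.items()):
    print(gene, len(names), "allele contigs with PASS calls")
print("Genes exceeding diploid expectation:", genes_exceeding_ploidy(per_gene))
```

On a real file the lines could be read with `gzip.open(path, "rt")` instead of the hard-coded list.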

tracychew commented 3 years ago

Hi Dario, thanks for sharing the article, I'll look into this.

DarioS commented 3 years ago

PolySolver is sadly an undocumented Docker container. Also, you could add a supportive comment to my Mutect2 feature request.

Sydney-Informatics-Hub / Somatic-ShortV

Special Variant Calling for HLA Genes Not Performed #1