I noticed that the collaborators are using the somatic VCF files to look for somatic mutations in the HLA genes. These genes are known to be problematic to align against the human reference genome because they are highly polymorphic, with a large variety of alleles in the human population. A while back, Polysolver was developed to tackle this challenge; see "Comprehensive analysis of cancer-associated somatic mutations in class I HLA genes", Nature Biotechnology, 2015. From the paper:
The human reference genome, however, has a single sequence for each HLA gene and would likely misrepresent the true alleles in the individual, thereby causing suboptimal alignments. Consequently, to accurately detect somatic mutations in the HLA genes, one needs to first accurately align all reads originating from this region in both the tumor and matched normal samples and only then to apply somatic mutation detection tools. To this end, we developed the algorithm Polysolver (polymorphic loci resolver), which enables high-precision HLA typing and a subsequent mutation detection pipeline that uses the inferred alleles as a basis for high-fidelity detection of mutations in HLA genes.
Such a strategy could be useful if implemented as a standard pipeline step and applied to the reads already extracted to ../HLA_fastq.
Also, the pipeline currently uses bwakit, but it seems to treat all of the HLA contigs as independent sequences, whereas they are alternative sequences of the same genomic region. As a result, many of the reported variants are likely false positives, with the same call repeated across many HLA-A allele contigs:
$ zgrep '^HLA.*PASS' CSCC_0001-M1_CSCC_0001-B1.filtered.vcf.gz | cut -d $'\t' -f 1-7 | head -n 30
HLA-A*01:01:01:01 1554 . G GACAC . PASS
HLA-A*01:01:01:02N 1471 . G GACAC . PASS
HLA-A*01:01:38L 1554 . G GACAC . PASS
HLA-A*01:02 1554 . G GACAC . PASS
HLA-A*01:03 1554 . G GACAC . PASS
HLA-A*01:04N 1470 . G GACAC . PASS
HLA-A*01:09 1254 . G GACAC . PASS
HLA-A*01:11N 1554 . G GACAC . PASS
HLA-A*01:14 1254 . G GACAC . PASS
HLA-A*01:16N 1276 . G GACAC . PASS
HLA-A*01:20 1254 . G GACAC . PASS
HLA-A*23:01:01 247 . G C . PASS
HLA-A*23:01:01 294 . G C . PASS
HLA-A*24:02:01:01 247 . G C . PASS
HLA-A*24:02:01:01 294 . G C . PASS
HLA-A*24:02:01:02L 247 . G C . PASS
HLA-A*24:02:01:02L 294 . G C . PASS
HLA-A*24:02:03Q 233 . G C . PASS
HLA-A*24:02:03Q 280 . G C . PASS
HLA-A*24:02:10 227 . G C . PASS
HLA-A*24:02:10 274 . G C . PASS
HLA-A*24:03:01 247 . G C . PASS
HLA-A*24:03:01 294 . G C . PASS
HLA-A*24:07:01 247 . G C . PASS
HLA-A*24:07:01 294 . G C . PASS
HLA-A*24:08 247 . G C . PASS
HLA-A*24:08 294 . G C . PASS
HLA-A*24:09N 247 . G C . PASS
HLA-A*24:09N 294 . G C . PASS
HLA-A*24:10:01 247 . G C . PASS
Since the human genome is diploid, a person should carry no more than two HLA-A alleles, yet the output above reports PASS variants on far more than two HLA-A contigs.
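As a rough way to quantify this, the sketch below counts how many distinct HLA contigs per gene carry PASS calls and flags any gene with more than two. It is only a sanity check under a few assumptions: contig names follow the `HLA-A*01:01:01:01` pattern seen in the output above, FILTER is the seventh tab-separated column as in standard VCF, and the function names (`pass_calls_per_hla_gene`, `genes_exceeding_ploidy`, `read_vcf`) are made up for illustration. It does not perform HLA typing and is not a replacement for Polysolver.

```python
import gzip
from collections import defaultdict


def read_vcf(path):
    """Yield lines from a plain or gzip-compressed VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        yield from handle


def pass_calls_per_hla_gene(vcf_lines):
    """Map each HLA gene (e.g. 'HLA-A') to the set of contigs with PASS calls."""
    contigs_by_gene = defaultdict(set)
    for line in vcf_lines:
        # Skip header lines and records on non-HLA contigs.
        if line.startswith("#") or not line.startswith("HLA-"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 7 (index 6) is FILTER in the VCF fixed columns.
        if len(fields) < 7 or fields[6] != "PASS":
            continue
        contig = fields[0]               # e.g. HLA-A*01:01:01:01
        gene = contig.split("*", 1)[0]   # e.g. HLA-A
        contigs_by_gene[gene].add(contig)
    return contigs_by_gene


def genes_exceeding_ploidy(contigs_by_gene, ploidy=2):
    """Return genes whose PASS calls span more contigs than the ploidy allows."""
    return {
        gene: sorted(contigs)
        for gene, contigs in contigs_by_gene.items()
        if len(contigs) > ploidy
    }
```

It could be run on the file from the command above, for example:

```python
grouped = pass_calls_per_hla_gene(
    read_vcf("CSCC_0001-M1_CSCC_0001-B1.filtered.vcf.gz")
)
for gene, contigs in genes_exceeding_ploidy(grouped).items():
    print(gene, len(contigs))
```

Any gene listed here has calls spread over more allele contigs than one diploid sample can carry, which points to redundant or misplaced alignments rather than real somatic variants.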