DaehwanKimLab / hisat-genotype

GNU General Public License v3.0
23 stars 15 forks source link

HISAT-genotype model for HLA allele abundance estimates #28

Closed rtobes closed 3 years ago

rtobes commented 3 years ago

In your last paper about HISAT-genotype (https://pubmed.ncbi.nlm.nih.gov/31375807/) you described the algorithm of assignment of HLA allele abundance based on a strategy of Expectation Maximization:

"HISAT-genotype applies the following statistical model in each of the two steps to find maximum likelihood estimates of abundance through an EM algorithm (33). We previously implemented an EM solution in our centrifuge system (34), and we used a similar algorithm in HISAT-genotype, with modifications to the variable definitions as follows........"

Analyzing the results that HISAT-genotype provides for each HLA-gene it appears that your model considers that it is possible to have any number of alleles per gene following the same model that you applied in Centrifuge system that was a program designed for metagenomics taxonomic abundance profile analysis. In metagenomics you can find any number of different organisms in the sample but in HLA you must define a model in which only 2 alleles per gene are possible.

Please, could you explain me how do you have modeled in HISAT-genotype the restriction to a maximum of 2 alleles when you estimate HLA alleles abundance?

chbe-helix commented 3 years ago

Hi Rtobes!

Thanks for the question. The EM algorithm is not restricted to 2 alleles in HISAT-genotype. As I did not design the algorithm myself, I will try to explain what I understand to be the rational:

In most cases, when there is enough high quality sequencing data, it is not necessary to restrict the EM as it will converge to two naturally. There are cases where multiple alleles are reported at different abundances and in these situations I feel it is disadvantageous to restrict the EM as we would loose information to the confidence of the call. In these cases the user can either use a built in script called hisatgenotype_toolkit parse-results to assist in simplifying the output or limiting the resolution of the call, make their own call based on the information they have available, or attempt to assemble the genes and leverage the diploid phasing algorithm we implemented which does restrict the number of reported alleles.

I hope this helps answer your questions. If you need further information, I'd be happy to discuss further over email or set-up a call. Thanks!

Thanks, Chris

rtobes commented 3 years ago

Hi Chris,

Thanks so much for your response.

The problem is not simplifying the output but to know precisely the meaning of the output to interpret it correctly. I think that it is difficult that the same model fits for the study of HLA allele frequencies in populations and for cases of genotyping one genome.

I also wonder if the value used for setting -k parameter at the HISAT mapping step could be also especially important for this case of one genome genotyping. If I'm right the default value is -k 5 and then, in HLA genes, many reads would not be mapped to all the reference sequences that each read matches (with the same maximal score). The not highly specific model and the not complete mapping to all the reference alleles (when complete multi-mapping is higher than 5) could bias the final genotyping results. I would like to know your opinion and advice about this.

I appreciate so much your kind explanations

Thanks!

Raquel

chbe-helix commented 3 years ago

Hi Raquel,

Fortunately, the EM model we employ is highly useful in both populations and individuals. The model functions on assigning the reads to a particular reference category and calculating the resulting frequencies. For populations, the assumption is each individual contributes equal reads to the pool thus we can interpret the frequencies as allele frequencies in the whole population. For individuals, (as is the case of HISATgenotype) we interpret the frequencies similarly with the user interpreting the results based on the input source.

For example, if the user HLA-types germline whole genome sequencing, we expect either one or two possible allele results (homozygous and heterozygous) with frequencies near 100% or 50%. Conversely, should the user attempt HLA-typing a high ploidy sample, the results can vary and may reflect any genetic duplication that occurs such as a duplication that could result in a 66/33 split in frequencies. Furthermore, as Dr. Daehwan Kim notes, the HLA database (IMGT/HLA) is still far from complete, so for samples that have HLA alleles not present in the database, HISAT-genotype tends to report more than two alleles with high probability (something more than 20%). If you see more than two alleles with such high probability, then it could mean that the sample might include unknown alleles, sequencing data coverage might be low, and/or some sequencing reads might be incorrectly aligned.

Simplistically, we can look at the individual frequencies as the frequencies of the reads.

In regards to your second point, we allow 10 alignments per read and align to a meta-allele-gene-graph which represents all alleles for a gene in a single reference. We can then assign the read to a specific allele based on how it aligns in this graph thus eliminating the need to adjust to a higher -k to capture all alignments.

I hope this helps!

Thanks, Chris

rtobes commented 3 years ago

Thank you

Raquel

jeffr100 commented 1 year ago

I have found that for some samples, when I run "hisatgenotype_toolkit parse-results" I only get one call. Does this mean homozygous, or just not enough reads?