FRED-2 / OptiType

Precision HLA typing from next-generation sequencing data
BSD 3-Clause "New" or "Revised" License

Ambiguous allele combinations or ambiguous results in general #22

Closed messersc closed 8 years ago

messersc commented 8 years ago

Hello all,

I was wondering how OptiType handles ambiguous results. I would expect that the solver returns something along the lines of "no single best solution found" or similar. Or will all best results be reported?

Further, is this a problem at all? You have benchmarked with quite a lot of datasets: have you ever seen such a case, or is it too rare to worry about? I only just saw the -e switch, so I can't say yet for our benchmarks.

Some background

From "Ambiguous allele combinations in HLA Class I and Class II sequence-based typing: when precise nucleotide sequencing leads to imprecise allele identification" (http://dx.doi.org/10.1186%2F1479-5876-2-30):

However, one of the inherent problems with this typing method is the interpretation of ambiguous allele combinations which occur when two or more different allele combinations produce identical sequences.

Example: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC517951/figure/F1/

The complete list can be found here: http://www.ebi.ac.uk/ipd/imgt/hla/ambig.html

andras86 commented 8 years ago

Hi!

Some of these cases are solvable by OptiType, others are not. For example, let's take a look at the very first ambiguous combination entry in the IMGT xls. I extracted the relevant segment towards the end of exon 2:

A*01:01:01:01 ----------------------
A*01:13       -G----------------T-G-
versus
A*01:17       -G--------------------
A*01:28       ------------------T-G-

These would be indistinguishable in a simple alignment without phasing, but OptiType has no problem with them as long as some reads span the distinguishing positions. A read covering this whole segment from any of these four alleles would never be assigned to any of the other three, so it wouldn't confuse OptiType.
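To make the phasing argument concrete, here is a minimal Python sketch. It is not OptiType code: the `allele` helper, the 0-based offsets and the 22-column width are my own assumptions, chosen only to reproduce the alignment excerpt above. It shows that the two pairs give the same per-column picture when phase is ignored, while a read spanning the segment matches exactly one allele and a shorter read does not.

```python
SEGMENT_LENGTH = 22  # assumed width of the alignment excerpt above


def allele(differences):
    """Build an alignment row; '-' means 'same as the reference'."""
    row = ["-"] * SEGMENT_LENGTH
    for offset, base in differences.items():
        row[offset] = base
    return "".join(row)


# Offsets are 0-based positions within the excerpt (illustrative only).
ALLELES = {
    "A*01:01:01:01": allele({}),
    "A*01:13": allele({1: "G", 18: "T", 20: "G"}),
    "A*01:17": allele({1: "G"}),
    "A*01:28": allele({18: "T", 20: "G"}),
}


def unphased_genotype(name_a, name_b):
    """Per-column set of symbols: all that an unphased call can see."""
    return [frozenset(col) for col in zip(ALLELES[name_a], ALLELES[name_b])]


def compatible_alleles(read, start=0):
    """Alleles that match the read exactly at the given start offset."""
    end = start + len(read)
    return sorted(name for name, seq in ALLELES.items() if seq[start:end] == read)


pair_1 = ("A*01:01:01:01", "A*01:13")
pair_2 = ("A*01:17", "A*01:28")

# Without phase information, both pairs look identical.
same_unphased = unphased_genotype(*pair_1) == unphased_genotype(*pair_2)

# A read spanning the whole segment pins down a single allele ...
spanning_hits = compatible_alleles(ALLELES["A*01:13"])

# ... whereas a read covering only the first 10 columns stays ambiguous.
short_hits = compatible_alleles(ALLELES["A*01:13"][:10])

print(same_unphased, spanning_hits, short_hits)
```

The last case is the "no spanning reads" situation: the short read fits both A*01:13 and A*01:17, so it cannot tell the two pairs apart.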

Of course if the distinguishing positions were so far apart that not even paired-end reads could span them, OT would be out of luck and it would be impossible to deduce the correct pair from the data with any method.

OptiType doesn't attempt to resolve such cases; it just reports one solution (as ILP solvers do). You can indeed use the -e enumeration option and check whether the first two (or more) results are tied. Enumeration is implemented by excluding specific (i.e. previously found) solutions with additional constraints.
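As a rough illustration of "enumerate, then look for ties", here is a small Python sketch. It swaps the ILP for a brute-force search over allele pairs, so it is not how OptiType is implemented. The read hits are invented toy data, the shortened allele names are hypothetical, and the objective is simplified to "number of reads explained by the pair". The `excluded` set plays the role of the additional constraints that rule out previously found solutions.

```python
from itertools import combinations

# Invented toy data: read id -> alleles the read matches equally well.
# None of these reads spans both distinguishing regions.
READ_HITS = {
    "r1": {"A*01:13", "A*01:17"},
    "r2": {"A*01:01", "A*01:28"},
    "r3": {"A*01:13", "A*01:28"},
    "r4": {"A*01:01", "A*01:17"},
}

ALLELE_NAMES = sorted(set().union(*READ_HITS.values()))
CANDIDATES = list(combinations(ALLELE_NAMES, 2))  # heterozygous pairs only


def explained(solution):
    """Number of reads that match at least one allele in the solution."""
    chosen = set(solution)
    return sum(1 for hits in READ_HITS.values() if hits & chosen)


def enumerate_solutions(candidates, k):
    """Return up to k best solutions, excluding each one once it is found."""
    excluded = set()
    results = []
    for _ in range(k):
        feasible = [s for s in candidates if s not in excluded]
        if not feasible:
            break
        best = max(feasible, key=explained)
        results.append((best, explained(best)))
        excluded.add(best)  # stands in for an exclusion constraint
    return results


def has_tie(results):
    """True if the two top-ranked solutions have the same objective value."""
    return len(results) >= 2 and results[0][1] == results[1][1]


results = enumerate_solutions(CANDIDATES, 3)
tied = has_tie(results)

for solution, score in results:
    print(solution, score)
print("tie between top solutions:", tied)
```

With this toy data the two top-ranked pairs explain the same number of reads, which is the situation where a single reported solution hides the ambiguity.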

Ties happen sometimes, and they may be due not only to indistinguishable combinations à la IMGT list but also to missing coverage at some positions. The 1000 Genomes benchmark, with very spotty coverage on HLA, had several of these, but I can't give you any numbers right now. I'm considering tackling this issue and at least providing a warning in the future.

Oh, and since OT only considers exons 2 and 3 (the vast majority of alleles simply have no sequence information outside them), ambiguities due to variants outside the binding groove are lost on it altogether.