10XGenomics / scHLAcount

Count HLA alleles in single-cell RNA-seq data
MIT License
58 stars 12 forks source link

How does scHLAcount determine a pairs #15

Closed tuqiang2014 closed 3 years ago

tuqiang2014 commented 3 years ago

Hello,

In scHLAcount outs files pseudoaligner_tmp/pairs.tsv, record the allele_1 allele_2 and reads_explained

$ head pairs.tsv 
allele_1    allele_2    reads_explained
A*26:07:01  A*03:71 0.25329137721155076
A*01:01:87  A*31:20 0.25248054475439036
A*31:20 A*11:01:06  0.2514857873662506
A*26:07:01  A*01:01:87  0.25013054312401
A*26:07:01  A*31:20 0.24427154232024006
B*13:114    B*48:01:01:04   0.4847153724126729
B*13:114    B*15:12:03  0.4660226489127016
B*13:114    B*35:361    0.4633328970625956
B*13:114    B*57:12 0.4559425408898631

I want to know how determine a pairs in scHLAcount ??

cdarby commented 3 years ago

This output file appears when you do not provide genotypes and scHLAcount runs its genotyping algorithm. Here is some text from our README page:

For each of the eight HLA genes included in the graph, we rank the alleles of that gene by weight in the EM. To determine the diploid genotype, we consider all pairs of alleles that explained at least 100 reads (a pair of alleles explains a read if the read pseudoaligned to an equivalence class containing either allele sequence).
...
For each gene, we report the top 5 allele pairs ranked by number of reads explained, and the top 10 alleles ranked by EM weight in an auxiliary output file.

So basically, for each gene, we first filter all the possible alleles based on EM score to select top candidates. Then we examine all pairs of all the candidates. The file you pasted contains the top 5 pairs, ranked by number of reads explained. The purpose of this auxiliary output file is to help you gauge whether you trust the genotype scHLAcount called. For example, for your HLA-A results you pasted, the top 5 genotypes reported explained a similar number of reads, indicating that there is considerable ambiguity and perhaps there is not enough evidence in your data to call genotypes reliably based on the scHLAcount algorithm. Just to give a made-up example, if the first genotype explained 90% of reads and the other 4 explained, say, less than 50%, you would be more confident that the first genotype is correct compared to the proposed alternatives.