brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License
270 stars 36 forks source link

Somalier underestimates relatedness if low sites overlap #55

Closed fgvieira closed 4 years ago

fgvieira commented 4 years ago

Hi,

I am running some different samples (RNA vs cfDNA) through somalier and one of them came out with a very low relatedness, even though 77 (out of 78) sites are IBS2:

#sample_a   sample_b   relatedness  ibs0  ibs2  hom_conc hets_a  hets_b  shared_hets  hom_alts_a  hom_alts_b  shared_hom_alts  n    x_ibs0  x_ibs2  expected_relatedness
165962      778965     0.101        0     77    0.193    228     556     23           166         866         32               78   0       3       1.0

I think that, even though there are a lot of heterozygote sites, they do not actually overlap (the reason why n=78). On the RNA data probably because of non-expressed genes, and on the cfDNA prob due to the low coverage.

According to somalier's paper, relatedness is calculated as:

(shared-hets(i,j) - 2 * ibs0(i, j)) / min(hets(i), hets(j))

However, in this case, even though one sample has 228 hets and the other 556, they only overlap on 78 snps. So, shouldn't the formula be a bit more like:

(shared-hets(i,j) - 2 * ibs0(i, j)) / min(hets_in_common_pos(i), hets_in_common_pos(j))

where hets_in_common_pos(i) stands for the number of hets in sample "i" among the positions shared ("n"). This change should have no effect when comparing the same type of seq (since the overlap should be quite high) and improve comparisons of different types of seq.

brentp commented 4 years ago

I see what you mean. I am looking into this.

brentp commented 4 years ago

@fgvieira you are absolutely right. would you check this binary and verify that it works for you to fix this case?

somalier.gz

fgvieira commented 4 years ago

@brentp just checked the new version and it fixes the issue I was seeing. for this pair of samples relatedness went from 0.101 to 0.979 :+1: