biosinodx / SCcaller

Single Cell Caller (SCcaller) - Identify single nucleotide variations (SNVs) from single cell sequencing data
GNU Affero General Public License v3.0

variant allele fraction for homo-alt genotype is always 1? #3

Closed noushinquazi closed 5 years ago

noushinquazi commented 6 years ago

I just want to reconcile my understanding of the paper's math with the code: in the method sc_caller, when calculating the likelihood of the homo-alt genotype, you always set the variant allele fraction to 1. However, if there are actually more reference reads, wouldn't the variant allele fraction be 0 instead? This is what the paper says as well.

biosinodx commented 6 years ago

Hi Noushin, thanks for the question. The likelihood of the homo-ref genotype is calculated with a variant allele fraction of 0.125 (line 321 of the SCcaller v1.2 script), which reflects amplification error. Best wishes, Xiao

noushinquazi commented 6 years ago

Sorry, I meant the homo-alt genotype.

biosinodx commented 6 years ago

Variant allele fraction for ref/ref genotype should be 1. Xiao

noushinquazi commented 6 years ago

In the paper, it says the variant allele fraction should be 1 for alt/alt if there are more variant reads, else 0. I am looking at equations 5 and 6.

biosinodx commented 6 years ago

Oh, thanks for notifying! It should be 1 for alt/alt. There were a couple of issues after converting to the final PDF, which unfortunately we didn't notice at the time. Related to this, equation 7 should be:

p = f·(e/3) + (1−f)·(1−e), if b = r
p = f·(1−e) + (1−f)·(e/3), if b = m
p = e/3, for the other bases

Please always rely on the current code, and let me know if there are other issues. Thanks! Xiao
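For concreteness, the per-base probability in equation 7 can be sketched in Python. The function name and argument conventions below are illustrative only, not taken from the SCcaller source; f is the variant allele fraction and e the base-calling error probability:

```python
def base_prob(b, f, e):
    """Probability of observing base b, given variant allele fraction f
    and sequencing error rate e (equation 7 as corrected above).

    b: 'r' (reference base), 'm' (mutant/variant base), or any other base.
    """
    if b == 'r':
        # read drawn from the variant allele but miscalled as r (e/3),
        # or drawn from the reference allele and called correctly (1 - e)
        return f * e / 3 + (1 - f) * (1 - e)
    if b == 'm':
        # read drawn from the variant allele and called correctly,
        # or drawn from the reference allele but miscalled as m
        return f * (1 - e) + (1 - f) * e / 3
    # either of the two remaining bases, reachable only by miscall
    return e / 3

# With f = 1 (alt/alt), a variant read is observed with probability
# 1 - e, and a reference read only via sequencing error:
e = 1e-4
print(base_prob('m', 1.0, e))  # 1 - e
print(base_prob('r', 1.0, e))  # e / 3
```

Note that the four base probabilities sum to 1 for any f, which is a quick sanity check on the reconstructed equation.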

noushinquazi commented 6 years ago

Ah, OK! I think there was one other difference, and I wasn't sure whether it was an issue with the paper or the code: equation 1, the kernel-weighted average of major allele fractions, is implemented differently in the code. In the paper, the numerator is the sum of kernel × major allele fraction, but in the code the numerator is the sum of kernel × major allele fraction × total count. The total count also appears as a term in the denominator, whereas it is absent in the paper.

biosinodx commented 6 years ago

Thank you for pointing it out! The total count should be considered. It seems I should write a revised description of the method and upload it to GitHub. Xiao

noushinquazi commented 6 years ago

Could you explain why the total count should be considered? My understanding is that the point is to estimate the allelic bias using the Nadaraya-Watson kernel-weighted average over all the major allele fractions from sites within a specified window. Why does the total allele count have to factor into that?

biosinodx commented 6 years ago

Yes, it was for a practical reason. Consider the following situation: there are two SNPs close to a target position, both at the same distance from the target. One SNP has an allele fraction of 50% and a read count of 20 (good sequencing depth), but the other has an allele fraction of 100% and a read count of 2 (very poor sequencing depth). In this case, I trust the allele fraction estimated from the first SNP more than from the second. Xiao
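The depth-weighting rationale above can be sketched as a read-count-weighted Nadaraya-Watson estimate. The Gaussian kernel, function name, and parameters here are assumptions for illustration, not the exact SCcaller implementation:

```python
import math

def weighted_bias(target_pos, snps, lam):
    """Nadaraya-Watson estimate of the major allele fraction at target_pos,
    weighting each SNP by kernel distance *and* read depth.

    snps: list of (position, major_allele_fraction, read_depth) tuples.
    lam:  kernel bandwidth (the lambda discussed in this thread).
    """
    num = den = 0.0
    for pos, maf, depth in snps:
        # distance-based kernel weight, scaled by total read count
        w = math.exp(-((pos - target_pos) / lam) ** 2) * depth
        num += w * maf
        den += w
    return num / den

# The example from the discussion: two SNPs equidistant from the target,
# one with AF 50% at depth 20, one with AF 100% at depth 2.
est = weighted_bias(1000, [(900, 0.5, 20), (1100, 1.0, 2)], 500)
# The depth-20 SNP dominates: (20*0.5 + 2*1.0) / (20 + 2) ~ 0.545
```

Because both SNPs sit at the same distance, the kernel terms cancel and the estimate reduces to a pure depth-weighted average, landing much closer to 50% than to 100%.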

noushinquazi commented 6 years ago

I see, that makes sense. Thanks!

noushinquazi commented 6 years ago

I hate to spam, but I noticed something else in a different part of the program: in gd_tracking, shouldn't the logic consider SNPs that are on the same chromosome and have a position less than or equal to the target position plus HALF of lambda, since lambda defines the width of the entire window?

biosinodx commented 6 years ago

It is not spam at all :) Half a lambda or a whole lambda may not really matter for now, because the value of lambda is still to some extent arbitrary, although there are biological reasons behind the choice of the value. I am actually considering implementing a method to estimate lambda from the data itself. In that case, the choice between lambda and half a lambda will matter.

noushinquazi commented 6 years ago

I see. I was thinking that perhaps the choice of half or whole lambda in the kernel might not matter because the bias estimator's weights are normalized anyway.

noushinquazi commented 6 years ago

Just to clarify, I was pointing out lambda because of its use in two different parts of the code: first in gd_tracking, where you consider SNPs that are within plus/minus lambda of the target position (which implies a window of 2*lambda), whereas in the kernel equation only lambda is present in the denominator. However, I believe the bias estimator normalizes all the weights.

noushinquazi commented 6 years ago

A different question: why did you choose to ignore base quality when counting up the reference and alternate bases for the SNPs in the function golden_hetero?

biosinodx commented 6 years ago

In the function golden_hetero, I assume that to some extent we already know the genotype from bulk sequencing of the cell population and are just confirming its presence in the single cell, so I kept it simple. A related comment: using higher-quality heterozygous SNPs collected from bulk sequencing may increase the accuracy of the single-cell results.

noushinquazi commented 6 years ago

OK, I understand you're busy, but I have another question related to base quality: what was your reason for using 40 as the base quality cutoff? Is it some sort of standard?

biosinodx commented 6 years ago

Sure. Scores of 30 and 40 are both commonly used for base quality. I chose 40 as the default because mutations are rare in the genome and I wish to avoid as many artifacts as possible.
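For context on what those cutoffs mean: a Phred-scaled base quality Q corresponds to an error probability of 10^(-Q/10), so Q30 and Q40 imply roughly 1 miscall in 1,000 and 10,000 bases respectively. A one-liner makes the relationship concrete:

```python
def phred_to_error(q):
    """Convert a Phred quality score to the implied base-calling error probability."""
    return 10 ** (-q / 10)

print(phred_to_error(30))  # 0.001  -> ~1 error per 1,000 base calls
print(phred_to_error(40))  # 0.0001 -> ~1 error per 10,000 base calls
```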

noushinquazi commented 6 years ago

More questions about filters: what is it about dbSNPs vs. control hSNPs that makes them need a filter of 20 minimum reads? Is there a biological reason why there need to be at least 4 variant reads to count an hSNP? I wonder if the minimum variant reads could scale with the number of reads an hSNP has...

biosinodx commented 6 years ago

First, about the hSNPs: as mentioned earlier, I think better-quality hSNPs will provide better results, and 20 minimum reads indicates good quality. The full criteria for hSNPs can be improved when more testing data are available. Second, about the 4 minimum variant reads: this is to ensure the quality of the variant (to make sure it is not a sequencing error). I took this criterion from the VarScan paper.
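A minimal sketch of such a read-count filter, with illustrative names and the thresholds discussed above (not the exact SCcaller code):

```python
def passes_hsnp_filter(ref_count, alt_count, min_depth=20, min_var=4):
    """Illustrative hSNP quality filter: require a minimum total read
    depth and a minimum number of variant-supporting reads, in the
    spirit of the VarScan-style criterion mentioned above."""
    return (ref_count + alt_count) >= min_depth and alt_count >= min_var

print(passes_hsnp_filter(18, 5))  # True:  23 total reads, 5 variant reads
print(passes_hsnp_filter(18, 3))  # False: depth is fine, but only 3 variant reads
```

A depth-scaled alternative (e.g. requiring variant reads to be some fraction of total depth) would be easy to drop in here, as the questioner suggests.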

noushinquazi commented 6 years ago

I see. So why not subject the control hSNPs to the same filter? Unless I missed something in the code?

noushinquazi commented 6 years ago

I am referencing this part:

```python
if gh_snptype == 'dbsnp' and (gh_xaltcount < 2 or gh_xrefcount < 2 or (gh_xrefcount + gh_xaltcount < gh_readdepth)):
    continue
```

I only see the minimum-read filter being applied for "dbsnp", but what about for "control"?

biosinodx commented 6 years ago

About the criteria for calling hSNPs using "dbsnp" and "control": I used slightly different criteria because I assume "control" is more reliable, since it is supposed to be bulk sequencing of cells from the same subject as the single cell, while "dbsnp" is not from the same subject but from a public database. However, as mentioned earlier, I feel the criteria can be improved when more test data are available.