Illumina / Pisces

Somatic and germline variant caller for amplicon data. Recommended caller for tumor-only workflows.
GNU General Public License v3.0

DP is greater than AD #44

Open roselucia opened 5 years ago

roselucia commented 5 years ago

Hello,

I am using Pisces for variant calling on my TST170 data and have a question about the AD and DP values in the VCF. Why is the DP value greater than the sum of the two AD values given for the alt and reference alleles? The following passage on GitHub made me think that DP - AD(Alt) = AD(Ref): "Coverage at a particular location is calculated as the sum of all called reads which pass filters (have sufficient mapping quality and base call quality). Coverage is given as DP in the vcf file. Deleted bases have counted towards coverage since version 3.5.5. SNV count uses the same filters as the coverage count filters. Indel/MNV count is the sum of all reads passing filters where we were able to determine an indel/MNV. The estimate reference count is the coverage minus the variant count. (So, for an indel or MNV, the "reference count" is really the total number of passing reads that did not have the given variant). Variant frequency is calculated by dividing the variant count by the coverage at the given location." (https://github.com/Illumina/Pisces/wiki)

I would be very thankful for your help.

Best regards, Rose

tamsen commented 5 years ago

Hi there,

The DP count is almost always greater than the REF + ALT count. At many loci, any combination of A, C, G, T, deletions, etc. is observed, and DP accounts for all of them. The ref count is the count for exactly the ref allele. The alt count is the count for exactly the called alt allele.
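The gap between DP and the summed AD values can be read straight from the sample column of a VCF record. Below is a minimal sketch, assuming a standard VCF layout in which the FORMAT column names the fields (including AD and DP) and the sample column holds the values. The record is invented for illustration and the helper name `depth_gap` is hypothetical; it is not part of Pisces.

```python
def depth_gap(vcf_line: str) -> dict:
    """Compare DP with the sum of AD for the first sample in a VCF record."""
    cols = vcf_line.rstrip("\n").split("\t")
    keys = cols[8].split(":")      # FORMAT column, e.g. GT:GQ:AD:DP:VF
    values = cols[9].split(":")    # first sample column
    sample = dict(zip(keys, values))

    ad = [int(x) for x in sample["AD"].split(",")]  # ref count, alt count(s)
    dp = int(sample["DP"])
    # "other" = passing reads that support neither ref nor the called alt
    return {"AD": ad, "DP": dp, "other": dp - sum(ad)}


# Invented record: 2 ref reads, 4 alt reads, total filtered depth 8.
line = "chr1\t5\t.\tA\tC\t100\tPASS\tDP=8\tGT:GQ:AD:DP:VF\t0/1:100:2,4:8:0.500"
print(depth_gap(line))
```

A positive `other` value is the expected case described above; a negative one would correspond to the rarer situation where AD exceeds DP.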

There are some rare cases (I think restricted to older versions of Pisces) where the ALT count can get higher than the DP count, and that's more problematic. Typically it's because the variant count might be determined across one set of loci and the reference count across a slightly different set of loci (e.g., suppose the variant is an indel), and there might be a jump in read depth in one set of loci and not the other. You can also look at the loci in the BAM in IGV and see if something like that is going on.

best Tamsen

roselucia commented 5 years ago

Hi, thanks for your response. Someone from Illumina Technical Support already helped me out. However, he had a different explanation: "Allele depth will only account for reads assigned to one of the alleles which have been considered for the genotype. Some reads may have an allele which is not considered, which would get counted in DP but not in the sum of AD values. For example if ref is A, alt is C, there may be some T's and G's which may be present as errors." He furthermore told me that the DP value can also be smaller than the AD value, as "the AD values (one for each of REF and ALT fields) is the unfiltered count of all reads that carried with them the REF and ALT alleles". However, to my understanding both the DP value and the AD value are counts of filtered reads (for the position or the allele, respectively); see below.

DP = filtered reads at a particular position. "Coverage at a particular location is calculated as the sum of all called reads which pass filters (have sufficient mapping quality and base call quality). Coverage is given as DP in the vcf file." (https://github.com/Illumina/Pisces/wiki)

AD = filtered allelic depth (the number of filtered reads called for the ref allele and the alt allele). "The AD field in the .vcf file reflects any filtering done by the somatic variant caller. In particular: Reads with low mapping quality (below 1) are not included. Individual base calls with low base call quality (by default, below 20; this is override-able with the MinQScore setting). Because of this read- and base-level filtering, the raw depth from the bam file will often be greater than the value reported in the AD field." (https://github.com/Illumina/Pisces/wiki/Frequently-Asked-Questions)

We used Pisces 5.1.4.92.

Thanks for your help.

Best regards Rose

On 03.09.2019 at 18:53, tamsen wrote:

Hi there,

What version of Pisces are you using? In theory it shouldn't happen. In practice, sometimes it does, especially with older versions. Can you send me an example of the questionable line in the VCF and the version number? I can take a guess at the most likely explanation. Typically it's because the variant count might be determined across one set of loci and the reference count across a slightly different set of loci (e.g., suppose the variant is an indel), and there might be a jump in read depth in one set of loci and not the other. You can also look at the loci in the BAM in IGV and see if something like that is going on.

best Tamsen

tamsen commented 5 years ago

Hi. Yes, the tech support answer is fine! I was over-explaining how ALT counts can sometimes get higher than total DP counts. But as you are only asking how DP can be greater than REF counts + ALT counts, it's a simpler explanation. At many loci, any combination of A, C, G, T, deletions, etc. is observed, and DP accounts for all of them. The ref count is the count for exactly the ref allele. The alt count is the count for exactly the called alt allele.

I'll edit my answer.

roselucia commented 5 years ago

Good morning,

Thanks for your fast response. Do you maybe have an explanation for why GitHub says that the DP values and the AD values are both filtered? This does not match the explanation from Illumina Support.

DP = filtered reads at a particular position. "Coverage at a particular location is calculated as the sum of all called reads which pass filters (have sufficient mapping quality and base call quality). Coverage is given as DP in the vcf file." (https://github.com/Illumina/Pisces/wiki)

AD = filtered allelic depth (the number of filtered reads called for the ref allele and the alt allele). "The AD field in the .vcf file reflects any filtering done by the somatic variant caller. In particular: Reads with low mapping quality (below 1) are not included. Individual base calls with low base call quality (by default, below 20; this is override-able with the MinQScore setting). Because of this read- and base-level filtering, the raw depth from the bam file will often be greater than the value reported in the AD field." (https://github.com/Illumina/Pisces/wiki/Frequently-Asked-Questions)

Thanks a lot! Best regards, Rose

tamsen commented 5 years ago

Hi there,

So ALL reads used by Pisces are filtered. Pisces reads in the BAM and, before doing any variant calling, filters out reads with a poor mapping score and bases with poor base call quality. Once this is done, Pisces counts DP, ALT, REF, etc. This is why depth according to Pisces will look lower than depth in (for example) IGV: IGV counts depth from the raw reads, and Pisces only uses filtered reads.
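As a rough sketch of that order of operations (filter first, count afterwards): the thresholds below are the defaults quoted from the FAQ earlier in this thread (mapping quality below 1 is dropped, base call quality below 20 is dropped). The `Observation` type and the `filtered_depth` function are hypothetical names used for illustration only; this is not Pisces code.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    base: str   # base the read shows at this position
    mapq: int   # mapping quality of the read
    baseq: int  # base call quality at this position


def filtered_depth(obs, min_mapq=1, min_baseq=20):
    """Return (raw depth, depth after read- and base-level filters)."""
    kept = [o for o in obs if o.mapq >= min_mapq and o.baseq >= min_baseq]
    return len(obs), len(kept)


pileup = [
    Observation("A", 60, 35),
    Observation("C", 60, 30),
    Observation("C", 0, 35),   # fails the mapping quality filter
    Observation("A", 60, 12),  # fails the base call quality filter
]
print(filtered_depth(pileup))
```

The first number corresponds to what a raw-read viewer would show, the second to the depth left after filtering.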

roselucia commented 5 years ago

Hi,

Many thanks for your help. So is Illumina Technical Support's explanation of why the DP value can also be smaller than the AD value incorrect, then? Technical support explained to me that "the AD values (one for each of REF and ALT fields) is the unfiltered count of all reads that carried with them the REF and ALT alleles". As I am working with different panels (the Human Breast Cancer Panel from QIAGEN and the TST170 Panel from Illumina), and thus different variant callers (smCounter2 (https://www.biorxiv.org/content/biorxiv/early/2018/03/14/281659.full.pdf) and Pisces), I am trying to define general criteria for the inclusion of variants in order to compare variants called by the two panels. I believed the DP value to be such a criterion; however, I now think a comparison like this is impossible, as the process of variant calling is very different between those two engines.

Thanks again for your support!

All the best, Rose

tamsen commented 5 years ago

Hi,

No, tech support isn't wrong, per se. It's just that there are many filters at many levels. All reads used by Pisces are filtered before variant calling takes place, so read filters affect the counts of EVERYTHING going into calling. After variant calling, the candidate variants are additionally filtered (independently evaluated as likely variants or artifacts), and some filtered variants at a site might never make it to the final VCF. That latter filter is probably what tech support meant.

For example, suppose there are 10 reads that cover position 5. Suppose 2 reads fail the read filters and get thrown out. Then the total DP in the Pisces VCF will be 8. Now, suppose 4 of those reads are A->C SNPs, 2 reads are reference, and 2 reads are A->G. Pisces might filter the A->G as noise. So the ref count will be 2 and the alt count will be 4, and the total DP will still be 8. (Meanwhile, IGV or another caller might show a depth of 10.)
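The arithmetic of this example can be written out as a short tally. This is only an illustration of the counts described above, with the A->G allele assumed to have been filtered out as noise; it does not reproduce Pisces's internal logic.

```python
from collections import Counter

# Bases seen at position 5 in the 8 reads that passed the read filters
# (10 raw reads, 2 of which were thrown out beforehand).
passing = ["C"] * 4 + ["A"] * 2 + ["G"] * 2
ref, called_alt = "A", "C"  # A->G is assumed to be filtered out as noise

counts = Counter(passing)
dp = sum(counts.values())               # every passing read counts toward DP
ad = (counts[ref], counts[called_alt])  # only ref and the called alt are reported

print(dp, ad, dp - sum(ad))
```

The last value is the number of passing reads that count toward DP but appear in neither AD entry.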

Yes, DP is useful when comparing callers, but it can also be misleading, as depth is not as objective as one might think. Note that read-stitching also greatly affects reported depth.

best Tamsen

roselucia commented 5 years ago

Hi,

Thank you ever so much. So do I understand correctly that the DP value can be smaller than the sum of AD REF and AD ALT because additional filtering steps are applied after variant calling, which affect the DP value but not the AD values?

Thanks a lot.

All the best. Rose

tamsen commented 5 years ago

There will be various levels of filtering, plus read stitching, and for some types of variants (such as indels) AD is calculated a bit differently than total DP, so any of these can make DP greater or smaller than the sum of the alt and ref AD.

roselucia commented 5 years ago

Thanks! All the best, Rose

roselucia commented 5 years ago

Hi,

May I ask you one further question?

You told me that the AD value is filtered ("So ALL reads used by Pisces are filtered. Pisces reads in the bam and before doing any variant calling filters out reads with poor mapping score and bases with poor base call quality. Once this is done, Pisces counts DP, ALT, REF"). When I told you that Illumina Tech Support says that AD is unfiltered, you replied the following: "No, tech support isn't wrong, per se. It's just that there are many filters at many levels. All reads used by Pisces are filtered before variant calling takes place, so read-filters affect counts of EVERYTHING going into calling. After variant calling, the candidate variants are additionally filtered (independently evaluated as likely variants or artifacts) and some filtered variants at a site might never make it to the final vcf. That latter filter is probably what tech support meant. For example, suppose there are 10 reads that cover position 5. Suppose 2 reads fail read filters and get thrown out. Then the total DP in the Pisces vcf will be 8. Now, suppose 4 of those reads are A->C SNP, and 2 reads are reference, and 2 reads are A->G. Pisces might filter the A->G as noise. And so the ref count will be 2 and the alt count will be 4. And the total DP will still be 8. (Meanwhile IGV or another caller might show a depth of 10)"

If AD is truly also filtered, then I do not understand how AD Ref + AD Alt can be greater than DP (sample FORMAT). Illumina Tech Support explained this by DP being filtered (see below). May I ask you for clarification again?

Thanks a lot!

All the best, Rose

Explanation by the Illumina Tech Support:

- (AD Ref + AD Alt) < DP (sample FORMAT): "I've been looking into this, there are examples of DP being greater than AD for the alt allele. Allele depth will only account for reads assigned to one of the alleles which have been considered for the genotype. Some reads may have an allele which is not considered, which would get counted in DP but not in the sum of AD values. For example if ref is A, alt is C, there may be some T's and G's which may be present as errors. I found the following forum post: https://gatkforums.broadinstitute.org/gatk/discussion/2215/ad-value-higher-than-the-dp-value"

- (AD Ref + AD Alt) > DP (sample FORMAT): "While the sample-level (FORMAT) DP field describes the total depth of reads that passed the Unified Genotyper's internal quality control metrics (like MAPQ > 17, for example), the AD values (one for each of REF and ALT fields) is the unfiltered count of all reads that carried with them the REF and ALT alleles. The reason for this distinction is that the DP is in some sense reflective of the power I have to determine the genotype of the sample at this site, while the AD tells me how many times I saw each of the REF and ALT alleles in the reads, free of any bias potentially introduced by filtering the reads. If, for example, I believe there really is an A/T polymorphism at a site, then I would like to know the counts of A and T bases in this sample, even for reads with poor mapping quality that would normally be excluded from the statistical calculations going into GQ and QUAL. Please note, however, that the AD isn't necessarily calculated exactly for indels. Only reads which are statistically favoring one allele over the other are counted. Because of this fact, the sum of AD may be different than the individual sample depth, especially when there are many non-informative reads. Because the AD includes reads and bases that were filtered by the Unified Genotyper and in case of indels is based on a statistical computation, one should not base assumptions about the underlying genotype based on it; instead, the genotype likelihoods (PLs) are what determine the genotype calls."

roselucia commented 5 years ago

Hi Tamsen,

May I send you a private email or message? If so, I would appreciate your contact details. Thanks a lot! Rose

tamsen commented 5 years ago

Hi there, I have your email. I will contact you!