Bloated ref counts for variant size <= 2

angadps commented 8 years ago

Dear Kai and other Pindel authors,

Recently I have spent much time fixing allele fraction values for Pindel calls. This may apply equally to old and new versions of Pindel and depends only on the size of indel in question, being a problem only for indel sizes <= 2

Ref count in the isRefRead function is calculated based on edit distance value of individual reads. If edit distance <= NM (default of 2), the read is considered a reference read. This means a read with a 1bp indel (and no other edits) is also considered reference read. This bloats reference read counts and often homozygous nulls are reported as heterozygous.

I'm trying to fix this by calculating total read count (rather than ref read count) using the same code flow as for the alt allele read handling. I've duplicated the build_record_SR function where I'm including not just the "isWeirdRead" reads. I've duplicated the flush function too with the following change:

            if (m_rawreads[i].hasCloseEnd()) {
                    updateReadAfterCloseEndMapping(m_rawreads[i]);
            } // I update the read but irrespective of having a close end I add it to the total read count
                    #pragma omp critical
                    {
                            m_rawreads[i].SampleName2Number.insert(std::pair <std::string, unsigned> (m_rawreads[i].Tag, 1));
                            m_filteredReads.push_back(m_rawreads[i]);
                    }

In short, I update the read but, irrespective of having a close end I add it to the total read count. This helps me include both reference and alternate allele containing reads. However, many reads may get filtered out while calculating total alt count which are not being filtered out now. Hence I still get a small skew for AF values (peaks close to 40% and 80%).

Can somebody advise on how best I can modify the code base for a more accurate assessment of AF values. I'm happy to share my changes if I have something that is worth incorporating.

Thanks, --Angad.

a9473 commented 8 years ago

We recently noticed this issue as well. Not knowing the codebase, I'm not sure how best to achieve it, but it would be great to have a fix so that the allelic fractions we compute from the AD field for 1- and 2-bp indels become similar to those for >2-bp indels.

Andrew

liangkaiye commented 8 years ago

We could modify the way to compute the number of reference supporting reads.

kaiye2016@outlook.com

From: adeirossi Date: 2016-07-30 07:39 To: genome/pindel Subject: Re: [genome/pindel] Bloated ref counts for variant size <= 2 (#47) We recently noticed this issue as well. Not knowing the codebase, I'm not sure how best to achieve it, but it would be great to have a fix so that the allelic fractions we compute from the AD field for 1- and 2-bp indels become similar to those for >2-bp indels. Andrew — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

angadps commented 8 years ago

That would be great. Any pointers on the approach to follow would be very helpful, too.

EWLameijer commented 8 years ago

Dear Angad,

at the moment I'm also looking at the problem. The fundamental problem (that is, is a certain read more likely to be compatible with the ref or with the proposed alt) may not be very easy to solve, as that necessitates that the isRefRead-function also gets information about the proposed alt allele - which is possible, but likely not something that is fast or easy to implement).

Anyway, I've taken the liberty for your (and possibly other interested party's) benefit of creating a new branch called "studying_ref_support_problem", in which I have been 'tenderizing' some of the relevant code (adding comments, removing commented-out dead code - regression testing as I go). I still need to do some work (studying the build_record_SR, fetch_func_SR and flush functions), and I'm also compiling a list of questions for Kai.

For you, I have two questions which it would be great if you could answer/help. 1) can you try make a branch with your changes or send/show some of the functions you created? 2) You wrote that you duplicated the build_record_SR function to not only include the isWeirdReads. However, the isWeirdRead is (in the master branch) only called in the fetch_func_SR, not in build_record_SR. So did you mean the fetch_func_SR instead or did you separately add it to the build_record_SR?

Best regards,

Eric-Wubbo

angadps commented 8 years ago

Dear Eric-Wubbo,

I appreciate you looking at the problem. I think creating a new branch is a good start. I would like to update all my changes but, I've tried three different approaches without using an SVN, and uploading all my changes may confuse you more than make things clear. I am however, happy to post relevant functions/code to the discussion.

Talking about the isRefRead-function, I think it would be difficult to work with alt allele information here. Suppose there are two potential alt-alleles close to each other (as is the case far too often), a read-pair may harbor one alt-allele only making it alt-read for one alt-allele and ref-read for the other. Hence the values wont be precise.

An alternative approach in my opinion is to calculate total (and not ref) read count, which is to do with your second question. I think I may not have been very clear here so I'll explain again:

To calculate total read count, I create a new function build_record_SR_Read, which I call inside of fetch_func_SR this way:

    if (isGoodAnchor( b1_flags, b1)) {
    //      if (RN == query) std::cout << "isGoodAnchor" << std::endl;
            if (isWeirdRead( b2_flags, b2 )) {
    //              if (RN == query) std::cout << "################################################## found b1 b2 " << query << std::endl;
                    build_record_SR (b1, b2, data);
            }
            build_record_SR_Read(b1, b2, data); // Angad: My new function call here
            if (isRefRead( b2_flags, b2 )) {
    //              if (RN == query) std::cout << "################################################## found ref b2 b1 " << query << std::endl;
                    build_record_RefRead (b1, b2, data);
                    //std::cout << "refread 1" << std::endl;
            }
    }

Similarly for "b2_flags, b2" below that. The isWeirdRead function looks for alt-allele containing reads only but the new function builds both ref and alt reads. The build_record_SR_Read function is practically identical to the build_record_SR function with a minor change. I have a second read buffer in "struct fetch_func_data_SR" now to populate total read count (again, in a second "std::vector < SPLIT_READ > *TotalReads" variable in the same struct) using a second addRead2 function. The only line of difference in build_record_SR_Read is:

data_for_bam->readBuffer2->addRead2(Temp_One_Read);

This calls a second flush2 function:


  void ReadBuffer::addRead2( const SPLIT_READ& newRead )
  {
     if (m_currentsize == m_CAPACITY ) {
        flush2(); // Angad: the flush() functions perform a critical task of updating the SPLIT_READ data structures. The m_rawreads is a temporary buffer and reads are selectively added to m_filteredReads (which is really a reference to the SPLIT_READ vector)
     }
     m_rawreads.push_back(newRead);
     m_currentsize++;
  }

  void ReadBuffer::flush2()
  {
        #pragma omp parallel for
        for (int i=0; i<m_currentsize ; i++ ) {
                if (m_rawreads[i].MapperSplit == false)
                        GetCloseEnd(m_CHROMOSOME, m_rawreads[i]);

                if (m_rawreads[i].hasCloseEnd()) {
                        updateReadAfterCloseEndMapping(m_rawreads[i]);
                } // Angad: This closing parenthesis is moved up so that m_filteredReads gets populated all the time.
                        #pragma omp critical
                        {
                                m_rawreads[i].SampleName2Number.insert(std::pair <std::string, unsigned> (m_rawreads[i].Tag, 1));
                                m_filteredReads.push_back(m_rawreads[i]);
                        }
        }
        m_rawreads.clear();
        m_currentsize = 0;
  }

The problem starts here now as I attempt to apply an identical set of filters to my TotalReads vector, as are applied to the alt allele vector (or possibly another reason such as duplicate-marking or readInSpecifiedRegion() calculation that I don't know understand yet). To do this my first big change is to push into m_filteredReads (in flush2 above), all reads irrespective of whether hasCloseEnd() passes on them or not (I still call updateReadAfterCloseEndMapping() though). This gives me reasonably accurate counts for many indels, but still high ref counts for others.

I've made corresponding changes to duplicate UpdateRefReadCoverage by using TotalReads instead of RefSupportingReads. I've also made changes to various functions in reporter.cpp to call calculateRefSupportPerTag in addition to calculateSupportPerTag with minor (or possibly major ) differences. I can detail on this as needed.

BTW, I must tell you that I'm working on an older code base (Pindel version 0.2.5a3, Oct 24 2013.) and not the latest. But I suppose this part of the code has not changed much between them. I'm happy to answer further questions.

Thanks, --Angad.

EWLameijer commented 8 years ago

Hi Angad,

First of all, thank you for your response!

As an aside, Kai also gave a suggestion to work with the CIGAR-string; I haven't pulled it to master yet, but you can look at its results in the studying_ref_support_problem branch.

Responding further to your e-mail: you are correct in stating that a read that indicates "ref" for one SV may be the "alt" of another (close) SV. I suppose that reads should actually be counted and classified per SV, instead of being classified once in a 'universal' way.

To respond to the rest of your mail - as Kai started counting the reference reads around the time my employment at the LUMC ended, I unfortunately don't have that much knowledge on the role of the fetch_func and build_record functions, or how they fit into the larger picture. Kai may be able to give some help and explanation when he returns from his travels, as I feel I really need to understand the purpose of those functions better before I can reliably work with them or merge your work into master. I hope I can give some more news in a few days; I'll in any case continue some refactoring (well, adding comments) to at least help my work (and possibly that of others) in the future.

Best regards, and please stay tuned for a while longer...

Eric-Wubbo

a9473 commented 8 years ago

Many thanks to Eric-Wubbo for taking a crack at a fix. I tried running pindel from the studying_ref_support_problem branch (commit 686d854) on simulated BAM files and one real BAM file containing heterozygous indels of various lengths (including -2, -1, 1, 2, and longer ones). I'm pleased to report that the ref counts in the AD field now look reasonable in every case. Granted, these datasets are all examples of the simple case of a single isolated ALT indel (with error-free simulated reads), but I am very encouraged by this first-pass result.

liangkaiye commented 8 years ago

that is great! EW could push the changes to master, then.

On Fri, Aug 5, 2016 at 9:01 AM, adeirossi notifications@github.com wrote:

Many thanks to Eric-Wubbo for taking a crack at a fix. I tried running pindel from the studying_ref_support_problem branch (commit 686d854 https://github.com/genome/pindel/commit/686d8540fee4a7997d6085380fd18e2688171383) on simulated BAM files and one real BAM file containing heterozygous indels of various lengths (including -2, -1, 1, 2, and longer ones). I'm pleased to report that the ref counts in the AD field now look reasonable in every case. Granted, these datasets are all examples of the simple case of a single isolated ALT indel (with error-free simulated reads), but I am very encouraged by this first-pass result.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/genome/pindel/issues/47#issuecomment-237729370, or mute the thread https://github.com/notifications/unsubscribe-auth/AB9s-0eT4I-qp7kb__Yp53ucE31EoDbGks5qcot-gaJpZM4JUP4- .

EWLameijer commented 8 years ago

Andrew, thank you very much for your testing and feedback! Kai, thank you very much for the suggestion you e-mailed me on the CIGAR-strings. Given Andrew's results, it indeed seems to have been a good approach.

I've updated the gold standard and pushed the branch to master. @Angad: could you take a look whether this update also solves your issue?

angadps commented 8 years ago

Hi Eric-Wubbo and Kai,

Thanks for the fix. It is good to know that there are encouraging results for certain datasets. I'll test this very soon myself and share my findings with you. I have a bunch of confirmed indels in my test set so I can pay close attention to the AF values observed. Hopefully they will be close enough. Thanks.

angadps commented 8 years ago

Hi Eric-Wubbo,

Thanks a lot for your fix. I tested your code on two different libraries and I see an AF distribution which is much closer to expectation than before. See attached plot for the distribution: lib-02528pc2-af-histogram With good peaks at > 90% and ~50%, I do also notice a bunch of sites in the 10-40% range though. This may be since a ref read is still decided using the same manner and prone to error every once in a while. For now I'll keep that in mind and see if there is another way to handle them.

Also, I have merged all changes in this branch with an older Pindel release. The latest Pindel can be generally very aggressive in calling indels due to a newer dup-marking strategy ( #33 - Alt allele reads with same start/stop site are never marked as duplicates while ref allele reads always are). For now I will continue using the older version as it works for our needs great. But for a future version of Pindel if you do get a chance to look into how duplicates are marked , that would be great.

Thanks again for putting this together in such a short period!

--Angad.

EWLameijer commented 8 years ago

Hi Angad,

thank you very much for your analysis! I'm glad this version is at least an improvement for you relative to the previous situation.

On the ref counts: it could be that different classes of Pindel users (for example: cancer genome researchers versus people simply studying regular genotype/phenotype relationships) have different priorities in how sensitive/specific they want Pindel to be for reference and alt alleles. However, you needing to maintain and (as in this case) selectively update an old version of pindel seems troublesome, and you are likely not the only user with a problem like this.

Anyway, I'll take this up with Kai - his insight in Pindel is much greater than mine (he was also the person who thought of the fix for this problem), and I'll try to squeeze some hours from my next working weeks to see how far I can get. I don't dare to make promises, but I'll do my best, as the ref/alt problem seems quite important.

Best regards,

Eric-Wubbo

angadps commented 8 years ago

Hi Eric-Wubbo,

Working with tumor genomes, I definitely am looking for greater sensitivity of Pindel calls. That's why Pindel-C is a good option. However, when working with high coverage libraries (1000X+), duplication rates can be high and ref vs alt counts are highly skewed. I've seen low allelic fractions that were reported at up to 10 times their actual expected AF value. This is definitely a big problem. This may not be as big a problem for 100-200X coverage though. I see the motivation behind this approach to improve sensitivity. I wouldn't ask you to trouble yourself a lot with this issue and don't seek any promises either. Thanks again for your help.

Best regards, --Angad.

EWLameijer commented 8 years ago

Hi Angad,

thank you for your extra clarification.

Anyway, this problem still seems significant enough to put on a list 'open issues'; I'll have to discuss with Kai and possibly other Pindel authors on its priority. But if I encounter an easy way to solve it, I'll definitely post it here!

Best regards,

Eric-Wubbo

genome / pindel

Bloated ref counts for variant size <= 2 #47