Open fpbarthel opened 9 years ago
Is there a copy-paste error on Run 2 or does the output actually contain those floating point numbers after the phycov field? Have you verified that your files are indeed identical?
Sorry! The floating point numbers were an artifact by using excel. I'm correcting it now. The conf files are also indentical:
Run1
readgroup:C19LW.4 platform:illumina map:/scratch/bcb/fpbarthel/WGS_GBM_TP_NB_1_of_3/TCGA-02-2483-01/G26378.TCGA-02-2483-01A-01D-1494-08.1.bam readlen:101.00 lib:Sage-127239 num:9994 lower:0.00 upper:745.32 mean:310.45 std:122.05 SWnormality:-63.74 flag:0(1.71%)1(0.02%)18(94.90%)20(2.43%)32(0.87%)4(0.03%)64(0.01%)8(0.01%)20533 exe:samtools view readgroup:D1ERB.1 platform:illumina map:/scratch/bcb/fpbarthel/WGS_GBM_TP_NB_1_of_3/TCGA-02-2483-10/G26378.TCGA-02-2483-10A-01D-1494-08.1.bam readlen:101.00 lib:Sage-127238 num:9998 lower:0.00 upper:642.46 mean:335.15 std:109.83 SWnormality:-73.59 flag:0(2.40%)1(0.02%)18(94.76%)20(2.18%)32(0.59%)4(0.05%)64(0.01%)10247 exe:samtools view readgroup:D1ERB.2 platform:illumina map:/scratch/bcb/fpbarthel/WGS_GBM_TP_NB_1_of_3/TCGA-02-2483-10/G26378.TCGA-02-2483-10A-01D-1494-08.1.bam readlen:101.00 lib:Sage-127238 num:9998 lower:0.00 upper:642.46 mean:335.15 std:109.83 SWnormality:-73.59 flag:0(2.16%)1(0.02%)18(95.70%)20(1.52%)32(0.52%)4(0.07%)64(0.01%)10315 exe:samtools view readgroup:C1A1H.8 platform:illumina map:/scratch/bcb/fpbarthel/WGS_GBM_TP_NB_1_of_3/TCGA-02-2483-10/G26378.TCGA-02-2483-10A-01D-1494-08.1.bam readlen:101.00 lib:Sage-127186 num:10000 lower:0.00 upper:572.48 mean:311.71 std:90.79 SWnormality:-74.72 flag:0(1.18%)1(0.00%)18(96.66%)20(1.76%)32(0.36%)4(0.02%)8(0.01%)20310 exe:samtools view
Run2
readgroup:C19LW.4 platform:illumina map:/scratch/bcb/fpbarthel/TCGA/WGS/GBM/1d77eb0d-c0a9-4681-9227-b3738cbf265c/G26378.TCGA-02-2483-01A-01D-1494-08.1.bam readlen:101.00 lib:Sage-127239 num:9994 lower:0.00 upper:745.32 mean:310.45 std:122.05 SWnormality:-63.74 flag:0(1.71%)1(0.02%)18(94.90%)20(2.43%)32(0.87%)4(0.03%)64(0.01%)8(0.01%)20533 exe:samtools view readgroup:D1ERB.1 platform:illumina map:/scratch/bcb/fpbarthel/TCGA/WGS/GBM/0a2475da-42bb-4adb-b86c-5492511e09f1/G26378.TCGA-02-2483-10A-01D-1494-08.1.bam readlen:101.00 lib:Sage-127238 num:9998 lower:0.00 upper:642.46 mean:335.15 std:109.83 SWnormality:-73.59 flag:0(2.40%)1(0.02%)18(94.76%)20(2.18%)32(0.59%)4(0.05%)64(0.01%)10247 exe:samtools view readgroup:D1ERB.2 platform:illumina map:/scratch/bcb/fpbarthel/TCGA/WGS/GBM/0a2475da-42bb-4adb-b86c-5492511e09f1/G26378.TCGA-02-2483-10A-01D-1494-08.1.bam readlen:101.00 lib:Sage-127238 num:9998 lower:0.00 upper:642.46 mean:335.15 std:109.83 SWnormality:-73.59 flag:0(2.16%)1(0.02%)18(95.70%)20(1.52%)32(0.52%)4(0.07%)64(0.01%)10315 exe:samtools view readgroup:C1A1H.8 platform:illumina map:/scratch/bcb/fpbarthel/TCGA/WGS/GBM/0a2475da-42bb-4adb-b86c-5492511e09f1/G26378.TCGA-02-2483-10A-01D-1494-08.1.bam readlen:101.00 lib:Sage-127186 num:10000 lower:0.00 upper:572.48 mean:311.71 std:90.79 SWnormality:-74.72 flag:0(1.18%)1(0.00%)18(96.66%)20(1.76%)32(0.36%)4(0.02%)8(0.01%)20310 exe:samtools view
I was curious about the BAM files themselves, although that would not make sense for the normal where one library is identical and the other has odd numbers.
This is unexpected and I will review the coverage code to see if that gives us any hints.
The only thing I see right now is that the output for Run2 contains some odd floating point numbers. For example: 1.756944444 2:14956 4:525772 1.966666667 32:123892
should look more like 1:2470 2:14956 4:525772 8:2352 32:123892
from your original run.
If the Run 2 output is correct and not something getting mangled during the posting of this issue then that is concerning.
My mistake, I thought it was correct, but I was opening the files with two different programs. They are both colon seperated numbers, I've corrected the run2 output now.
On Wed, Jul 22, 2015 at 3:51 PM, Dave Larson notifications@github.com wrote:
I was curious about the BAM files themselves, although that would not make sense for the normal where one library is identical and the other has odd numbers.
This is unexpected and I will review the coverage code to see if that gives us any hints.
The only thing I see right now is that the output for Run2 contains some odd floating point numbers. For example: 1.756944444 2:14956 4:525772 1.966666667 32:123892 should look more like 1:2470 2:14956 4:525772 8:2352 32:123892 from your original run.
If the Run 2 output is correct and not something getting mangled during the posting of this issue then that is concerning.
— Reply to this email directly or view it on GitHub https://github.com/genome/breakdancer/issues/7#issuecomment-123858376.
Yep. I see the correction. Sorry I missed your note on the first exchange with the conf. It's been a while since I've looked at this part of the code, but I'll see what I can find.
Just a thought, if it turns out something weird is going on that you can't find any explanation for (which I'm sort of suspecting for now), I'd like to make a simple integrity checker that checks the integrity of individual output files by re-calculating the phycov and seqcov values from the input BAM files and comparing it to the values noted in the output. If this is the case, could you explain how to compute these values from the BAM file?
Seems reasonable. I, too, fear that we don't have a lot to go on here and are unlikely to find a smoking gun. It would help to know if the issue is transient or repeatable.
I can tell you how these values are computed.
Each BAM is scanned and the following values are computed:
covered_reference_length = the length of the genome excluding regions at the end of each chromosome where no reads are observed to map. This is calculated by adding up the distance between the start position of the first read on a chromosome and the last read on a chromosome across all chromosomes in the genome.
number_proper_paired_reads_in_library = the number of properly paired reads observed in the given library. Excludes reads not meeting the mapping quality threshold.
number_proper_paired_reads_in_BAM = the number of properly paired reads observed in the BAM. Excludes reads not meeting the mapping quality threshold.
library_read_length = length of the reads as appeared in the config file for the library
library_mean_size = mean insert size as reported in the config file
If you have these values for the library then:
sequence_coverage = number_proper_paired_reads_in_library * library_read_length / covered_reference_length
physical_coverage = number_proper_paired_reads_in_library * library_mean_size / covered_reference_length / 2
I'm operating under the assumption that something went wrong with the per-library calculations for Sage-127186
. The fact that the other library for that file seems to be have been identical between runs suggests that parsing of the BAM file itself went fine.
I don't see anything obvious at first glance. It is possible for the configuration to specify a per-library mapping quality cutoff, but I don't see that in your config files. Other ways to avoid counting reads for a single library could be if those reads are not properly paired or if they were somehow misassigned by the code to the wrong library (or a non-existent library). I'm not going to be able to figure that out from the examples you have provided.
Can you try running again and seeing if you see a repeat of Run 2? If that's the case, then maybe we can do some detailed debugging.
Do the number_proper_pairedreads* variables exclude duplicate reads?
Yes
What's the "reflen" variable in the config file? Is that the same as "covered_reference_length" you describe but for the whole genome?
It is. Those are the same values.
Ok, I'll calculate the phycov and seqcov to see if it agrees most with run1 or run2 and in the meantime ill copy the files over again to re-run breakdancer. Thanks.
So together with my senior colleague @Syzheng we programmatically calculated the phycov and seqcov per chromosome of ~70 BAM files using python and compared them to run #1 and run #2. Interestingly we found that in a sizable subset neither run #1 nor run #2 provides a good match to the calculated values. Furthermore, we found another run somewhere in our data archives (let's call this run #3) which was ran on breakdancer 1.1 (Ken Chen's version). Surprisingly (but also somewhat fortunately) we found that the phycov and seqcov values in run #3 match very closely to the ones we calculated. Our suspicion right now is that there must be a bug in 1.4.5 that's causing this, but are at a loss exactly what it is. @Syzheng is rerunning a few samples using breakdancer 1.4.5 (let's call this run #4) to see what that does.
Yikes. Well, if we can get a good example with a publicly available BAM file then I'd be happy to try to track down the source of the discrepancy.
Also, did you run the perl 1.1 version or the C++ version?
Hi Dave, we use cpp v1.1.2. Do you mind contacting me via email szheng2@mdanderson.org? I would like to share some data that might help with this issue. Thanks.
Sorry it has taken me so very long, but I think I have this fixed.
The issue occurs when data is excluded in the config. Those reads with read groups not in the config file are assigned to the first library in the config. Config files where all read groups are present will not have this issue.
Check out the config_fix branch and give it a try.
We (unintentionally) ran breakdancer twice on several sample tumor-normal pair using the same breakdancer version. The first run was 6 months ago and the second run was one month ago. We noticed that the results differ drastically despite the fact that the config files generated by bam2cfg are identical (except the paths to the bam files). Whereas run 1 has several dozen somatic events, run 2 has no somatic events. In the header parameters we noticed differences in for example seqcov and phycov. How are these values determined? How can we explain these drastic differences? I’ve highlighted similarities between run1 and 2 in yellow and differences in cyan.
Run 1:
Run 2: