genome / breakdancer

SV detection from paired end reads mapping
113 stars 42 forks source link

Drastic differences between identical breakdancer runs #7

Open fpbarthel opened 9 years ago

fpbarthel commented 9 years ago

We (unintentionally) ran breakdancer twice on several sample tumor-normal pair using the same breakdancer version. The first run was 6 months ago and the second run was one month ago. We noticed that the results differ drastically despite the fact that the config files generated by bam2cfg are identical (except the paths to the bam files). Whereas run 1 has several dozen somatic events, run 2 has no somatic events. In the header parameters we noticed differences in for example seqcov and phycov. How are these values determined? How can we explain these drastic differences? I’ve highlighted similarities between run1 and 2 in yellow and differences in cyan.

Run 1:

Software: 1.4.5 (commit 251f983)

Command: /scratch/genomic_med/vapor/flowr_bdancer/breakdancer_nautilus/breakdancer/bin/breakdancer-max -o 10 -h 1d77eb0d-c0a9-4681-9227-b3738cbf265c__0a2475da-42bb-4adb-b86c-5492511e09f1/conf.file

Library Statistics:

/scratch/bcb/fpbarthel/TCGA/WGS/GBM/0a2475da-42bb-4adb-b86c-5492511e09f1/G26378.TCGA-02-2483-10A-01D-1494-08.1.bam mean:311.71 std:90.79 uppercutoff:572.48 lowercutoff:0 readlen:101 library:Sage-127186 reflen:135464726 seqcov:31.5374 phycov:48.6659 1:7262 2:27082 4:1178829 8:6890 32:338595

/scratch/bcb/fpbarthel/TCGA/WGS/GBM/0a2475da-42bb-4adb-b86c-5492511e09f1/G26378.TCGA-02-2483-10A-01D-1494-08.1.bam mean:335.15 std:109.83 uppercutoff:642.46 lowercutoff:0 readlen:101 library:Sage-127238 reflen:135464726 seqcov:24.7833 phycov:41.1194 1:2470 2:14956 4:525772 8:2352 32:123892

/scratch/bcb/fpbarthel/TCGA/WGS/GBM/1d77eb0d-c0a9-4681-9227-b3738cbf265c/G26378.TCGA-02-2483-01A-01D-1494-08.1.bam mean:310.45 std:122.05 uppercutoff:745.32 lowercutoff:0 readlen:101 library:Sage-127239 reflen:135464726 seqcov:9.45366 phycov:14.5291 1:3438 2:4854 4:238007 8:3072 32:152646

Run 2:

Software: 1.4.5 (commit 251f983)

Command: breakdancer-max -o 10 -h /scratch/bcb/fpbarthel/WGS_GBM_TP_NB_1_of_3/breakdancer/bd-02-2483.cfg

Library Statistics:

/scratch/bcb/fpbarthel/WGS_GBM_TP_NB_1_of_3/TCGA-02-2483-10/G26378.TCGA-02-2483-10A-01D-1494-08.1.bam mean:311.71 std:90.79 uppercutoff:572.48 lowercutoff:0 readlen:101 library:Sage-127186 reflen:135464726 seqcov:11.0158 phycov:16.9987 1:990 2:7492 4:204802 8:1000 32:46965

/scratch/bcb/fpbarthel/WGS_GBM_TP_NB_1_of_3/TCGA-02-2483-10/G26378.TCGA-02-2483-10A-01D-1494-08.1.bam mean:335.15 std:109.83 uppercutoff:642.46 lowercutoff:0 readlen:101 library:Sage-127238 reflen:135464726 seqcov:24.7833 phycov:41.1194 1:2470 2:14956 4:525772 8:2352 32:123892

/scratch/bcb/fpbarthel/WGS_GBM_TP_NB_1_of_3/TCGA-02-2483-01/G26378.TCGA-02-2483-01A-01D-1494-08.1.bam mean:310.45 std:122.05 uppercutoff:745.32 lowercutoff:0 readlen:101 library:Sage-127239 reflen:135464726 seqcov:29.9752 phycov:46.0684 1:9710 2:13554 4:1212034 8:8962 32:444276

ernfrid commented 9 years ago

Is there a copy-paste error on Run 2 or does the output actually contain those floating point numbers after the phycov field? Have you verified that your files are indeed identical?

fpbarthel commented 9 years ago

Sorry! The floating point numbers were an artifact by using excel. I'm correcting it now. The conf files are also indentical:

Run1

readgroup:C19LW.4 platform:illumina map:/scratch/bcb/fpbarthel/WGS_GBM_TP_NB_1_of_3/TCGA-02-2483-01/G26378.TCGA-02-2483-01A-01D-1494-08.1.bam readlen:101.00 lib:Sage-127239 num:9994 lower:0.00 upper:745.32 mean:310.45 std:122.05 SWnormality:-63.74 flag:0(1.71%)1(0.02%)18(94.90%)20(2.43%)32(0.87%)4(0.03%)64(0.01%)8(0.01%)20533 exe:samtools view readgroup:D1ERB.1 platform:illumina map:/scratch/bcb/fpbarthel/WGS_GBM_TP_NB_1_of_3/TCGA-02-2483-10/G26378.TCGA-02-2483-10A-01D-1494-08.1.bam readlen:101.00 lib:Sage-127238 num:9998 lower:0.00 upper:642.46 mean:335.15 std:109.83 SWnormality:-73.59 flag:0(2.40%)1(0.02%)18(94.76%)20(2.18%)32(0.59%)4(0.05%)64(0.01%)10247 exe:samtools view readgroup:D1ERB.2 platform:illumina map:/scratch/bcb/fpbarthel/WGS_GBM_TP_NB_1_of_3/TCGA-02-2483-10/G26378.TCGA-02-2483-10A-01D-1494-08.1.bam readlen:101.00 lib:Sage-127238 num:9998 lower:0.00 upper:642.46 mean:335.15 std:109.83 SWnormality:-73.59 flag:0(2.16%)1(0.02%)18(95.70%)20(1.52%)32(0.52%)4(0.07%)64(0.01%)10315 exe:samtools view readgroup:C1A1H.8 platform:illumina map:/scratch/bcb/fpbarthel/WGS_GBM_TP_NB_1_of_3/TCGA-02-2483-10/G26378.TCGA-02-2483-10A-01D-1494-08.1.bam readlen:101.00 lib:Sage-127186 num:10000 lower:0.00 upper:572.48 mean:311.71 std:90.79 SWnormality:-74.72 flag:0(1.18%)1(0.00%)18(96.66%)20(1.76%)32(0.36%)4(0.02%)8(0.01%)20310 exe:samtools view

Run2

readgroup:C19LW.4 platform:illumina map:/scratch/bcb/fpbarthel/TCGA/WGS/GBM/1d77eb0d-c0a9-4681-9227-b3738cbf265c/G26378.TCGA-02-2483-01A-01D-1494-08.1.bam readlen:101.00 lib:Sage-127239 num:9994 lower:0.00 upper:745.32 mean:310.45 std:122.05 SWnormality:-63.74 flag:0(1.71%)1(0.02%)18(94.90%)20(2.43%)32(0.87%)4(0.03%)64(0.01%)8(0.01%)20533 exe:samtools view readgroup:D1ERB.1 platform:illumina map:/scratch/bcb/fpbarthel/TCGA/WGS/GBM/0a2475da-42bb-4adb-b86c-5492511e09f1/G26378.TCGA-02-2483-10A-01D-1494-08.1.bam readlen:101.00 lib:Sage-127238 num:9998 lower:0.00 upper:642.46 mean:335.15 std:109.83 SWnormality:-73.59 flag:0(2.40%)1(0.02%)18(94.76%)20(2.18%)32(0.59%)4(0.05%)64(0.01%)10247 exe:samtools view readgroup:D1ERB.2 platform:illumina map:/scratch/bcb/fpbarthel/TCGA/WGS/GBM/0a2475da-42bb-4adb-b86c-5492511e09f1/G26378.TCGA-02-2483-10A-01D-1494-08.1.bam readlen:101.00 lib:Sage-127238 num:9998 lower:0.00 upper:642.46 mean:335.15 std:109.83 SWnormality:-73.59 flag:0(2.16%)1(0.02%)18(95.70%)20(1.52%)32(0.52%)4(0.07%)64(0.01%)10315 exe:samtools view readgroup:C1A1H.8 platform:illumina map:/scratch/bcb/fpbarthel/TCGA/WGS/GBM/0a2475da-42bb-4adb-b86c-5492511e09f1/G26378.TCGA-02-2483-10A-01D-1494-08.1.bam readlen:101.00 lib:Sage-127186 num:10000 lower:0.00 upper:572.48 mean:311.71 std:90.79 SWnormality:-74.72 flag:0(1.18%)1(0.00%)18(96.66%)20(1.76%)32(0.36%)4(0.02%)8(0.01%)20310 exe:samtools view

ernfrid commented 9 years ago

I was curious about the BAM files themselves, although that would not make sense for the normal where one library is identical and the other has odd numbers.

This is unexpected and I will review the coverage code to see if that gives us any hints.

The only thing I see right now is that the output for Run2 contains some odd floating point numbers. For example: 1.756944444 2:14956 4:525772 1.966666667 32:123892 should look more like 1:2470 2:14956 4:525772 8:2352 32:123892 from your original run.

If the Run 2 output is correct and not something getting mangled during the posting of this issue then that is concerning.

fpbarthel commented 9 years ago

My mistake, I thought it was correct, but I was opening the files with two different programs. They are both colon seperated numbers, I've corrected the run2 output now.

On Wed, Jul 22, 2015 at 3:51 PM, Dave Larson notifications@github.com wrote:

I was curious about the BAM files themselves, although that would not make sense for the normal where one library is identical and the other has odd numbers.

This is unexpected and I will review the coverage code to see if that gives us any hints.

The only thing I see right now is that the output for Run2 contains some odd floating point numbers. For example: 1.756944444 2:14956 4:525772 1.966666667 32:123892 should look more like 1:2470 2:14956 4:525772 8:2352 32:123892 from your original run.

If the Run 2 output is correct and not something getting mangled during the posting of this issue then that is concerning.

— Reply to this email directly or view it on GitHub https://github.com/genome/breakdancer/issues/7#issuecomment-123858376.

ernfrid commented 9 years ago

Yep. I see the correction. Sorry I missed your note on the first exchange with the conf. It's been a while since I've looked at this part of the code, but I'll see what I can find.

fpbarthel commented 9 years ago

Just a thought, if it turns out something weird is going on that you can't find any explanation for (which I'm sort of suspecting for now), I'd like to make a simple integrity checker that checks the integrity of individual output files by re-calculating the phycov and seqcov values from the input BAM files and comparing it to the values noted in the output. If this is the case, could you explain how to compute these values from the BAM file?

ernfrid commented 9 years ago

Seems reasonable. I, too, fear that we don't have a lot to go on here and are unlikely to find a smoking gun. It would help to know if the issue is transient or repeatable.

I can tell you how these values are computed.

Each BAM is scanned and the following values are computed:
    covered_reference_length = the length of the genome excluding regions at the end of each chromosome where no reads are observed to map. This is calculated by adding up the distance between the start position of the first read on a chromosome and the last read on a chromosome across all chromosomes in the genome.
    number_proper_paired_reads_in_library = the number of properly paired reads observed in the given library. Excludes reads not meeting the mapping quality threshold.
    number_proper_paired_reads_in_BAM = the number of properly paired reads observed in the BAM. Excludes reads not meeting the mapping quality threshold.
    library_read_length = length of the reads as appeared in the config file for the library
    library_mean_size = mean insert size as reported in the config file

If you have these values for the library then:
    sequence_coverage = number_proper_paired_reads_in_library * library_read_length / covered_reference_length
    physical_coverage = number_proper_paired_reads_in_library * library_mean_size / covered_reference_length / 2
ernfrid commented 9 years ago

I'm operating under the assumption that something went wrong with the per-library calculations for Sage-127186. The fact that the other library for that file seems to be have been identical between runs suggests that parsing of the BAM file itself went fine.

I don't see anything obvious at first glance. It is possible for the configuration to specify a per-library mapping quality cutoff, but I don't see that in your config files. Other ways to avoid counting reads for a single library could be if those reads are not properly paired or if they were somehow misassigned by the code to the wrong library (or a non-existent library). I'm not going to be able to figure that out from the examples you have provided.

Can you try running again and seeing if you see a repeat of Run 2? If that's the case, then maybe we can do some detailed debugging.

fpbarthel commented 9 years ago

Do the number_proper_pairedreads* variables exclude duplicate reads?

ernfrid commented 9 years ago

Yes

fpbarthel commented 9 years ago

What's the "reflen" variable in the config file? Is that the same as "covered_reference_length" you describe but for the whole genome?

ernfrid commented 9 years ago

It is. Those are the same values.

fpbarthel commented 9 years ago

Ok, I'll calculate the phycov and seqcov to see if it agrees most with run1 or run2 and in the meantime ill copy the files over again to re-run breakdancer. Thanks.

fpbarthel commented 9 years ago

So together with my senior colleague @Syzheng we programmatically calculated the phycov and seqcov per chromosome of ~70 BAM files using python and compared them to run #1 and run #2. Interestingly we found that in a sizable subset neither run #1 nor run #2 provides a good match to the calculated values. Furthermore, we found another run somewhere in our data archives (let's call this run #3) which was ran on breakdancer 1.1 (Ken Chen's version). Surprisingly (but also somewhat fortunately) we found that the phycov and seqcov values in run #3 match very closely to the ones we calculated. Our suspicion right now is that there must be a bug in 1.4.5 that's causing this, but are at a loss exactly what it is. @Syzheng is rerunning a few samples using breakdancer 1.4.5 (let's call this run #4) to see what that does.

ernfrid commented 9 years ago

Yikes. Well, if we can get a good example with a publicly available BAM file then I'd be happy to try to track down the source of the discrepancy.

ernfrid commented 9 years ago

Also, did you run the perl 1.1 version or the C++ version?

syzheng commented 9 years ago

Hi Dave, we use cpp v1.1.2. Do you mind contacting me via email szheng2@mdanderson.org? I would like to share some data that might help with this issue. Thanks.

ernfrid commented 8 years ago

Sorry it has taken me so very long, but I think I have this fixed.

The issue occurs when data is excluded in the config. Those reads with read groups not in the config file are assigned to the first library in the config. Config files where all read groups are present will not have this issue.

Check out the config_fix branch and give it a try.