In alignment pipeline, histmet is not getting generated, breaking SV calling.

glebkuznetsov commented 10 years ago

On the Fix Recoli AWS server, the following fails. Why?

java -Xmx1024M -jar tools/picard/CollectInsertSizeMetrics.jar I=/millstone_data/temp_data/projects/92f50c8a/alignment_groups/c359d2cf/sample_alignments/2a07d767/bwa_align.sorted.grouped.realigned.withmd.bam O=/millstone_data/temp_data/projects/92f50c8a/alignment_groups/c359d2cf/sample_alignments/2a07d767/bwa_align.sorted.grouped.realigned.withmd.insertmet.txt HISTOGRAM_FILE=/millstone_data/temp_data/projects/92f50c8a/alignment_groups/c359d2cf/sample_alignments/2a07d767/bwa_align.sorted.grouped.realigned.withmd.histmet.txt VALIDATION_STRINGENCY=LENIENT

glebkuznetsov commented 10 years ago

Here's the output from running that command:

[Wed Apr 02 19:55:15 UTC 2014] net.sf.picard.analysis.CollectInsertSizeMetrics HISTOGRAM_FILE=/millstone_data/temp_data/projects/92f50c8a/alignment_groups/c359d2cf/sample_alignments/2a07d767/bwa_align.sorted.grouped.realigned.withmd.histmet.txt INPUT=/millstone_data/temp_data/projects/92f50c8a/alignment_groups/c359d2cf/sample_alignments/2a07d767/bwa_align.sorted.grouped.realigned.withmd.bam OUTPUT=/millstone_data/temp_data/projects/92f50c8a/alignment_groups/c359d2cf/sample_alignments/2a07d767/bwa_align.sorted.grouped.realigned.withmd.insertmet.txt VALIDATION_STRINGENCY=LENIENT DEVIATIONS=10.0 MINIMUM_PCT=0.05 METRIC_ACCUMULATION_LEVEL=[ALL_READS] ASSUME_SORTED=true STOP_AFTER=0 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false [Wed Apr 02 19:55:15 UTC 2014] Executing as ubuntu@ip-10-43-157-10 on Linux 3.11.0-15-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_51-b00; Picard version: 1.96(1510) INFO 2014-04-02 19:55:18 SinglePassSamProgram Processed 1,000,000 records. Elapsed time: 00:00:02s. Time for last 1,000,000: 2s. Last read position: NC_000913:2,066,112 INFO 2014-04-02 19:55:20 SinglePassSamProgram Processed 2,000,000 records. Elapsed time: 00:00:04s. Time for last 1,000,000: 2s. Last read position: NC_000913:4,069,965 WARNING 2014-04-02 19:55:21 CollectInsertSizeMetrics All data categories were discarded because they contained < 0.05 of the total aligned paired data. WARNING 2014-04-02 19:55:21 CollectInsertSizeMetrics Total mapped pairs in all categories: 0.0 [Wed Apr 02 19:55:21 UTC 2014] net.sf.picard.analysis.CollectInsertSizeMetrics done. Elapsed time: 0.09 minutes. Runtime.totalMemory()=309329920

dbgoodman commented 10 years ago

This is what bwa mem looks like when I run things by hand:

tools/bwa/bwa mem /millstone_data/temp_data/projects/92f50c8a/ref_genomes/6d4c8605/mg1655_kqrJFx.fa <(bzcat /millstone_data/temp_data/projects/92f50c8a/samples/399e6680/LIB007279_GEN00012930_GCGCTA_L001_R1.fastq.bz2) <(bzcat /millstone_data/temp_data/projects/92f50c8a/samples/399e6680/LIB007279_GEN00012930_GCGCTA_L001_R2.fastq.bz2) | tools/samtools/samtools view -bS  - > /tmp/testbwa.bam

Output:

[M::main_mem] read 66668 sequences (10000200 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 31382, 2, 1)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (145, 181, 240)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 430)
[M::mem_pestat] mean and std.dev: (196.08, 72.05)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 525)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::worker2@0] performed mate-SW for 1769 reads
...

And in the pipeline it looks like:

 [M::main_mem] read 66668 sequences (10000200 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 0, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::worker2@0] performed mate-SW for 0 reads
...

So....it has nothing to do with the order in which insertmetrics is called, it is a problem with running BWA as part of the pipeline versus manually via the command line.

dbgoodman commented 10 years ago

This sample succeeds:

tools/bwa/bwa mem \
    /millstone_data/temp_data/projects/92f50c8a/ref_genomes/6d4c8605/mg1655_kqrJFx.fa <(bzcat /millstone_data/temp_data/projects/92f50c8a/samples/399e6680/LIB007279_GEN00012930_GCGCTA_L001_R1.fastq.bz2) <(bzcat /millstone_data/temp_data/projects/92f50c8a/samples/399e6680/LIB007279_GEN00012930_GCGCTA_L001_R2.fastq.bz2) | tools/samtools/samtools view -bS - > /tmp/testbwa.bam

And this one fails:

tools/bwa/bwa mem \
   /millstone_data/temp_data/projects/92f50c8a/ref_genomes/6d4c8605/mg1655_kqrJFx.fa <(bzcat /millstone_data/temp_data/projects/92f50c8a/samples/0020ff78/LIB007279_GEN00012924_GTTACC_L001_R1.fastq.bz2) <(bzcat /millstone_data/temp_data/projects/92f50c8a/samples/0020ff78/LIB007279_GEN00012924_GTTACC_L001_R1.fastq.bz2) | tools/samtools/samtools view -bS - > /tmp/testbwa.bam

dbgoodman commented 10 years ago

Culprit found: https://github.com/churchlab/millstone/blob/master/genome_designer/scripts/import_util.py#L369

The R2 dataset was being linked up to the R1 file.

churchlab / millstone

In alignment pipeline, histmet is not getting generated, breaking SV calling. #235