broadinstitute / viral-ngs

Viral genomics analysis pipelines

vphaser p-value computation is unstable #97

Closed by iljungr 9 years ago

iljungr commented 9 years ago

When V-Phaser 2 is run on the sample file it is distributed with, 4528.454.indelRealigned.bam, some of the p-values in the resulting .txt files vary from run to run. Only the .txt files vary; the other output files are identical across runs. Across 10 runs, 5 produced identical p-values, 3 others matched each other, and the remaining 2 differed from each other and from all other runs.
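One way to summarize this kind of run-to-run variation is to group runs whose output text is byte-for-byte identical. The following is only a minimal sketch: the function name `group_runs_by_content` and the run labels are hypothetical, and the example strings are shortened placeholders rather than real V-Phaser 2 output.

```python
from collections import defaultdict


def group_runs_by_content(run_outputs):
    """Group run labels whose output text is identical.

    run_outputs maps a run label (hypothetical, e.g. "run1") to the
    full text of that run's .txt file. Returns a list of label lists,
    largest group first.
    """
    groups = defaultdict(list)
    for label, text in run_outputs.items():
        groups[text].append(label)
    # sorted() is stable, so equally sized groups keep first-seen order
    return sorted(groups.values(), key=len, reverse=True)


# Placeholder data: two runs agree, one differs in the fourth field.
runs = {
    "run1": "2872 A G 0.2215 snp\n",
    "run2": "2872 A G 0.3361 snp\n",
    "run3": "2872 A G 0.2215 snp\n",
}
print(group_runs_by_content(runs))
```

Applied to the 10 runs described above, a grouping like this would be expected to yield four groups of sizes 5, 3, 1, and 1.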

Here's a sample diff:

diff output1/V4528_assembly.var.raw.txt output2
32c32
< 2872 A G 0.2215 snp 0.7186 A:1:5 G:436:393
---
> 2872 A G 0.3361 snp 0.7186 A:1:5 G:436:393
41c41
< 3260 C A 0.4104 snp 0.7018 A:506:343 C:1:5
---
> 3260 C A 0.5249 snp 0.7018 A:506:343 C:1:5
57c57
< 4236 G A 0.6867 snp 0.5566 A:534:538 G:2:4
---
> 4236 G A 0.6939 snp 0.5566 A:534:538 G:2:4
134c134
< 8191 A T 0.6692 snp 1.354 A:1:5 T:296:141
---
> 8191 A T 0.7388 snp 1.354 A:1:5 T:296:141

For now, TestVPhaser works around the instability by not comparing the .txt files. The comparison should be restored once the instability is fixed.
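A narrower workaround than skipping the .txt files entirely would be to compare them with the unstable field dropped. The sketch below is not the TestVPhaser code: the helper names are hypothetical, and it assumes the p-value is the fourth whitespace-separated field, since that is the only field that differs in the sample diff.

```python
# Assumption: zero-based index of the field that varies in the sample diff.
PVALUE_FIELD = 3


def strip_pvalue(line, field=PVALUE_FIELD):
    """Split a line on whitespace and drop the assumed p-value field."""
    fields = line.split()
    if len(fields) > field:
        del fields[field]
    return fields


def same_except_pvalue(text_a, text_b):
    """Return True if two outputs match line by line, ignoring the p-value."""
    lines_a = [strip_pvalue(line) for line in text_a.splitlines()]
    lines_b = [strip_pvalue(line) for line in text_b.splitlines()]
    return lines_a == lines_b


# The first pair of lines from the sample diff differs only in the p-value.
a = "2872 A G 0.2215 snp 0.7186 A:1:5 G:436:393\n"
b = "2872 A G 0.3361 snp 0.7186 A:1:5 G:436:393\n"
assert same_except_pvalue(a, b)
```

A comparison like this would still catch changes in positions, alleles, variant percentages, and strand counts. Any header or comment lines in the real files would need separate handling.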

iljungr commented 9 years ago

Since we do not expect to use these p-values downstream, this is not an important issue for the viral-ngs pipeline.

The issue has been posted on the Broad Viral Tool Users forum here: https://groups.google.com/forum/?hl=en&fromgroup#!topic/viral-tool-users/zYnN6b4vdLw