Open sb43 opened 7 months ago
Following our discussion I created a Dockerfile to install the dependencies in a single image and very small script focusing on the Bio::DB::HTS
call to get_features_by_location
that we identified as returning incorrect values:
Reproduce issue in cgppindel 3.10.0 image:
# quay.io/wtsicgp/cgppindel:3.10.0
$ docker run --rm -ti -v $PWD/tmp:/var/spool/data quay.io/wtsicgp/cgppindel:3.10.0 perl /var/spool/data/htslib-bind.pl | wc -l
2
Command runs docker, mounting a folder containing the script and bam/bai files, inline executes the script and counts the reads.
$ tree tmp/
tmp/
├── htslib-bind.pl
├── normal.sample.dupmarked.bam
└── normal.sample.dupmarked.bam.bai
Reproduce and prove fix with minimal image for speed of rebuild:
# reproduced: hts-lib-1.12
$ docker run --rm -ti -v $PWD/tmp:/var/spool/data hts-lib-1.12 perl /var/spool/data/htslib-bind.pl | wc -l
2
# fixed: hts-lib-1.20
$ docker run --rm -ti -v $PWD/tmp:/var/spool/data hts-lib-1.20 perl /var/spool/data/htslib-bind.pl | wc -l
34
You need to update all of you image stack to 1.20 to ensure this isn't impacting any other tooling. For pindel specifically this is:
You should review the other code stacks as most are built on top of PCAP-core.
htslib-bind.pl
Expects files to be mounted to /var/spool/data/normal.sample.dupmarked.bam
(inc index).
#!perl
use strict;
use Const::Fast qw(const);
use Bio::DB::HTS;
const my $F_UNMAPPED => 4;
const my $F_NOT_PRIMARY => 256;
const my $F_QC_FAIL => 512;
const my $F_DUPLICATE => 1024;
const my $F_SUPP_ALIGN => 2048;
const my @FILTER_READS_IF_FLAG => ($F_DUPLICATE, $F_UNMAPPED, $F_SUPP_ALIGN, $F_QC_FAIL, $F_NOT_PRIMARY);
my $sam = Bio::DB::HTS->new(-bam => '/var/spool/data/normal.sample.dupmarked.bam');
my $seqid = 'chr2';
my $r_start = 202555406;
my $r_end = 202555406;
my $align_iter = $sam->get_features_by_location(
-seq_id => $seqid,
-start => $r_start,
-end => $r_end,
-filter => \&_read_filter,
-iterator => 1,
);
while(my $a = $align_iter->next_seq) {
print $a->qname."\n";
}
sub _read_filter {
my $flag = shift->flag;
for my $test(@FILTER_READS_IF_FLAG) {
return 0 if(($flag | $test) == $flag);
}
return 1;
}
This issue has been resolved in cgpPindel 3.11.0
I have tested different versions of pindel to narrow down the read counting issue, it seems the issue has been observed after version v3.3.0 Versions tested and issue reproduced. pindel_3.10.0 (current version) pindel_3.5.0 pindel_3.7.0
VCF record showing difference in DP and PD counts in both tumour and normal samples between pindel versions.
Correct call pindel version (V3.3.0)
tabix ../WTSI-COLO_170_1pre_pindel_old_version_3.3.0/WTSI-COLO_170_1pre_vs_WTSI-COLO_170_b.flagged.vcf.gz chr2:202555406-202555406 chr2 202555406 897c3bb4-f119-11ee-90f0-9928a4fd4733 GA G 1200 PASS FF017;LEN=1;PC=D;RE=202555414;REP=7;RS=202555406;S1=105;S2=1683.32 GT:PP:NP:PB:NB:PD:ND:PR:NR:PU:NU:FD:FC:MTR:WTR:AMB:UNK:VAF ./.:0:0:0:0:15:19:15:19:0:0:34:0:0:32:0:0:0.0000 ./.:14:6:15:8:35:27:35:27:15:8:62:23:25:38:5:0:0.3968
Output from all affected versions
tabix pindel_3.10.0/WTSI-COLO_170_1pre_vs_WTSI-COLO_170_b.flagged.vcf.gz chr2:202555406-202555406 chr2 202555406 97380cb8-f75b-11ee-b828-bb7ec1a1a3fd GA G 1200 FF018 PC=D;RS=202555406;RE=202555414;LEN=1;S1=105;S2=1683.32;REP=7;FF017 GT:PP:NP:PB:NB:PD:ND:PR:NR:PU:NU:FD:FC ./.:0:0:0:0:2:0:2:0:0:0:2:0 ./.:14:6:1:0:2:1:16:7:15:6:23:21
VCF_HEADER:
FORMAT=
FORMAT=
FORMAT=
FORMAT=
FORMAT=
FORMAT=
FORMAT=
FORMAT=
FORMAT=
FORMAT=
FORMAT=
FORMAT=
FORMAT=
FILTER= 10">
Thanks, Shriram