@drjsanger do you agree this is an appropriate patch based on the findings and examples below?
When a bed file is used for UM panel the a new tabix object is created for each reference (for compatibility with VCF model). By creating a single tabix object and saving the reference to each pointer instead memory can be vastly reduced.
This is more obvious in GRCh38 as >3000 contigs.
I wrote a small script based on the loading method used in cgpFlagCaVEMan.pl to illustrate the point:
Current method (stopping at 1k chrs):
$ /usr/bin/time -f 'Max resident size (kb): %M\nElapsed wall (s): %e' \
perl sizeOfTabix.pl genome.fa.fai unmatchedNormal.bed.gz 0
Unique mode: New tabix object for each chromosome.
Exiting after 1000 chrs or memory explodes
Max resident size (kb): 7408852
Elapsed wall (s): 21.93
New method (stopping at 1k chrs):
$ /usr/bin/time -f 'Max resident size (kb): %M\nElapsed wall (s): %e' perl sizeOfTabix.pl genome.fa.fai unmatchedNormal.bed.gz 1
Shared mode: ALL chromosomes share SINGLE tabix object.
Exiting after 1000 chrs or memory explodes
Max resident size (kb): 23872
Elapsed wall (s): 0.09
As you can see >7GB down to ~23MB, plus far faster (and less I/O). This is an extra driver for the production system to switch away from the VCF method also.
Script:
#!/usr/bin/perl
use strict;
use Bio::DB::HTS::Tabix;
my ($refFai, $bedloc, $shared) = @ARGV;
my $fileList = {};
if($shared) {
print "Shared mode: ALL chromosomes share SINGLE tabix object.\n"
} else {
print "Unique mode: New tabix object for each chromosome.\n"
}
my $tbibed;
open(my $REF, '<', $refFai) or croak("Error opening reference index $refFai: $!");
while(my $line = <$REF>){
next if($line =~ m/^\s*#/);
chomp($line);
my ($chr,undef) = split(/\t/,$line);
if($shared) {
$tbibed = Bio::DB::HTS::Tabix->new(filename => $bedloc) unless($tbibed);
$fileList->{$chr} = $tbibed;
} else {
$fileList->{$chr} = Bio::DB::HTS::Tabix->new(filename => $bedloc);
}
if($. == 1000) {
print "Exiting after $. chrs or memory explodes\n";
last;
}
}
close($REF);
@drjsanger do you agree this is an appropriate patch based on the findings and examples below?
When a bed file is used for UM panel the a new tabix object is created for each reference (for compatibility with VCF model). By creating a single tabix object and saving the reference to each pointer instead memory can be vastly reduced.
This is more obvious in GRCh38 as >3000 contigs.
I wrote a small script based on the loading method used in
cgpFlagCaVEMan.pl
to illustrate the point:Current method (stopping at 1k chrs):
New method (stopping at 1k chrs):
As you can see >7GB down to ~23MB, plus far faster (and less I/O). This is an extra driver for the production system to switch away from the VCF method also.
Script: