cancerit / cgpCaVEManPostProcessing

Flagging add on to CaVEMan
http://cancerit.github.io/cgpCaVEManPostProcessing/
GNU Affero General Public License v3.0
2 stars 4 forks source link

WIP: Reduce memory overhead of unmatched normal bed by using single shared reference #32

Closed keiranmraine closed 5 years ago

keiranmraine commented 5 years ago

@drjsanger do you agree this is an appropriate patch based on the findings and examples below?

When a bed file is used for UM panel the a new tabix object is created for each reference (for compatibility with VCF model). By creating a single tabix object and saving the reference to each pointer instead memory can be vastly reduced.

This is more obvious in GRCh38 as >3000 contigs.

I wrote a small script based on the loading method used in cgpFlagCaVEMan.pl to illustrate the point:

Current method (stopping at 1k chrs):

$ /usr/bin/time -f 'Max resident size (kb): %M\nElapsed wall (s): %e' \
  perl sizeOfTabix.pl genome.fa.fai unmatchedNormal.bed.gz 0
Unique mode: New tabix object for each chromosome.
Exiting after 1000 chrs or memory explodes
Max resident size (kb): 7408852
Elapsed wall (s): 21.93

New method (stopping at 1k chrs):

$ /usr/bin/time -f 'Max resident size (kb): %M\nElapsed wall (s): %e' perl sizeOfTabix.pl genome.fa.fai unmatchedNormal.bed.gz 1
Shared mode: ALL chromosomes share SINGLE tabix object.
Exiting after 1000 chrs or memory explodes
Max resident size (kb): 23872
Elapsed wall (s): 0.09

As you can see >7GB down to ~23MB, plus far faster (and less I/O). This is an extra driver for the production system to switch away from the VCF method also.

Script:

#!/usr/bin/perl

use strict;
use Bio::DB::HTS::Tabix;

my ($refFai, $bedloc, $shared) = @ARGV;

my $fileList = {};
if($shared) {
  print "Shared mode: ALL chromosomes share SINGLE tabix object.\n"
} else {
  print "Unique mode: New tabix object for each chromosome.\n"
}

my $tbibed;
open(my $REF, '<', $refFai) or croak("Error opening reference index $refFai: $!");
while(my $line = <$REF>){
  next if($line =~ m/^\s*#/);
  chomp($line);
  my ($chr,undef) = split(/\t/,$line);
  if($shared) {
    $tbibed = Bio::DB::HTS::Tabix->new(filename => $bedloc) unless($tbibed);
    $fileList->{$chr} = $tbibed;
  } else {
    $fileList->{$chr} = Bio::DB::HTS::Tabix->new(filename => $bedloc);
  }
  if($. == 1000) {
    print "Exiting after $. chrs or memory explodes\n";
    last;
  }
}
close($REF);
ghost commented 5 years ago

Thanks Keiran. Such a simple fix too.