SionBayliss / PIRATE

A toolbox for pangenome analysis and threshold evaluation.
GNU General Public License v3.0
89 stars 29 forks source link

memory as a command-line argument? #17

Closed JA-Lacey closed 4 years ago

JA-Lacey commented 5 years ago

Have PIRATE running seemed to work well on smaller dataset. But in large dataset Cd-Hit runs out of memory. It appears to be hardcoded as -M 2450. Can you make this a command-line argument?

Thanks.

SionBayliss commented 5 years ago

Hi Jake,

CD-HIT shouldn't be hard coded to anything. it is calculated as 3* the size of the input file. In pangenome_construction:

my $m_required = -s "$output_dir/$sample.temp.fasta";
$m_required = int($m_required/1000000); #MB
$m_required *= 3; # triple
$m_required = 2000 if($m_required < 2000); # set minimum

...

my $cd_hit_command = "$cd_hit_bin -i $output_dir/$sample.temp.fasta -o $output_dir/$sample.$i -aS $cdhit_aS -c $curr_thresh -T $threads -g 1 -n $n -M $m_required -d 256 >> $cdhit_log";

#print " - command: \"$cd_hit_command\"\n";
$cd_hit_out = `$cd_hit_command`;

-M isn't specified in classify paralogs (but it probably should be). I will fix that now.

Where are you getting that value from?

p.s. could you use the split_correction branch. It has various bugfixes I am waiting to push to the release version.

S

tseemann commented 5 years ago

Ok so it's not hardcoded, but you are setting it via a formula. This is cool, but might need some fine-tuning. So that number isn't enough for the N that @JA-Lacey is using.

How many genomes did you use @JA-Lacey ? And how many proteins in each one?

JA-Lacey commented 5 years ago

I gave it a go on 2340 genomes each with ~1800 proteins. All the same Sequence type. the -M value is what was printed in the .log @tseemann @SionBayliss

Perhaps the 3* is not quite enough to handle very large datasets.

Thanks

tseemann commented 5 years ago

This is what roary does:

https://github.com/sanger-pathogens/Roary/blob/b4f65339c587b0e07a40433b80fa0f2cbc0e9d87/lib/Bio/Roary/External/Cdhit.pm#L39-L55

it looks like PIRATE has copied the code, but notice it is *5 here now despite the comment that @andrewjpage has :-)

sub _build_memory_in_mb
{
  my ($self) = @_;
  my $filename = $self->input_file;
  my $memory_required = 2000;
  if(-e $filename)
  {
    $memory_required = -s $filename;
    # Convert to mb
    $memory_required = int($memory_required/1000000);
    # Triple memory for worst case senario
    $memory_required *= 5;
    $memory_required = 2000 if($memory_required < 2000);
  }

  return $memory_required;
}
tseemann commented 5 years ago

Maybe put this in a function?

use List::Util qw(max);
use constant CDHIT_MEM_FACTOR => 5;
my $ram = max(2000, int( (-s $filename) * CDHIT_MEM_FACTOR / 1E6 ) );

And re-use it here:

./scripts/pangenome_construction.pl:            $m_required *= 3; # triple
./scripts/run_classify_paralogs_batch.pl:$m_required *= 2; # double
./scripts/run_classify_paralogs_sequential.pl:$m_required *= 3; # triple

I can help with how to make a module with re-useable funcs

SionBayliss commented 5 years ago

As quick fix I modified the m_required from 3x -> 5x in the latest commit to master.

Jake could you try again for me and see if it works? I have run larger dataset that the one mentioned above with no issues so I would like to track down what is causing the memory error.

SionBayliss commented 4 years ago

Hi @JA-Lacey ,

I have made branch 'cdmem' which allows for you to manually specify the amount of memory for CDHIT. On the cdmem branch use the option --pan-opt "--cd-mem *" and replace the asterix with the amount of memory in MB you want assign to cdhit. Does that resolve your issue?

sthifx commented 4 years ago

Hi Sion,

Thanks very much for PIRATE: It's a great tool. I was also having issues with cd-hit memory usage, and I suspect your cdmem branch will fix this nicely. I'm trying to integrate it into a Snakemake pipeline and I was wondering, is there a way to install the cdmem branch via conda?

Thanks again, and all the best

tseemann commented 4 years ago

Ok, this isn't easy for people to test, so here is how you do it

git clone https://github.com/SionBayliss/PIRATE.git
git fetch origin cdmem
git checkout cdmem

It might be easier to just make the change. It's clearly a problem.

SionBayliss commented 4 years ago

Hi @sthifx,

As Torsten said, just remember to use the ./bin/PIRATE executable from the cloned git directory and you should be good. I will push it to master sortly, I am just testing it with the help of @JDHarlingLee.

SionBayliss commented 4 years ago

Hi @sthifx @JA-Lacey ,

The new release version v1.0.1 should resolve the issues you have encountered. It allows you to manually assign more memory to cdhit (-pan-opt "--cd-mem *" ) and should catch and report on any issues with files generated as inputs for cdhit.

Due to error checking the script currently takes additional memory overhead which is unnecessary prior to invoking cdhit. I will remove this in a later commit, but the current release should be sufficient. In the mean time it would be sensible to ensure you have enough memory to run cdhit and blast/diamond and a little extra. Perhaps monitor the beginning on the PIRATE run (the stage up until cdhit begins to run) using top.

With the help of @JDHarlingLee I have been running PIRATE on a dataset of >10,000 genomes and it seems to be performing well. So it shouldn't have any issues with the datasets you have discussed.

I will leave the issue open in case you have any similar issues.

sthifx commented 4 years ago

Hi @SionBayliss

Thanks very much for the new release - I've carried out some testing with -pan-opt "--cd-mem *" and everything seems to be working perfectly.

Thanks again and all the best!

JA-Lacey commented 4 years ago

Yes same with me. All seems to be working fine. Thanks

SionBayliss commented 4 years ago

@JA-Lacey @sthifx

Glad I could help!