Closed JA-Lacey closed 4 years ago
Hi Jake,
CD-HIT's memory limit shouldn't be hard-coded to anything. It is calculated as 3x the size of the input file. In pangenome_construction:
my $m_required = -s "$output_dir/$sample.temp.fasta";
$m_required = int($m_required/1000000); #MB
$m_required *= 3; # triple
$m_required = 2000 if($m_required < 2000); # set minimum
...
my $cd_hit_command = "$cd_hit_bin -i $output_dir/$sample.temp.fasta -o $output_dir/$sample.$i -aS $cdhit_aS -c $curr_thresh -T $threads -g 1 -n $n -M $m_required -d 256 >> $cdhit_log";
#print " - command: \"$cd_hit_command\"\n";
$cd_hit_out = `$cd_hit_command`;
-M isn't specified in the classify paralogs scripts (but it probably should be). I will fix that now.
Where are you getting that value from?
p.s. could you use the split_correction branch. It has various bugfixes I am waiting to push to the release version.
S
Ok so it's not hardcoded, but you are setting it via a formula. This is cool, but might need some fine-tuning. So that number isn't enough for the N that @JA-Lacey is using.
How many genomes did you use @JA-Lacey ? And how many proteins in each one?
I gave it a go on 2340 genomes, each with ~1800 proteins, all the same sequence type. The -M value is what was printed in the .log. @tseemann @SionBayliss
Perhaps the 3* is not quite enough to handle very large datasets.
Thanks
This is what Roary does: it looks like PIRATE has copied this code, but notice it is now *5 here, despite the comment from @andrewjpage still saying "triple" :-)
sub _build_memory_in_mb
{
my ($self) = @_;
my $filename = $self->input_file;
my $memory_required = 2000;
if(-e $filename)
{
$memory_required = -s $filename;
# Convert to mb
$memory_required = int($memory_required/1000000);
# Triple memory for worst case senario
$memory_required *= 5;
$memory_required = 2000 if($memory_required < 2000);
}
return $memory_required;
}
Maybe put this in a function?
use List::Util qw(max);
use constant CDHIT_MEM_FACTOR => 5;
my $ram = max(2000, int( (-s $filename) * CDHIT_MEM_FACTOR / 1E6 ) );
And re-use it here:
./scripts/pangenome_construction.pl: $m_required *= 3; # triple
./scripts/run_classify_paralogs_batch.pl:$m_required *= 2; # double
./scripts/run_classify_paralogs_sequential.pl:$m_required *= 3; # triple
I can help with how to make a module with reusable functions.
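For illustration, the consolidated heuristic could look like the sketch below. It is written in Python purely to show the shape of the helper (PIRATE itself is Perl); the function name is made up, and the 5x factor is the value under discussion, not a settled choice:

```python
def cdhit_memory_mb(size_bytes, factor=5, floor_mb=2000):
    """Memory (MB) to pass to cd-hit via -M: factor x the input
    FASTA size in MB, never dropping below a 2000 MB floor."""
    return max(floor_mb, size_bytes * factor // 1_000_000)

# A 1 GB input asks cd-hit for 5000 MB; tiny inputs get the floor.
print(cdhit_memory_mb(1_000_000_000))  # 5000
print(cdhit_memory_mb(100_000_000))    # 2000 (500 MB computed < floor)
```

Each calling script would then pass its own factor (or, better, they would all share one constant), instead of repeating the arithmetic inline.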
As a quick fix I changed m_required from 3x to 5x in the latest commit to master.
Jake, could you try again for me and see if it works? I have run larger datasets than the one mentioned above with no issues, so I would like to track down what is causing the memory error.
Hi @JA-Lacey ,
I have made a branch 'cdmem' which allows you to manually specify the amount of memory for CD-HIT. On the cdmem branch, use the option --pan-opt "--cd-mem *" and replace the asterisk with the amount of memory in MB you want to assign to cd-hit (e.g. "--cd-mem 8000" for 8 GB). Does that resolve your issue?
Hi Sion,
Thanks very much for PIRATE: It's a great tool. I was also having issues with cd-hit memory usage, and I suspect your cdmem branch will fix this nicely. I'm trying to integrate it into a Snakemake pipeline and I was wondering, is there a way to install the cdmem branch via conda?
Thanks again, and all the best
Ok, this isn't easy for people to test, so here is how you do it:
git clone https://github.com/SionBayliss/PIRATE.git
cd PIRATE
git fetch origin cdmem
git checkout cdmem
It might be easier to just make the change. It's clearly a problem.
Hi @sthifx,
As Torsten said, just remember to use the ./bin/PIRATE executable from the cloned git directory and you should be good. I will push it to master shortly; I am just testing it with the help of @JDHarlingLee.
Hi @sthifx @JA-Lacey ,
The new release version v1.0.1 should resolve the issues you have encountered. It allows you to manually assign more memory to cd-hit (--pan-opt "--cd-mem *") and should catch and report any issues with the files generated as inputs for cd-hit.
Due to error checking, the script currently incurs some additional memory overhead prior to invoking cd-hit, which is unnecessary; I will remove this in a later commit, but the current release should be sufficient. In the meantime it would be sensible to ensure you have enough memory to run cd-hit and blast/diamond, plus a little extra. Perhaps monitor the beginning of the PIRATE run (the stage up until cd-hit begins to run) using top.
With the help of @JDHarlingLee I have been running PIRATE on a dataset of >10,000 genomes and it seems to be performing well. So it shouldn't have any issues with the datasets you have discussed.
I will leave the issue open in case you have any similar issues.
Hi @SionBayliss
Thanks very much for the new release - I've carried out some testing with --pan-opt "--cd-mem *" and everything seems to be working perfectly.
Thanks again and all the best!
Yes same with me. All seems to be working fine. Thanks
@JA-Lacey @sthifx
Glad I could help!
I have PIRATE running and it seemed to work well on smaller datasets, but on a large dataset CD-HIT runs out of memory. It appears to be hardcoded as -M 2450. Can you make this a command-line argument?
Thanks.