ewels / clusterflow

A pipelining tool to automate and standardise bioinformatics analyses on cluster environments.
https://ewels.github.io/clusterflow/
GNU General Public License v3.0
97 stars 27 forks source link

warning about undefined $GENOME #15

Closed ewels closed 9 years ago

ewels commented 9 years ago

There's a warning about undefined $GENOME when launching pipelines which don't need it. This is from L891 in cf and a couple of other places around there. Because no genome is specified then the interpolation there is wrong. I initially thought you'd just need to make $GENOME an empty string, but I think it's worse than that since you'll effectively end up passing the wrong number of arguments through to the memory request function if $GENOME is empty. I haven't looked at this thoroughly enough to see what the right fix is but I guess you'll probably know.

s-andrews commented 9 years ago

Since this is still hitting us an confusing people I've had another look. The offending code is the code which pulls out the number of cores to be used:

# Get options for module. 2 - negotiate cores                                                                           
my $sim_argv = "$runfn $GENOME cores_placeholder $parameters";
my $cores = `$module_fn $sim_argv --cores $TOTAL_CORES`;
if(!$cores || length($cores) == 0 || $cores =~ /\D/){
  warn "\nWarning: no cores allocation or error parsing cores allocation: setting to 1Gb\n";
  $cores = 1;
}

Since the $sim_argv string is only used to pass along to the the call to the module, and since that module is called with --cores then I don't see what $sim_argv is doing anyway. The API says that if --cores is called then the only argument will be the total offered cores, so the genome and other stuff shouldn't be there should it?

s-andrews commented 9 years ago

Have confirmed that removing the $sim_argv declarations above and taking $sim_argv out of the backticks command on the subsequent lines for both the --cores and --mem calls don't have an adverse effect on the functioning of the system and don't produce the warning so will patch our code this way.

It looks like this might have been an half implementation of a system where you could (for example) allocate different amounts of memory based on the genome you were searching, but since the API doesn't say anything about this it's not something which should be implemented this way anyhow.

ewels commented 9 years ago

Hi @s-andrews,

Apologies for not getting around to looking at this for you yet. This code is from the recent pull request #13 by @stu2 - as it is still in the development branch I haven't got to fully testing it, and it's not documented at all yet.

As you say, I think that the intention behind this code is to allocate different amounts of memory. Specifically, I believe that @stu2 wanted to allocate memory based on the reference genome size in the STAR module.

I like the idea of modules being able to dynamically request resources based on information available at run time. Can we not swap unset variables with a recognised string (eg. False?) to avoid the warnings but keep the functionality in?

Phil

s-andrews commented 9 years ago

Sure, that would work too. The main thing here would be that you don't print a potentially undefined string to the command you're running. You haven't updated the API to say what is going to be passed to the modules, but if you're doing to do it as a single set of arguments then you'll need to have a flag value, as otherwise it will just shift all of the other arguments along if it's missing.

What I wasn't clear about is why you'd need to pass the genome anyway. From what I could see you were already passing in the run file which contains that information - unless it's just to save reloading the file, in which case why pass in the run file at all?

ewels commented 9 years ago

Hmm, yeah, reading the run file seems a lot easier. @stu2 - is there any reason why you took the approach that you did?

stu2 commented 9 years ago

Hi, apologies for the warning messages, I hadn't considered the case of pipelines not requiring genomes.

Actually, the value held in $GENOME (which is a keyword, not a filename) is not accessible in the run-files as far as I can see, that's why I added it. What is in the runfile is the paths to the Bowtie/STAR/GTF index files. But I don't mind if we stop passing $GENOME to the cfmods, because my star.cfmod actually does pretty much what you suggested, it grabs the STAR index filepaths from the runfile and measures how big the files are, so it doesn't actually use $GENOME. I just thought that in future, someone making cfmods might find it useful to have access to that keyword for memory allocation. (Perhaps it could be added to the runfile too?)

Nonetheless, expect star.cfmod to break if you don't pass it the same number of args for the memory negotiation. But it shouldn't be to hard for me to fix. Minimally, as long cf passes $runfn (and preferably $parameters as well, although it might risk provoking the same undefined string issue) it should be workable. FYI, samtools_sort.cfmod is currently in the same boat.

Cheers, Stuart

ewels commented 9 years ago

Ok, I think that it would be cleanest to handle these cases by passing the path to the run file to the module and using that to store all required config information.

I don't really like relying on option order, so I'll use a new flag for the run file name instead.

@stu2 and @s-andrews - is everyone happy with this solution? @stu2, I'll do a little bit of testing, but if you could confirm that everything still works as you expect it to after my changes, that would be brilliant.

I'll reference this issue with my commits so that they show up.

Cheers all,

Phil

s-andrews commented 9 years ago

I think that passing the run file is the best way to handle this but why not simply have --mem and --cores take the runfn as an argument?

module --mem=my.run module --cores=cores

Those are just flags at the moment so you get a free bit of data to pass around and any module which doesn't need it can just ignore it (so you don't need to change any existing modules). Also by passing the run file there's never going to be a need to add anything else as every bit of information about the submission is therefore accessible.

Simon.

PS You may have seen that we've forked clusterflow so we can keep track of the changes we've made locally. I've actually been fixing a bunch of issues we've come across (including some of the cleanup you mention in these edits) so feel free to pull anything you like from our repository.

ewels commented 9 years ago

I don't think --mem and --cores are just flags though, Cluster Flow tries to be clever and passes in a suggested number of available cores. For instance, in the bismark_align module:

if($required_cores){
    print CF::Helpers::allocate_cores($required_cores, 4, 8);
    exit;
}

Here, CF is passing in a suggested number of cores in the $required_cores variable and the module is using that number as long as it's between 4 and 8.

And yes, that was my thought about passing the run file - it effectively future proofs the system as we can put whatever information we like into there.

Great news about the CF fork! I was just chatting to @FelixKrueger about it actually, I think that you have to initiate a Pull Request to my parent repository though, I don't think I can pull it from you as such.

Phil

ewels commented 9 years ago

With the h_vmem option being removed (see https://github.com/ewels/clusterflow/issues/26) this has become even simpler. Basically just the STAR module calculating how much memory it needs from the reference.

I think this is all completed now anyway. Closing..