ewels / clusterflow

A pipelining tool to automate and standardise bioinformatics analyses on cluster environments.
https://ewels.github.io/clusterflow/
GNU General Public License v3.0

Passing an Environment Variable to a Cluster in Clusterflow #83

Closed: machbio closed this issue 8 years ago

machbio commented 8 years ago

Cluster Flow works on the assumption that when a user loads Cluster Flow and runs a job, their user environment, including any loaded environment modules, is exported to the job via the -V parameter in the qsub command, like:

/* @custom_job_submit_command qsub -cwd -V -S /bin/bash -MY --CUSTOM --PARAMS -pe orte {{cores}} {{command}} */

It would be advisable for future versions of Cluster Flow to move to a better solution, where the user environment is not exported and each job instead loads its own modules inside the qsub run script. The -V approach is unreliable across different cluster configurations. We can discuss the pros and cons of this approach, but I just wanted to bring to the developers' notice that this problem exists in our cluster configuration.
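For illustration, a self-contained job script along these lines (a hypothetical sketch, not Cluster Flow's actual behaviour; the module init path, parallel environment and tool names are placeholders for whatever a given site uses) would avoid -V entirely by setting up the environment inside the job itself:

#!/bin/bash
#$ -cwd
#$ -S /bin/bash
#$ -pe orte 8
# Load the environment inside the job rather than inheriting it
# from the submission shell via -V.
source /etc/profile.d/modules.sh
module load fastqc bismark
fastqc sample_1.fastq.gz

Each job then behaves the same regardless of what happened to be loaded on the head node.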

ewels commented 8 years ago

Hi @machbio,

Do you have any suggestions for a better alternative? I don't have access to an SGE cluster any more.

I suspect that reorganising the code to have independent environments for each job will be a large task. The majority of users don't seem to have this problem, so I can't promise that this change will happen soon. I'll bear it in mind for when I next have some time to work on Cluster Flow though.

Phil

machbio commented 8 years ago

Thanks Phil - I know it's a huge undertaking, and I did not expect it to be done right away. I just wanted to let you know that exporting the user environment into cluster jobs is unreliable in many cluster configurations.

For now, I am working around the problem by loading the environment manually, as below:

@custom_job_submit_command qsub -cwd -S /bin/bash -pe node {{cores}} source /etc/profile.d/modules_bash.sh ; module add clusterflow ; {{command}}

In addition, one thing I could not work out: how do you load the fastqc and bismark modules?

ewels commented 8 years ago

Ah ok, great to know that you managed to get it working like that.

Cluster Flow has a function called load_environment_modules which tries to automatically load the required environment modules. It is called when Cluster Flow launches and collects the requirements from all of the modules that will run, so it again relies on the head-node environment being passed on to the compute nodes.
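If you want to check whether the head-node environment actually survives the trip on a given cluster, a quick diagnostic (not part of Cluster Flow, just a throwaway SGE job) is to submit something trivial and inspect its output file:

echo 'module list 2>&1 ; which fastqc' | qsub -cwd -V -S /bin/bash -N envtest

If the job's output shows no modules loaded, -V is not doing what Cluster Flow expects on that setup.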

I suspect that it might be easiest for you to create a bash script which loads all environment modules ever used by Cluster Flow in your custom job submit command. Cluster Flow just expects tools to be available on the command line, so it doesn't really matter how you make them available.
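A minimal sketch of such a script (the file name and module list are placeholders; fill in whatever tools your Cluster Flow pipelines actually call):

#!/bin/bash
# cf_modules.sh - load every environment module that Cluster Flow
# pipelines need on this cluster, so jobs do not depend on -V.
source /etc/profile.d/modules_bash.sh
module add clusterflow
module add fastqc
module add bismark
module add bowtie2
# ...extend with the rest of your tool list

It could then be sourced from the custom job submit command in the same way as your workaround above, e.g. replacing the inline module add with source /path/to/cf_modules.sh.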

Phil

machbio commented 8 years ago

Thanks for the explanation about loading the modules.

Yes, I will have to write a bash script that loads all the modules ever required by Cluster Flow; for now I am scraping through all of the modules that Cluster Flow requires. The hard part will be convincing the sysadmin to make this new bash script available on all the cluster nodes.
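For illustration, something along these lines might pull a first-pass tool list out of the Cluster Flow module scripts (the install path is a placeholder, and the pattern assumes the scripts invoke module load/add directly; adjust both for your version and site):

# Hypothetical one-liner: list environment modules referenced anywhere
# in the Cluster Flow module scripts.
grep -rhoE 'module (load|add) [A-Za-z0-9_./-]+' /path/to/clusterflow/modules/ | sort -u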

Closing the ticket, as my scenario seems to be a special case.