broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments.
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Shared filesystem for multiple nodes - clarification #5802

Open · OgnjenMilicevic opened this issue 4 years ago

OgnjenMilicevic commented 4 years ago

I am reading your tutorial on HPC and I have a question that may be naive. The shared-filesystem section talks about the localization strategies for inputs, which is certainly an issue, but outputs are not mentioned.

Let's say I have several nodes in a cluster and a single volume shared between them, either a physical one or a software-defined one (like Lustre). I am using the Slurm backend, and any node can end up running any task based on internal Slurm scheduling. Ideally I would want each task to copy its inputs from the shared volume to a local folder, create its outputs there, and then copy the outputs back to the shared volume. I know final outputs can be delivered anywhere, but how can one control what happens to intermediate files? The problem would arise when subsequent tasks in the workflow run on different nodes. Is enforcing (one node)/(one workflow execution) even possible? Even if it is, it defeats the point of scheduling resources by availability.
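For reference, the input side of this is what the tutorial's localization strategies control in the backend configuration. A minimal sketch (HOCON, the format Cromwell configs use), assuming a Slurm-style backend named `SLURM` and that the shared volume is mounted at the same path on every compute node:

```hocon
# Hedged sketch of a Cromwell backend stanza -- names and paths here
# ("SLURM", /shared/cromwell-executions) are illustrative assumptions.
backend {
  default = SLURM
  providers {
    SLURM {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        # Execution root: putting this on the shared volume means every
        # node sees the same workflow directory tree, so intermediate
        # outputs are visible to downstream tasks on other nodes.
        root = "/shared/cromwell-executions"

        filesystems {
          local {
            # Input localization strategies, tried in order:
            # cheap links first, full copy as a fallback.
            localization = [ "hard-link", "soft-link", "copy" ]
          }
        }
      }
    }
  }
}
```

Note this only governs how inputs are brought into a task's execution directory; where that directory lives (and hence where intermediate outputs land) is set by `root`.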

The solution I can see is running Cromwell FROM the shared volume, but then everything would happen there, and the many tiny reads and writes would choke the volume and possibly cause wear on the hardware. Unless I can set a temp directory where outputs are written first?
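One workaround, under the assumption that the execution root stays on the shared volume, is to do the copy-in/compute/copy-out staging manually inside each task command using node-local scratch. A runnable sketch; `$SLURM_TMPDIR` is site-specific (some clusters export `$TMPDIR` instead, so it falls back to `/tmp` here), and `sort` stands in for a real tool:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Node-local scratch directory for this task (hypothetical layout).
SCRATCH="${SLURM_TMPDIR:-/tmp}/task-scratch-$$"
mkdir -p "$SCRATCH"

# Stand-in for an input Cromwell localized into the task directory.
printf 'b\na\nc\n' > input.txt

# 1. Copy inputs from the (shared) task directory to local disk.
cp input.txt "$SCRATCH/"

# 2. Do the I/O-heavy work against node-local storage.
sort "$SCRATCH/input.txt" > "$SCRATCH/output.txt"

# 3. Publish outputs back into the task directory so downstream tasks
#    on other nodes can see them on the shared volume.
cp "$SCRATCH/output.txt" .
rm -rf "$SCRATCH"
```

This keeps Cromwell's bookkeeping on the shared filesystem while the heavy I/O happens locally; the cost is that every task command has to carry the staging boilerplate itself.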

I am asking because I am not experienced and would like to know if there are solutions I am missing before I end up doing development on my own. Thanks!

EugeneEA commented 2 years ago

If someone could comment on this issue or point me to the right reference for investigation, I would be very grateful... @OgnjenMilicevic did you eventually find an answer to your question? Best, Eugene