cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

job script `mkdir` tweak #6000

Open hjoliver opened 7 months ago

hjoliver commented 7 months ago

Part of the job script boilerplate in `cylc/flow/etc/job.sh`:

    # Create share and work directories
    mkdir -p "${CYLC_WORKFLOW_SHARE_DIR}" || true
    mkdir -p "$(dirname "${CYLC_TASK_WORK_DIR}")" || true

I've traced the origin of this scripting to PR #17 🤯

It would be good if we could remove the `|| true` fail-safes, to make the task fail if those "directories" are actually dangling symlinks (e.g. the symlinked data dirs got whacked by a disk failover).
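To illustrate the failure mode, here is a minimal sketch that can be run in a scratch directory. The paths (`share`, `missing-target`) and variable names are made up for the demo and are not part of cylc-flow. It assumes a `mkdir -p` that reports an error when the target is a dangling symlink, which is what I'd expect from GNU coreutils.

```bash
#!/usr/bin/env bash
# Hypothetical demo only: all paths live under a throwaway temp directory.
tmp="$(mktemp -d)"

# Simulate a "directory" whose symlink target has disappeared.
ln -s "${tmp}/missing-target" "${tmp}/share"

# Current boilerplate: the error is swallowed, so the job would carry on.
if mkdir -p "${tmp}/share" 2>/dev/null || true; then
    with_failsafe=0
else
    with_failsafe=$?
fi

# Without the fail-safe: the non-zero exit status is visible to the caller.
if mkdir -p "${tmp}/share" 2>/dev/null; then
    without_failsafe=0
else
    without_failsafe=$?
fi

echo "with '|| true': ${with_failsafe}; without: ${without_failsafe}"
rm -rf "${tmp}"
```

In a real job script the second form would let the usual error trapping fail the task instead of letting it run on against a directory that does not exist.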

We've speculated that the `|| true` was meant to protect against multiple tasks trying to create these directories at the same time, but `mkdir -p` should have that covered.

However, @oliver-sanders correctly pointed out that changing anything this fundamental is risky.

Ping @matthewrmshin - if you're listening, as the author of that PR, do you recall your thought process from late September 2011? If so, you deserve a prize, but maybe it's worth asking!

matthewrmshin commented 7 months ago

Your speculation is most likely correct. It was to avoid multiple tasks (i.e., different jobs/processes on different nodes on an HPC/cluster) creating the same directories at the same time on a networked file system.

`mkdir -p` is definitely good enough when you are on a local file system. However, I don't really know how networked file systems will behave these days when you have multiple processes on multiple nodes trying to create the same directories.
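For what it's worth, the local-filesystem case is easy to check with a rough sketch like the one below. The directory names and the process count of 20 are arbitrary choices for illustration. It only exercises a single host and whatever filesystem backs `mktemp -d`, so it says nothing about multi-node behaviour on a networked file system.

```bash
#!/usr/bin/env bash
# Hypothetical demo only: races several `mkdir -p` calls on one host.
tmp="$(mktemp -d)"
target="${tmp}/share/cycle/task"

# Launch the processes in the background so they overlap.
pids=""
i=0
while [ "${i}" -lt 20 ]; do
    mkdir -p "${target}" &
    pids="${pids} $!"
    i=$((i + 1))
done

# Count how many of them exited non-zero.
failures=0
for pid in ${pids}; do
    wait "${pid}" || failures=$((failures + 1))
done

echo "failed mkdir -p calls: ${failures}"
rm -rf "${tmp}"
```

The same script pointed at a shared filesystem and launched from several nodes at once would be closer to the case in question, but that needs a real cluster to test.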

Also bear in mind that we had to handle both `ksh` and `bash` running on machines that are not GNU/Linux in the early days of Cylc. These days we only have to handle modern GNU/Linux systems, so at least you can rely on a more uniformly behaving `mkdir -p`.

(An alternate way to implement this is to use a `while [[ ! -d "${dir}" ]]; do ... done` loop with logic in the block to handle dangling symlinks.)
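A rough sketch of what such a loop might look like follows. The function name `ensure_dir`, the retry limit, and the one-second sleep are all assumptions made for illustration, not anything that exists in cylc-flow. It only checks the final path component for a dangling symlink; a broken link higher up the path would show up as repeated `mkdir -p` failures instead.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of cylc-flow.
# Usage: ensure_dir DIR [MAX_TRIES]
ensure_dir() {
    local dir="$1"
    local max_tries="${2:-5}"   # assumed default, tune as needed
    local try=0
    # -d follows symlinks, so a link to a real directory ends the loop.
    while [[ ! -d "${dir}" ]]; do
        # The path is a symlink but not a directory: its target is missing.
        if [[ -L "${dir}" ]]; then
            echo "ERROR: ${dir} is a dangling symlink" >&2
            return 1
        fi
        try=$((try + 1))
        if [[ "${try}" -gt "${max_tries}" ]]; then
            echo "ERROR: could not create ${dir}" >&2
            return 1
        fi
        # Tolerate a failed attempt (e.g. a creation race) and re-check.
        mkdir -p "${dir}" 2>/dev/null || sleep 1
    done
    return 0
}
```

Called as `ensure_dir "${CYLC_WORKFLOW_SHARE_DIR}"` with no `|| true`, this would keep the tolerance for creation races while still failing the task on a dangling symlink.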

hjoliver commented 7 months ago

Thanks for responding @matthewrmshin! Good point on networked filesystems, makes sense. We might have to investigate a bit...