DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Symlink creation error when step fails #3757

Closed gmloose closed 3 years ago

gmloose commented 3 years ago

Problem description

In certain situations, when a step fails with a non-zero exit status, Toil tries to create symbolic links to a temporary directory and the files inside it. This fails with a FileExistsError exception. What seems to happen is the following:

  1. Create a symbolic link to a temporary directory.
  2. Create a symbolic link to a file in the temporary directory, using the symbolic link created in the previous step. This second step is bound to fail: because the link created in the first step is dereferenced, the new link would overwrite the file in the temporary directory.

Or, conceptually:

  1. ln -s /tmp/dir dir
  2. ln -s /tmp/dir/file dir/file
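The two conceptual steps above can be reproduced outside Toil with a few lines of Python. This is only a minimal sketch of the collision itself, assuming a POSIX system (Linux or macOS). It is not Toil's or cwltool's staging code, and the directory and file names are placeholders.

```python
import os
import tempfile


def reproduce_symlink_collision():
    """Return the name of the exception raised by the second symlink call, or None."""
    with tempfile.TemporaryDirectory() as real_parent, \
            tempfile.TemporaryDirectory() as stage:
        # Stand-in for the temporary directory and the file inside it.
        real_dir = os.path.join(real_parent, "dir")
        os.mkdir(real_dir)
        real_file = os.path.join(real_dir, "file")
        with open(real_file, "w") as handle:
            handle.write("data\n")

        # Step 1: ln -s /tmp/dir dir
        dir_link = os.path.join(stage, "dir")
        os.symlink(real_dir, dir_link)

        # Step 2: ln -s /tmp/dir/file dir/file
        # "dir" is dereferenced, so the destination is the real file itself.
        try:
            os.symlink(real_file, os.path.join(dir_link, "file"))
        except FileExistsError as error:
            return type(error).__name__
        return None


if __name__ == "__main__":
    print(reproduce_symlink_collision())
```

The second call fails because the destination path resolves, through the directory link, to the very file the new link is supposed to point at.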

It is not exactly clear under which circumstances Toil tries to create these symbolic links. It appears that the following pre-conditions have to be met:

  1. The workflow needs to contain more than one step.
  2. The second (or last?) step must fail.
  3. The failing step must have more than one input parameter, although these input parameters don't have to be used.

Demonstration

The workflow consists of two steps. The first step will select the first entry from a list of directories, and pass that entry as input to the second step. The second step will simply generate an exit status 1. The extra input parameter min_separation is not used, but the error only occurs when it is specified. The workflow validates without warnings and runs without errors using cwltool. When using toil-cwl-runner, I get a WorkflowException due to an unhandled FileExistsError exception. You can run the workflow as follows:

$ toil-cwl-runner workflow.cwl job.json

The code can be downloaded from https://github.com/gmloose/toil_symlink_bug

Software versions being used

python : 3.6.9
toil   : 5.4.0-87293d63fa6c76f03bed3adf93414ffee67bf9a7
cwltool: 3.0.20201203173111

Issue is synchronized with this Jira Task. Issue Number: TOIL-997

adamnovak commented 3 years ago

The expected behavior here would be for the workflow to fail with a different exception, noting that the failing job had failed, right?

We actually hooked up CWL execution to use Toil's FileStore in f194cbfc206f12b8016009205428ba10db2d95b0. That's after the version you're using, and it more or less replaces all the file staging logic that was previously used. So we'll have to see if that ended up fixing this issue. But it still might exist with the --bypass-file-store option, even in the current development version.

Presumably you are using Toil's default --retryCount of 1, so the failed job will be retried. Maybe cwltool is doing something to the filesystem in the first run that isn't cleaned up and trips it up when it goes to stage files for the second run.
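To illustrate the hypothesis above (and it is only a hypothesis), here is a small Python sketch of how a staging step that is not idempotent can succeed on the first attempt and fail on a retry when leftovers are not cleaned up. The functions `stage_naive`, `stage_idempotent`, and `run_with_retry` are hypothetical names made up for this example. They are not part of Toil or cwltool.

```python
import os
import tempfile


def stage_naive(links):
    """Create each (target, link_path) symlink; fails if link_path already exists."""
    for target, link_path in links:
        os.symlink(target, link_path)


def stage_idempotent(links):
    """Create each symlink, skipping destinations that already resolve to the target."""
    for target, link_path in links:
        if os.path.lexists(link_path):
            if os.path.realpath(link_path) == os.path.realpath(target):
                continue  # Already staged, possibly by an earlier attempt.
            raise FileExistsError(f"{link_path} exists and points elsewhere")
        os.symlink(target, link_path)


def run_with_retry(stage, links, attempts=2):
    """Call stage(links) once per attempt and record the outcome of each attempt."""
    outcomes = []
    for _ in range(attempts):
        try:
            stage(links)
            outcomes.append("staged")
        except FileExistsError:
            outcomes.append("FileExistsError")
    return outcomes


def demo():
    """Run both staging variants twice without cleaning up in between."""
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "input.txt")
        with open(target, "w") as handle:
            handle.write("data\n")
        naive = run_with_retry(stage_naive, [(target, os.path.join(root, "naive_link"))])
        safe = run_with_retry(stage_idempotent, [(target, os.path.join(root, "safe_link"))])
        return naive, safe


if __name__ == "__main__":
    print(demo())
```

The naive variant fails on the second attempt because the link from the first attempt is still there, while the idempotent variant treats an existing link to the same target as already staged.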

gmloose commented 3 years ago

I retried with the current master branch and the issue is gone. In that case, I do have to pass the --bypass-file-store option by the way, because the temporary output files must be stored on a shared file system. I didn't fiddle with --retryCount, so I cannot tell if increasing that would also solve the issue.

adamnovak commented 3 years ago

> I do have to pass the --bypass-file-store option by the way, because the temporary output files must be stored on a shared file system.

I think you can also tell toil-cwl-runner to put intermediates on your shared filesystem even without bypassing the file store, with --jobStore to set the directory for files that need to move between jobs, and --workDir to set where the per-job scratch directories go.
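As a rough sketch of that suggestion, the following Python snippet assembles such a command line. The two `/shared/toil/...` paths are placeholders for locations on the shared filesystem, not values from this issue, and `build_toil_command` is a hypothetical helper. Only the `--jobStore` and `--workDir` options and the `workflow.cwl job.json` arguments come from the thread.

```python
import shlex


def build_toil_command(job_store, work_dir, workflow="workflow.cwl", job="job.json"):
    """Return the toil-cwl-runner argument list for the given directories."""
    return [
        "toil-cwl-runner",
        "--jobStore", job_store,  # files that need to move between jobs
        "--workDir", work_dir,    # per-job scratch directories
        workflow,
        job,
    ]


if __name__ == "__main__":
    # Placeholder paths; substitute directories on your own shared filesystem.
    command = build_toil_command("/shared/toil/jobstore", "/shared/toil/work")
    print(" ".join(shlex.quote(argument) for argument in command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to launch the workflow, or the printed line can be pasted into a shell.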

Glad to hear you have this working now! I'm going to close the issue.