DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Virtualized file not found in toil-wdl-runner when multiple stdout() calls in output "collide" #4959

Closed by stxue1 2 weeks ago

stxue1 commented 4 weeks ago

In this workflow:

version 1.1
workflow wf{
  call collide_task
  output {
    String s = collide_task.s
    File f = collide_task.f
  }
}
task collide_task {
  input {
  }

  command <<<
    echo hello
  >>>

  output {
    String s = read_string(stdout())
    File f = stdout()
  }

  runtime {
    container: "ubuntu:latest"
  }
}

Because String s and File f in collide_task both depend on stdout(), the run fails with a runtime error:

[2024-05-31T14:23:57-0700] [MainThread] [I] [toil] Running Toil version 7.1.0a1-ccf57e6071e32675daabdcbacb91988e871745a9 on host pop-os.
[2024-05-31T14:23:59-0700] [MainThread] [I] [toil.leader] 0 jobs are running, 0 jobs are issued and waiting to run
[2024-05-31T14:23:59-0700] [MainThread] [I] [toil.leader] Issued job 'WDLTaskJob' wf.collide_task.command kind-WDLTaskJob/instance-rd3xllth v1 with job batch system ID: 2 and disk: 2.0 Gi, memory: 2.0 Gi, cores: 1, accelerators: [], preemptible: False
[2024-05-31T14:24:04-0700] [MainThread] [I] [toil.leader] Finished toil run successfully.

Workflow Progress 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 (0 failures) [00:06<00:00, 0.54 jobs/s]
Traceback (most recent call last):
  File "/home/heaucques/Documents/toil/venv3.12/bin/toil-wdl-runner", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/heaucques/Documents/toil/src/toil/wdl/wdltoil.py", line 141, in decorated
    return decoratee(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/heaucques/Documents/toil/src/toil/wdl/wdltoil.py", line 3172, in main
    output_bindings = map_over_files_in_bindings(output_bindings, devirtualize_output)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/heaucques/Documents/toil/src/toil/wdl/wdltoil.py", line 1272, in map_over_files_in_bindings
    return map_over_typed_files_in_bindings(bindings, lambda _, x: transform(x))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/heaucques/Documents/toil/src/toil/wdl/wdltoil.py", line 1262, in map_over_typed_files_in_bindings
    return environment.map(lambda b: map_over_typed_files_in_binding(b, transform))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/heaucques/Documents/toil/venv3.12/lib/python3.12/site-packages/WDL/Env.py", line 151, in map
    fb = f(b)
         ^^^^
  File "/home/heaucques/Documents/toil/src/toil/wdl/wdltoil.py", line 1262, in <lambda>
    return environment.map(lambda b: map_over_typed_files_in_binding(b, transform))
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/heaucques/Documents/toil/src/toil/wdl/wdltoil.py", line 1281, in map_over_typed_files_in_binding
    return WDL.Env.Binding(binding.name, map_over_typed_files_in_value(binding.value, transform), binding.info)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/heaucques/Documents/toil/src/toil/wdl/wdltoil.py", line 1307, in map_over_typed_files_in_value
    new_path = transform(value.type, value.value)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/heaucques/Documents/toil/src/toil/wdl/wdltoil.py", line 1272, in <lambda>
    return map_over_typed_files_in_bindings(bindings, lambda _, x: transform(x))
                                                                   ^^^^^^^^^^^^
  File "/home/heaucques/Documents/toil/src/toil/wdl/wdltoil.py", line 3169, in devirtualize_output
    return ToilWDLStdLibBase.devirtualize_to(filename, output_directory, toil, execution_dir)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/heaucques/Documents/toil/src/toil/wdl/wdltoil.py", line 689, in devirtualize_to
    raise RuntimeError(f"Virtualized file {filename} looks like a local file but isn't!")
RuntimeError: Virtualized file /tmp/toilwf-c36d0778387b5833a0e499628aa69737/efa6/job/stdout.txt looks like a local file but isn't!

The file in /tmp gets deleted even though it shouldn't be. The error does not occur when I remove the read_string() call, so it only appears when read_string(stdout()) and stdout() are both present and "colliding".
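To make the failure mode concrete, here is a minimal sketch of the kind of check that produces this error. It is not Toil's actual devirtualize_to code: the function name classify_output and the VIRTUAL_SCHEMES tuple are assumptions made up for illustration. It only shows how a plain local path that has been deleted ends up "looking like a local file" while no longer existing.

```python
import os
import tempfile

# Hypothetical prefixes that mark a filename as virtualized (a URI rather
# than a plain local path). The real set used by Toil may differ.
VIRTUAL_SCHEMES = ("toilfile:", "http:", "https:", "s3:", "gs:", "file:")


def classify_output(filename: str) -> str:
    """Return 'virtual' or 'local'; raise if a local-looking path is missing."""
    if filename.startswith(VIRTUAL_SCHEMES):
        return "virtual"
    if os.path.exists(filename):
        return "local"
    raise RuntimeError(f"{filename} looks like a local file but does not exist")


# Simulate a job directory that is cleaned up before outputs are collected.
with tempfile.TemporaryDirectory() as job_dir:
    stdout_path = os.path.join(job_dir, "stdout.txt")
    with open(stdout_path, "w") as handle:
        handle.write("hello\n")
    # While the job directory exists, the path is a valid local file.
    status_during_job = classify_output(stdout_path)

# After cleanup, the same path is dangling and the check fails.
try:
    classify_output(stdout_path)
    status_after_cleanup = "local"
except RuntimeError:
    status_after_cleanup = "dangling"
```

In this sketch the File output still carries the raw job-directory path instead of a virtualized URI, which matches the path shown in the RuntimeError above.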

This happens at the very end of the run, after the WDL workflow has finished running: https://github.com/DataBiosphere/toil/blob/52f1469dc16f1daeffb377bc73e16f0b90cea221/src/toil/wdl/wdltoil.py#L3161-L3169

The outputs of WDLTaskJob appear to be correct, though. Somewhere after task execution but before the final devirtualization step, the file gets deleted. I'm currently unsure where the file is deleted or why the deletion depends on this "collision".
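One way to narrow this down is to check whether the file still exists after each stage between task execution and devirtualization. The helper below is only a generic debugging sketch: first_step_that_removes is a hypothetical name, and the three steps are stand-ins rather than real Toil functions. In practice each step would wrap the actual stage being investigated.

```python
import os
import tempfile
from typing import Callable, Iterable, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


def first_step_that_removes(path: str, steps: Iterable[Step]) -> Optional[str]:
    """Run steps in order; return the name of the first one after which path is gone."""
    for name, step in steps:
        step()
        if not os.path.exists(path):
            return name
    return None


# Stand-in for the task's stdout file.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as handle:
    handle.write("hello\n")


def read_contents() -> None:
    # Stand-in for read_string(stdout()): reads the file, leaves it in place.
    with open(path) as handle:
        handle.read()


culprit = first_step_that_removes(
    path,
    [
        ("read_string", read_contents),
        ("cleanup", lambda: os.remove(path)),  # stand-in for whatever deletes the file
        ("devirtualize", lambda: None),        # stand-in for the final output step
    ],
)
```

Here culprit is the name of the stage that removed the file, or None if the file survived every stage.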

┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-1583