DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

toil-cwl-runner with workDir still using default temporary directory #1897

Closed wdesouza closed 5 months ago

wdesouza commented 7 years ago

I am testing toil-cwl-runner with the --workDir parameter to keep temporary files out of the /tmp directory, but it is not working: Toil still writes temporary files to /tmp. Is it possible to instruct Toil to write all temporary and cached files to a user-specified directory?

Command line:

mkdir /home/welliton/tmp
toil-cwl-runner --outdir /home/welliton/results --logFile logs.txt --workDir /home/welliton/tmp workflow.cwl inputs.yml

Toil version: 3.11.0

┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-206

joelarmstrong commented 7 years ago

Grepping around the toil source code it does look like there are still a few files in /tmp created by toil itself. But there shouldn't be many of them, and they should be pretty small.

How big are these files? They might be created by your CWL workflow rather than Toil itself.

You could try setting the environment variable TMPDIR=/home/welliton/tmp before running: that should force everything to go to the directory you want.
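As a rough illustration of why setting TMPDIR can redirect temporary files: Python's standard `tempfile` module consults TMPDIR, TEMP, and TMP (in that order) the first time a temporary directory is needed, then caches the answer. The sketch below shows only that standard-library behavior. Whether every code path in Toil or in a CWL workflow goes through `tempfile` is an assumption, and `resolve_tempdir` is a hypothetical helper name.

```python
import os
import tempfile


def resolve_tempdir(tmpdir):
    """Return the directory tempfile picks when TMPDIR is set to tmpdir."""
    saved_env = os.environ.get("TMPDIR")
    saved_cache = tempfile.tempdir
    try:
        os.environ["TMPDIR"] = tmpdir
        # tempfile caches its choice; clear it so the environment is re-read.
        tempfile.tempdir = None
        return tempfile.gettempdir()
    finally:
        # Restore the previous state so the calling process is unaffected.
        tempfile.tempdir = saved_cache
        if saved_env is None:
            os.environ.pop("TMPDIR", None)
        else:
            os.environ["TMPDIR"] = saved_env
```

Because the value is cached on first use, the variable has to be exported before the process starts. A directory that does not exist or is not writable is skipped in favor of the next candidate.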

wdesouza commented 7 years ago

I am processing multiple FASTQ files (total size 54 GB) using Docker containers. CWL file: https://github.com/labbcb/tool-rqc/blob/master/Rqc.cwl

First Toil created the directory /tmp/tmpwzXyvO/ with 56 GB. Then Toil created the directory /home/welliton/tmp/toil-f0b247e8-b6b8-4fda-995e-09ff3f10988f-a8c12647ed20e97601dcb817551088b8 with 54 GB. I guess Toil is copying the input files twice before executing the workflow step. After the workflow completed, Toil deleted both temporary directories.

I tested with the environment variable, but Toil still creates a temporary directory in /tmp.

mr-c commented 7 years ago

@Welliton309 If Toil is using hardlinks, then it might not be making whole new copies.
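To make the hardlink point concrete, here is a minimal sketch that sums file sizes across several directories while counting each underlying file only once. It uses apparent sizes (`st_size`), not allocated blocks, and `disk_usage_unique` is a hypothetical helper, not part of Toil.

```python
import os


def disk_usage_unique(*roots):
    """Sum apparent file sizes under roots, counting hard-linked files once."""
    seen = set()
    total = 0
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                st = os.lstat(os.path.join(dirpath, name))
                key = (st.st_dev, st.st_ino)  # identifies the underlying file
                if key in seen:
                    continue  # another name for data already counted
                seen.add(key)
                total += st.st_size
    return total
```

If the total for both directories together is close to the total for one of them, the second directory mostly holds links rather than copies. Hard links cannot span filesystems, so this can only happen when both directories live on the same partition.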

wdesouza commented 7 years ago

@mr-c I have used the commands du -hcs <dir> and df -h. I noticed this behavior because Toil failed with the error message "no free disk space". I had to clean up /tmp and run the workflow again. In my case the /home directory is on a different partition, which has more free disk space than /tmp.
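The same two checks can be sketched in Python with the standard library: whether two paths share a filesystem, and whether a partition has enough room before a run starts. Both function names are hypothetical, and the required size is something the caller has to estimate.

```python
import os
import shutil


def same_filesystem(path_a, path_b):
    """True if both paths live on the same device (partition)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def has_free_space(path, required_bytes):
    """True if the filesystem holding path has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes
```

For the run described here, one might call `has_free_space("/tmp", 56 * 1024**3)` before starting, and `same_filesystem("/tmp", "/home/welliton/tmp")` to confirm the two directories really are on separate partitions.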

mr-c commented 7 years ago

@Welliton309 Okay -- I suspected you had checked but I wanted to be sure.

brucehoff commented 6 years ago

We encountered this problem as well. To repro, use this .cwl 'tool' that prints the current directory:

#!/usr/bin/env cwl-runner
#
#  This sample workflow simply prints the current directory
#
cwlVersion: v1.0
class: CommandLineTool
baseCommand: pwd
inputs: []

stdout: stdout.txt

outputs:
  - id: stdout
    type: File
    outputBinding:
      glob: stdout.txt

Run

toil-cwl-runner pwd.cwl

Result is a file stdout.txt containing the path to the default temp dir. We should be able to override it with:

toil-cwl-runner --workDir /some/other/path pwd.cwl

however the path in stdout.txt is the same.

On one platform, setting the environment variables TMP, TEMP, and TMPDIR to the desired path works, but this workaround doesn't work universally.
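To check the outcome of the repro above automatically, a small helper can test whether the path written to stdout.txt lies under the directory passed to --workDir. This is only a sketch: `is_inside` is a hypothetical name, and it resolves symlinks on both sides before comparing.

```python
import os


def is_inside(path, directory):
    """True if path is directory itself or lies somewhere beneath it."""
    path = os.path.realpath(path)
    directory = os.path.realpath(directory)
    # commonpath compares whole path components, so /a/work2 is not in /a/work.
    return os.path.commonpath([path, directory]) == directory
```

After a run, one would read the single line in stdout.txt, strip the trailing newline, and pass it together with the --workDir value, for example `is_inside(reported_path, "/some/other/path")`.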

unito-bot commented 5 months ago

➤ Adam Novak commented:

We don’t think that this is still likely to be a problem, but we’ll check since we’re revising workdir selection for another issue.

stxue1 commented 5 months ago

It seems this is no longer an issue. After running toil-cwl-runner --workDir test_subdir pwd.cwl, my stdout.txt contains:

/home/heaucques/Documents/toil-examples/test_subdir/toilwf-a4200ceea6145dc0abb756e6f4516be9/c578/job/tmpxd5xlrrv/tmp-outc3gxkh0a

unito-bot commented 5 months ago

➤ Adam Novak commented:

stxue says this is no longer a problem.