DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Toil generating too long paths #3380

Closed caballero closed 3 years ago

caballero commented 3 years ago

I am running Toil 5.1.0 on an LSF cluster. One of the steps in my workflow (Flye assembly) is failing because it calls socket.bind from Python, and the error is "AF_UNIX path too long".

This is the error on Flye step:

Traceback (most recent call last):
  File "/hps/nobackup2/production/metagenomics/jcaballero/miniconda3/envs/mgnify-lr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/hps/nobackup2/production/metagenomics/jcaballero/miniconda3/envs/mgnify-lr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/hps/nobackup2/production/metagenomics/jcaballero/miniconda3/envs/mgnify-lr/lib/python3.8/multiprocessing/managers.py", line 608, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/hps/nobackup2/production/metagenomics/jcaballero/miniconda3/envs/mgnify-lr/lib/python3.8/multiprocessing/managers.py", line 154, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/hps/nobackup2/production/metagenomics/jcaballero/miniconda3/envs/mgnify-lr/lib/python3.8/multiprocessing/connection.py", line 448, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/hps/nobackup2/production/metagenomics/jcaballero/miniconda3/envs/mgnify-lr/lib/python3.8/multiprocessing/connection.py", line 592, in __init__
    self._socket.bind(address)
OSError: AF_UNIX path too long
Traceback (most recent call last):
  File "/hps/nobackup2/production/metagenomics/jcaballero/miniconda3/envs/mgnify-lr/bin/flye", line 33, in <module>
    sys.exit(load_entry_point('flye==2.8.2', 'console_scripts', 'flye')())
  File "/hps/nobackup2/production/metagenomics/jcaballero/miniconda3/envs/mgnify-lr/lib/python3.8/site-packages/flye/main.py", line 797, in main
    _run(args)
  File "/hps/nobackup2/production/metagenomics/jcaballero/miniconda3/envs/mgnify-lr/lib/python3.8/site-packages/flye/main.py", line 576, in _run
    jobs[i].run()
  File "/hps/nobackup2/production/metagenomics/jcaballero/miniconda3/envs/mgnify-lr/lib/python3.8/site-packages/flye/main.py", line 328, in run
    consensus_fasta = cons.get_consensus(out_alignment, chunks_file,
  File "/hps/nobackup2/production/metagenomics/jcaballero/miniconda3/envs/mgnify-lr/lib/python3.8/site-packages/flye/polishing/consensus.py", line 60, in get_consensus
    aln_reader = SynchronizedSamReader(alignment_path,
  File "/hps/nobackup2/production/metagenomics/jcaballero/miniconda3/envs/mgnify-lr/lib/python3.8/site-packages/flye/utils/sam_parser.py", line 143, in __init__
    self.shared_manager = multiprocessing.Manager()
  File "/hps/nobackup2/production/metagenomics/jcaballero/miniconda3/envs/mgnify-lr/lib/python3.8/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/hps/nobackup2/production/metagenomics/jcaballero/miniconda3/envs/mgnify-lr/lib/python3.8/multiprocessing/managers.py", line 583, in start
    self._address = reader.recv()
  File "/hps/nobackup2/production/metagenomics/jcaballero/miniconda3/envs/mgnify-lr/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/hps/nobackup2/production/metagenomics/jcaballero/miniconda3/envs/mgnify-lr/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/hps/nobackup2/production/metagenomics/jcaballero/miniconda3/envs/mgnify-lr/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

After modifying "connection.py" to print the address it tries to bind, I got:

/hps/nobackup2/production/metagenomics/jcaballero/runs/work-dir/tmp/test1/node-3c208199-ebf7-4edb-a6ef-07a97ba272ee-8e3d2d40-890f-42d2-b1ea-26c9e6993f0c/tmpg3_r9gh2/ece077c9-2665-4541-b213-cd237817ce30/t0xg4ust_bpa8f12p/pymp-b4sxc7i1/listener-0zzrgznm

which is 252 chars, far beyond the limit for Unix socket paths (about 108 bytes on Linux, 104 on macOS).
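To check the limit independently of Flye, a minimal sketch like the one below tries to bind a Unix socket at a short path and at a deliberately long one. The helper name `can_bind` and the padded file name are made up for illustration, and the exact cutoff depends on the platform.

```python
import os
import socket
import tempfile


def can_bind(path: str) -> bool:
    """Return True if an AF_UNIX socket can be bound at `path`."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
        return True
    except OSError:
        # "AF_UNIX path too long" lands here; so would any other bind failure.
        return False
    finally:
        sock.close()
        # bind() creates a socket file on success; remove it again.
        if os.path.exists(path):
            os.unlink(path)


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    short_path = os.path.join(base, "listener")
    long_path = os.path.join(base, "x" * 200)  # hypothetical over-long name
    print(len(short_path), can_bind(short_path))
    print(len(long_path), can_bind(long_path))
```

The second call should report a failure on any system where the socket path limit is below 200 bytes, which matches the error in the traceback.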

Is there any chance that Toil could generate shorter paths?

┆Issue is synchronized with this Jira Task ┆Issue Number: TOIL-758

mr-c commented 3 years ago

Notes for others reviewing this: I presume that @caballero is using toil-cwl-runner and one of his CWL CommandLineTools calls a local program named flye.

@caballero, is that too-long path the working directory (also known as $(runtime.outdir)) or the temporary directory (also known as $(runtime.tmpdir))?

I bet the reason this hasn't come up before is that when toil-cwl-runner is used with a DockerRequirement then the long paths get mapped into short paths via docker or singularity so the problem goes away.

@caballero, as a workaround, try setting TMPDIR to a shorter path as part of the CWL CommandLineTool description for flye:

hints:
  EnvVarRequirement:
    envDef:
      TMPDIR: /tmp  # or /scratch/something; whatever is appropriate for where this is running

mr-c commented 3 years ago

I found that workaround by doing a web search for "python multiprocessing path too long", which led me to https://github.com/broadinstitute/cromwell/issues/3647
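As for why the TMPDIR workaround should help: judging by the `pymp-` and `listener-` components in the reported path, multiprocessing appears to build its socket path under the process's temporary directory, and Python's `tempfile` module picks that directory from TMPDIR. A rough sketch to see the effect is below. The helper name `tempdir_seen_by_child` is made up for illustration, and a fresh interpreter is started because `tempfile` caches its choice.

```python
import os
import subprocess
import sys
import tempfile


def tempdir_seen_by_child(tmpdir: str) -> str:
    """Start a fresh Python process with TMPDIR=tmpdir and report its temp directory."""
    env = dict(os.environ, TMPDIR=tmpdir)
    result = subprocess.run(
        [sys.executable, "-c", "import tempfile; print(tempfile.gettempdir())"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Stand-in for a short scratch directory such as /tmp or /scratch/something.
    short_dir = tempfile.mkdtemp()
    print(tempdir_seen_by_child(short_dir))
```

If the child reports the short directory, anything it creates through `tempfile` starts from that shorter prefix.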

DailyDreaming commented 3 years ago

@mr-c Ha, I enjoyed the cromwell link. I liked their notion of checking the initial length of the tmpdir and we should probably incorporate that as well.
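A check along those lines could look roughly like the sketch below. This is not Toil's actual implementation: the function name, the conservative 100-byte limit, and the suffix (copied from the shape of the path reported above) are all assumptions for illustration.

```python
import os

# Assumption: stay safely below the sun_path size on Linux (108) and macOS (104).
CONSERVATIVE_SOCKET_PATH_LIMIT = 100
# Shape of what multiprocessing appended in the path reported in this issue.
MULTIPROCESSING_SUFFIX = "/pymp-b4sxc7i1/listener-0zzrgznm"


def socket_headroom(
    tmpdir: str,
    suffix: str = MULTIPROCESSING_SUFFIX,
    limit: int = CONSERVATIVE_SOCKET_PATH_LIMIT,
) -> int:
    """Bytes left under `limit` once `suffix` is appended to `tmpdir`.

    A negative result means a socket created under `tmpdir` would likely
    fail with "AF_UNIX path too long".
    """
    full_path = os.path.abspath(tmpdir) + suffix
    return limit - len(os.fsencode(full_path))


if __name__ == "__main__":
    for candidate in ("/tmp", "/work/" + "x" * 200):
        room = socket_headroom(candidate)
        status = "ok" if room >= 0 else "too long"
        print(f"{len(candidate):4d} chars, headroom {room:5d}: {status}")
```

Running a check like this once at startup would let the workflow warn about a too-long temp directory before any job tries to open a socket.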

This ticket may call for reworking how the directories in the filestore are laid out (currently jobtempdir inside of workflowdir inside of workdir). I'll try to explore this tomorrow.

DailyDreaming commented 3 years ago

#3438 should reduce this a bit.

caballero commented 3 years ago

This was solved in 5.2.0, thanks devs