garstka / idact

A Python 3.5+ library that takes care of several tedious aspects of working with big data on an HPC cluster.
MIT License

Increase timeout on JUPYTER_JSON #11

Open grzanka opened 5 years ago

grzanka commented 5 years ago

This timeout is very often the source of a RuntimeError.

garstka commented 5 years ago

Could you post your error message, i.e. the line starting with "Retried and failed: ..."? Have you tried adjusting the timeout manually? Can you suggest an appropriate value? Tutorial: Adjusting timeouts

grzanka commented 5 years ago

# grzanka at Leszeks-Air.fritz.box in ~ [10:05:29]
→ idact-notebook pro --nodes 1 --cores 1 --memory-per-node 200GiB --walltime 0:14:00 --native-arg -p plgrid-bigmem --native-arg -A intdata  
Loading environment.
Allocation parameters:
    Nodes: 1
    Cores: 1
    Memory per node: 200GiB
    Walltime: 0:14:00
    Native arguments:
      -p -> plgrid-bigmem
      -A -> intdata

Allocating nodes.
2019-03-09 12:23:57 INFO: Installing key in '.ssh/authorized_keys.idact' for access to compute nodes.
2019-03-09 12:23:57 INFO: Creating the ssh directory.
2019-03-09 12:24:06 INFO: Still pending or configuring...
2019-03-09 12:24:11 INFO: Still pending or configuring...
2019-03-09 12:24:16 INFO: Still pending or configuring...
2019-03-09 12:24:21 INFO: Still pending or configuring...
2019-03-09 12:24:26 INFO: Still pending or configuring...
2019-03-09 12:24:31 INFO: Still pending or configuring...
(...)
2019-03-09 12:29:28 INFO: Still pending or configuring...
2019-03-09 12:29:32 INFO: Still pending or configuring...
2019-03-09 12:30:03 INFO: Retried and failed: config.retries[Retry.JUPYTER_JSON].{count=15, seconds_between=1}
2019-03-09 12:30:03 ERROR: Failure: Obtaining info about notebook from json file.
2019-03-09 12:30:08 INFO: Cancelling job 15137699.
2019-03-09 12:30:11 ERROR: Exception raised.
Traceback (most recent call last):
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/nodes/node_impl.py", line 111, in run_task
    result = fabric.tasks.execute(task)
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/fabric/tasks.py", line 427, in execute
    results['<local-only>'] = task.run(*args, **new_kwargs)
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/fabric/tasks.py", line 174, in run
    return self.wrapped(*args, **kwargs)
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/jupyter/deploy_jupyter.py", line 90, in load_nbserver_json
    nbserver_json_path=nbserver_json_path))
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/fabric/network.py", line 692, in host_prompting_wrapper
    return func(*args, **kwargs)
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/fabric/operations.py", line 1095, in run
    shell_escape=shell_escape, capture_buffer_size=capture_buffer_size,
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/fabric/operations.py", line 959, in _run_command
    error(message=msg, stdout=out, stderr=err)
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/fabric/utils.py", line 359, in error
    return func(message)
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/fabric/utils.py", line 55, in abort
    raise env.abort_exception(msg)
RuntimeError: run() received nonzero return code 1 while executing!

Requested: cat '/net/people/plgkongruencj/.idact/runtime/diTqoICmm7sLfXF2SapvZ7ypnWNOyLzd/nbserver-*.json' > /dev/null
Executed: /bin/bash --noprofile -l -c "cat '/net/people/plgkongruencj/.idact/runtime/diTqoICmm7sLfXF2SapvZ7ypnWNOyLzd/nbserver-*.json' > /dev/null"

NoneType: None

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/helper/retry.py", line 50, in retry_with_config
    seconds_between_retries=seconds_between)
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/helper/retry.py", line 80, in retry
    raise e
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/helper/retry.py", line 77, in retry
    return fun()
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/jupyter/deploy_jupyter.py", line 97, in <lambda>
    lambda: node.run_task(task=load_nbserver_json),
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/nodes/node_impl.py", line 117, in run_task
    raise RuntimeError("Cannot run task.") from e
RuntimeError: Cannot run task.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/jupyter_app/main.py", line 144, in main
    notebook = nodes[0].deploy_notebook()
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/nodes/node_impl.py", line 213, in deploy_notebook
    local_port=local_port)
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/jupyter/deploy_jupyter.py", line 99, in deploy_jupyter
    config=node.config)
  File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/helper/retry.py", line 58, in retry_with_config
    raise RuntimeError(message) from e
RuntimeError: Retried and failed: config.retries[Retry.JUPYTER_JSON].{count=15, seconds_between=1}

grzanka commented 5 years ago

Increasing seconds_between to 5 seems to help, but I need to run it more often to get better statistics.
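
For reference, here is a rough sketch of how a retry setting with a count and a seconds_between value bounds the total waiting time. It is not idact's actual implementation: the `retry` and `max_wait` helpers below are hypothetical, and the sketch assumes one initial attempt followed by up to `count` retries, which may differ from how idact counts attempts.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(fun: Callable[[], T], count: int, seconds_between: float,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fun once, then retry up to `count` more times on RuntimeError.

    Hypothetical helper for illustration only, not part of idact.
    """
    for attempt in range(count + 1):
        try:
            return fun()
        except RuntimeError:
            if attempt == count:
                # Out of retries: let the last error propagate.
                raise
            sleep(seconds_between)
    raise ValueError("count must be non-negative")


def max_wait(count: int, seconds_between: float) -> float:
    """Upper bound on the time spent sleeping before giving up."""
    return count * seconds_between


if __name__ == "__main__":
    # Sleep budget for the setting from the log and for the adjusted one.
    print(max_wait(15, 1))
    print(max_wait(15, 5))
```

Under these assumptions, count=15 with seconds_between=1 leaves about 15 seconds of sleep time for the nbserver JSON file to appear, while seconds_between=5 raises that budget to about 75 seconds. The time each attempt itself takes (for example, running the remote command) comes on top of this.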