Open grzanka opened 5 years ago
Could you post your error message, i.e. the line starting with "Retried and failed: ..."? Have you tried adjusting the timeout manually? If so, can you suggest an appropriate value?
Tutorial: Adjusting timeouts
# grzanka at Leszeks-Air.fritz.box in ~ [10:05:29]
→ idact-notebook pro --nodes 1 --cores 1 --memory-per-node 200GiB --walltime 0:14:00 --native-arg -p plgrid-bigmem --native-arg -A intdata
Loading environment.
Allocation parameters:
Nodes: 1
Cores: 1
Memory per node: 200GiB
Walltime: 0:14:00
Native arguments:
-p -> plgrid-bigmem
-A -> intdata
Allocating nodes.
2019-03-09 12:23:57 INFO: Installing key in '.ssh/authorized_keys.idact' for access to compute nodes.
2019-03-09 12:23:57 INFO: Creating the ssh directory.
2019-03-09 12:24:06 INFO: Still pending or configuring...
2019-03-09 12:24:11 INFO: Still pending or configuring...
2019-03-09 12:24:16 INFO: Still pending or configuring...
2019-03-09 12:24:21 INFO: Still pending or configuring...
2019-03-09 12:24:26 INFO: Still pending or configuring...
2019-03-09 12:24:31 INFO: Still pending or configuring...
(...)
2019-03-09 12:29:28 INFO: Still pending or configuring...
2019-03-09 12:29:32 INFO: Still pending or configuring...
2019-03-09 12:30:03 INFO: Retried and failed: config.retries[Retry.JUPYTER_JSON].{count=15, seconds_between=1}
2019-03-09 12:30:03 ERROR: Failure: Obtaining info about notebook from json file.
2019-03-09 12:30:08 INFO: Cancelling job 15137699.
2019-03-09 12:30:11 ERROR: Exception raised.
Traceback (most recent call last):
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/nodes/node_impl.py", line 111, in run_task
result = fabric.tasks.execute(task)
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/fabric/tasks.py", line 427, in execute
results['<local-only>'] = task.run(*args, **new_kwargs)
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/fabric/tasks.py", line 174, in run
return self.wrapped(*args, **kwargs)
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/jupyter/deploy_jupyter.py", line 90, in load_nbserver_json
nbserver_json_path=nbserver_json_path))
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/fabric/network.py", line 692, in host_prompting_wrapper
return func(*args, **kwargs)
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/fabric/operations.py", line 1095, in run
shell_escape=shell_escape, capture_buffer_size=capture_buffer_size,
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/fabric/operations.py", line 959, in _run_command
error(message=msg, stdout=out, stderr=err)
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/fabric/utils.py", line 359, in error
return func(message)
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/fabric/utils.py", line 55, in abort
raise env.abort_exception(msg)
RuntimeError: run() received nonzero return code 1 while executing!
Requested: cat '/net/people/plgkongruencj/.idact/runtime/diTqoICmm7sLfXF2SapvZ7ypnWNOyLzd/nbserver-*.json' > /dev/null
Executed: /bin/bash --noprofile -l -c "cat '/net/people/plgkongruencj/.idact/runtime/diTqoICmm7sLfXF2SapvZ7ypnWNOyLzd/nbserver-*.json' > /dev/null"
NoneType: None
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/helper/retry.py", line 50, in retry_with_config
seconds_between_retries=seconds_between)
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/helper/retry.py", line 80, in retry
raise e
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/helper/retry.py", line 77, in retry
return fun()
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/jupyter/deploy_jupyter.py", line 97, in <lambda>
lambda: node.run_task(task=load_nbserver_json),
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/nodes/node_impl.py", line 117, in run_task
raise RuntimeError("Cannot run task.") from e
RuntimeError: Cannot run task.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/jupyter_app/main.py", line 144, in main
notebook = nodes[0].deploy_notebook()
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/nodes/node_impl.py", line 213, in deploy_notebook
local_port=local_port)
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/jupyter/deploy_jupyter.py", line 99, in deploy_jupyter
config=node.config)
File "/Users/grzanka/Library/Python/3.6/lib/python/site-packages/idact/detail/helper/retry.py", line 58, in retry_with_config
raise RuntimeError(message) from e
RuntimeError: Retried and failed: config.retries[Retry.JUPYTER_JSON].{count=15, seconds_between=1}
Increasing seconds_between to 5 seems to help, but I need to run it more often to get better statistics.
This retry limit is very often a source of RuntimeError.
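To illustrate why a larger seconds_between can help, here is a minimal sketch of a count/seconds_between retry loop. It is not idact's actual code: the functions `retry` and `try_load`, the `FakeClock` helper, and the 40-second delay before the file appears are all assumptions made up for this example. The point is only that the total waiting budget is roughly count * seconds_between, so 15 retries at 1 second give about 15 seconds, while 15 retries at 5 seconds give about 75 seconds.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(fun: Callable[[], T], count: int, seconds_between: float,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fun once, then retry up to `count` more times on failure.

    Hypothetical helper for illustration, not idact's implementation.
    """
    for attempt in range(count + 1):
        try:
            return fun()
        except Exception as e:
            if attempt == count:
                # Out of retries: chain the last error, as in the traceback.
                raise RuntimeError(
                    "Retried and failed: count={}, seconds_between={}".format(
                        count, seconds_between)) from e
            sleep(seconds_between)
    raise AssertionError("unreachable")


class FakeClock:
    """Simulated clock so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def try_load(ready_at: float, count: int, seconds_between: float) -> str:
    """Simulate waiting for a file that appears `ready_at` seconds in."""
    clock = FakeClock()

    def load() -> str:
        if clock.now < ready_at:
            raise FileNotFoundError("nbserver json not written yet")
        return "nbserver.json found after {:.0f}s".format(clock.now)

    return retry(load, count, seconds_between, sleep=clock.sleep)


if __name__ == "__main__":
    # Assumed scenario: the notebook server needs 40 s to write its file.
    print(try_load(ready_at=40, count=15, seconds_between=5))
    try:
        try_load(ready_at=40, count=15, seconds_between=1)
    except RuntimeError as e:
        print(e)
```

In this simulated scenario the 1-second setting gives up after 15 seconds of waiting, while the 5-second setting succeeds on the ninth attempt. Whether the real delay on the cluster is closer to 15 or 75 seconds is exactly what more runs would need to show.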