DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

Cannot assign requested address #156

Open carbocation opened 5 years ago

carbocation commented 5 years ago

I ran a set of 50,000 jobs yesterday and they all appear to have completed successfully. However, the dsub tool (which I ran from Cloud Shell) yielded the following error at the end. This doesn't seem to have impacted the run, but I haven't seen this error reported before, so I wanted to mention it.

Waiting for job to complete...
2019-05-06 02:17:10.824555: Exception error: [Errno 99] Cannot assign requested address
Traceback (most recent call last):
  File "/home/jamesp/dsub/dsub_libs/bin/dsub", line 11, in <module>
    load_entry_point('dsub==0.3.1', 'console_scripts', 'dsub')()
  File "/home/jamesp/dsub/dsub_libs/local/lib/python2.7/site-packages/dsub-0.3.1-py2.7.egg/dsub/commands/dsub.py", line 956, in main
    dsub_main(prog, argv)
  File "/home/jamesp/dsub/dsub_libs/local/lib/python2.7/site-packages/dsub-0.3.1-py2.7.egg/dsub/commands/dsub.py", line 945, in dsub_main
    launched_job = run_main(args)
  File "/home/jamesp/dsub/dsub_libs/local/lib/python2.7/site-packages/dsub-0.3.1-py2.7.egg/dsub/commands/dsub.py", line 1028, in run_main
    unique_job_id=args.unique_job_id)
  File "/home/jamesp/dsub/dsub_libs/local/lib/python2.7/site-packages/dsub-0.3.1-py2.7.egg/dsub/commands/dsub.py", line 1141, in run
    poll_interval, retries, job_descriptor)
  File "/home/jamesp/dsub/dsub_libs/local/lib/python2.7/site-packages/dsub-0.3.1-py2.7.egg/dsub/commands/dsub.py", line 739, in _wait_and_retry
    for t in tasks:
  File "/home/jamesp/dsub/dsub_libs/local/lib/python2.7/site-packages/dsub-0.3.1-py2.7.egg/dsub/providers/google_v2.py", line 1124, in lookup_job_tasks
    page_size, page_token)
  File "/home/jamesp/dsub/dsub_libs/local/lib/python2.7/site-packages/dsub-0.3.1-py2.7.egg/dsub/providers/google_v2.py", line 1058, in _operations_list
    response = google_base.Api.execute(api)
  File "build/bdist.linux-x86_64/egg/retrying.py", line 49, in wrapped_f
  File "build/bdist.linux-x86_64/egg/retrying.py", line 206, in call
  File "build/bdist.linux-x86_64/egg/retrying.py", line 247, in get
  File "build/bdist.linux-x86_64/egg/retrying.py", line 200, in call
  File "build/bdist.linux-x86_64/egg/retrying.py", line 49, in wrapped_f
  File "build/bdist.linux-x86_64/egg/retrying.py", line 206, in call
  File "build/bdist.linux-x86_64/egg/retrying.py", line 247, in get
  File "build/bdist.linux-x86_64/egg/retrying.py", line 200, in call
  File "/home/jamesp/dsub/dsub_libs/local/lib/python2.7/site-packages/dsub-0.3.1-py2.7.egg/dsub/providers/google_base.py", line 593, in execute
    raise exception
socket.error: [Errno 99] Cannot assign requested address

dsub is at version 0.3.1:

$ dsub --version
dsub version: 0.3.1
mbookman commented 5 years ago

Thanks for reporting, @carbocation.

We could add error 99 to the list of TRANSIENT_SOCKET_ERROR_CODES that are retried, although I'm at a loss to come up with a scenario in which this error would be generated transiently.
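For illustration, here is a minimal sketch of the "retry on transient socket errors" idea with errno 99 (EADDRNOTAVAIL) included. This is not dsub's actual implementation (which, per the traceback, wraps `google_base.Api.execute` with the `retrying` library); the set contents, function name, and retry parameters below are assumptions for the sketch.

```python
import errno
import time

# Hypothetical set modeled on the TRANSIENT_SOCKET_ERROR_CODES mentioned
# above; the actual contents in dsub may differ. On Linux,
# errno.EADDRNOTAVAIL is errno 99 ("Cannot assign requested address").
TRANSIENT_SOCKET_ERROR_CODES = {
    errno.ECONNRESET,
    errno.ETIMEDOUT,
    errno.EADDRNOTAVAIL,
}

def retry_transient(func, max_attempts=3, delay=0.1):
    """Call func, retrying when it raises a socket error whose errno is
    listed as transient; re-raise any other error immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except OSError as e:  # socket.error is an alias of OSError in Python 3
            if e.errno in TRANSIENT_SOCKET_ERROR_CODES and attempt < max_attempts:
                time.sleep(delay)
                continue
            raise

# Demo: a call that fails twice with errno 99, then succeeds.
calls = {"n": 0}
def flaky_api_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError(errno.EADDRNOTAVAIL, "Cannot assign requested address")
    return "response"

result = retry_transient(flaky_api_call)
```

The open question in this issue is exactly the one a retry list can't answer on its own: whether errno 99 here was genuinely transient (e.g. local ephemeral-port exhaustion while polling 50,000 tasks) or a persistent condition that retries would only mask.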

By any chance, does the time of the error (2019-05-06 02:17:10) coincide with the job finishing or anything else you can point to?

Thanks!

carbocation commented 5 years ago

The final jobs were launched at 2019-05-06 02:00:00 according to the pipelines tab, so 2:17 is pretty close to the right time to finish. These jobs each take < 60 seconds in the typical case. (I wasn't actively monitoring at that time.)

I actually now see that four of the jobs launched between 01:59:00 and 02:00:00 failed. (I was wrong in my initial review, when I thought that all jobs succeeded.)

Thanks to the tagging and logging that dsub does, I was able to look up the logs for these specific jobs. Interestingly, none of those four jobs has any log file whatsoever; I don't know if that absence of log information is of any use. I can also tell you that there were no VMs left up and running by the time of this crash.