DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

Lose network connectivity in docker container #116

Closed: jlhg closed this issue 6 years ago

jlhg commented 6 years ago

Hello,

I'd like to report an issue. Here are the details:

  1. I am using dsub to create a Runner VM.
  2. I run a custom Ruby script inside a container on this Runner VM.
  3. This custom script uses dsub to create the actual execution pipeline.
  4. After approximately 20 hours, containers on the Runner VM lose network connectivity (no ping to internal or external IPs, no DNS name resolution), and the whole pipeline fails as a result.

Below is the error log:

Exception ServerNotFoundError: Unable to find the server at www.googleapis.com
Traceback (most recent call last):
  File "/usr/local/bin/dsub", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/dsub/commands/dsub.py", line 820, in main
    dsub_main(prog, argv)
  File "/usr/local/lib/python2.7/dist-packages/dsub/commands/dsub.py", line 809, in dsub_main
    launched_job = run_main(args)
  File "/usr/local/lib/python2.7/dist-packages/dsub/commands/dsub.py", line 874, in run_main
    provider_base.get_provider(args, resources),
  File "/usr/local/lib/python2.7/dist-packages/dsub/providers/provider_base.py", line 41, in get_provider
    getattr(args, 'dry_run', False), args.project)
  File "/usr/local/lib/python2.7/dist-packages/dsub/providers/google.py", line 620, in __init__
    credentials)
  File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 206, in call
    return attempt.get(self._wrap_exception)
  File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/local/lib/python2.7/dist-packages/dsub/providers/google_base.py", line 481, in setup_service
    api_name, api_version, credentials=credentials)
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery.py", line 222, in build
    requested_url, discovery_http, cache_discovery, cache)
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery.py", line 269, in _retrieve_discovery_doc
    resp, content = http.request(actual_url)
  File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1694, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1434, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1360, in _conn_request
    raise ServerNotFoundError("Unable to find the server at %s" % conn.host)
httplib2.ServerNotFoundError: Unable to find the server at www.googleapis.com
#<Thread:0x0000557635540a78@/usr/local/bundle/gems/happy_runner-0.1.0/lib/happy_runner/abstract/pipeline.rb:45 run> terminated with exception (report_on_exception is true):
/usr/local/bundle/gems/happy_runner-0.1.0/lib/happy_runner/abstract/step.rb:103:in `run': unhandled exception
    from /usr/local/bundle/gems/happy_runner-0.1.0/lib/happy_runner/abstract/pipeline.rb:45:in `block (2 levels) in run'
/usr/local/bundle/gems/happy_runner-0.1.0/lib/happy_runner/abstract/step.rb:103:in `run': unhandled exception
    from /usr/local/bundle/gems/happy_runner-0.1.0/lib/happy_runner/abstract/pipeline.rb:45:in `block (2 levels) in run'

When I log in to the Runner VM and run the command sudo sysctl -w net.ipv4.ip_forward=1, the network connectivity problem inside the Docker container is resolved.

The value of net.ipv4.ip_forward on a newly created Runner VM is 1. It's strange that the value changes to 0 after the VM has been running for a long time.
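
The manual workaround above could be automated with a small check that re-enables forwarding when it has been switched off. The following is only a minimal sketch, not a fix for the root cause: the function names are hypothetical, it assumes a Linux host that exposes the setting at /proc/sys/net/ipv4/ip_forward (the procfs equivalent of the sysctl key), and it must run as root on the Runner VM itself, not inside the container.

```python
from pathlib import Path

# Standard Linux procfs location backing the net.ipv4.ip_forward sysctl key.
IP_FORWARD_PATH = Path("/proc/sys/net/ipv4/ip_forward")


def parse_ip_forward(text):
    """Return True if the procfs value means IPv4 forwarding is enabled."""
    return text.strip() == "1"


def ip_forward_enabled(path=IP_FORWARD_PATH):
    """Read the forwarding flag; the path is overridable for testing."""
    return parse_ip_forward(Path(path).read_text())


def ensure_ip_forward(path=IP_FORWARD_PATH):
    """Re-enable forwarding if it is off.

    Returns True if the flag had to be rewritten, False if it was already on.
    Writing "1" here has the same effect as `sysctl -w net.ipv4.ip_forward=1`
    and requires root privileges on a real host.
    """
    if ip_forward_enabled(path):
        return False
    Path(path).write_text("1\n")
    return True


if __name__ == "__main__":
    if ensure_ip_forward():
        print("net.ipv4.ip_forward was 0; reset to 1")
    else:
        print("net.ipv4.ip_forward is already 1")
```

Run periodically (for example from a root cron job), this would only mask the symptom; it does not explain why the value is being reset in the first place.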

mbookman commented 6 years ago

Hi Jian-Long,

The problem appears to be with Google Compute Engine or something in the way that the Pipelines API v1alpha2 Docker setup interacts with GCE.

The problem has been reported on this thread:

https://groups.google.com/forum/#!topic/google-genomics-discuss/yXnSHobOYmE

I would suggest following up there. You may have some additional information that helps debug the problem.

That said, to unblock yourself in the meantime, you can use the "google-v2" provider for dsub. We just recently announced it here. It provides support for the Pipelines API v2alpha1, which is where future enhancements will be added.

mbookman commented 6 years ago

This issue has been fixed in the Google Genomics v1alpha2 API.