DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

--use-private-address causes job to hang #146

Closed letercarr closed 4 years ago

letercarr commented 5 years ago

When I add --use-private-address to a simple job (one that usually takes a few minutes), the job appears to submit normally but then hangs.

Using google-v2 as the provider, dsub version 0.2.1.

mbookman commented 5 years ago

Thanks for the report @letercarr !

It may be that some additional configuration is required in your GCP project.

Looking at the Pipelines v2 documentation for the parameter:

If set to true, do not attach a public IP address to the VM. Note that without a public IP address, additional configuration is required to allow the VM to access Google services.

See https://cloud.google.com/vpc/docs/configure-private-google-access for more information.

Have you used private addresses for other VMs?

To get more detail about where the dsub job is failing, check the dstat --full output. The first thing to look at is the events list; there may be something revealing there. Next, grab the internal-id and check the output of:

gcloud alpha genomics operations describe <id>

There may be additional event details there.
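To make those steps concrete, the debugging workflow might look like the following sketch (YOUR-PROJECT and JOB-ID are placeholders for your actual project and dsub job id):

```shell
# Show full job details, including the events list and the internal-id
dstat --provider google-v2 --project YOUR-PROJECT --jobs 'JOB-ID' --full

# Then inspect the underlying Pipelines operation using the internal-id
# reported by dstat
gcloud alpha genomics operations describe <internal-id>
```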

mbookman commented 5 years ago

Hi @letercarr,

We have tested this out by following the GCE docs:

https://cloud.google.com/vpc/docs/configure-private-google-access

and our dsub jobs ran successfully, though with one caveat:

When your VM has no public IP address it can't reach Docker Hub, so any Docker images used for those tasks need to be in Google Container Registry. If you are using images from Docker Hub, push a copy to your Cloud project's container registry and then update your --image to use the new gcr.io path.

For example:

$ docker pull python:2.7-slim
2.7-slim: Pulling from library/python
...
35944cd3271f: Pull complete 
Digest: sha256:a17cb64cdd52190f9fe6c13680ccb7801b2abcb7a2cefbc936004550590e992f
Status: Downloaded newer image for python:2.7-slim

$ docker tag python:2.7-slim gcr.io/YOUR-PROJECT/python:2.7-slim

$ docker push gcr.io/YOUR-PROJECT/python:2.7-slim
The push refers to repository [gcr.io/YOUR-PROJECT/python]
...
5dacd731af1b: Layer already exists 
2.7-slim: digest: sha256:a17cb64cdd52190f9fe6c13680ccb7801b2abcb7a2cefbc936004550590e992f size: 1163

Then use --image gcr.io/YOUR-PROJECT/python:2.7-slim in your dsub command-line.
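Putting it together, a sketch of the resulting dsub invocation (project, bucket, and zones are placeholders; adjust for your setup):

```shell
dsub \
  --provider google-v2 \
  --project YOUR-PROJECT \
  --zones "us-central1-*" \
  --logging gs://YOUR-BUCKET/logs/ \
  --use-private-address \
  --image gcr.io/YOUR-PROJECT/python:2.7-slim \
  --command 'python --version' \
  --wait
```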

carbocation commented 5 years ago

I can also confirm that running dsub with private buckets hangs if your perms aren't configured correctly, and runs once they are. (Just worked through this over the weekend.)

carbocation commented 5 years ago

Specifically (since I just ran into this again and hadn't documented it well), my issue was as follows:

I am using dsub with a Docker image stored on gcr.io. I am using private IPs only.

If my project isn't configured so that my "VPC Networks" have "Private Google Access", then the container will never be fetched. In that case, the GCP instance sits there idle forever, giving no warning that the Docker image could not be fetched.
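For anyone hitting the same thing: Private Google Access is enabled per subnet. Assuming the default network's subnet in us-central1 (substitute your own subnet name and region), something like:

```shell
# Enable Private Google Access on the subnet the dsub VMs use
gcloud compute networks subnets update default \
  --region us-central1 \
  --enable-private-ip-google-access

# Verify the setting took effect
gcloud compute networks subnets describe default \
  --region us-central1 \
  --format 'get(privateIpGoogleAccess)'
```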

mbookman commented 4 years ago

The documentation now includes a section on configuring dsub VMs to have no public IP address:

https://github.com/DataBiosphere/dsub/blob/master/docs/compute_resources.md#public-ip-addresses

That section includes the following guidance:

It is highly recommended that you test your job carefully, checking dstat ... --full events and your --logging files to ensure that your job makes progress and runs to completion. A misconfigured job can hang indefinitely or until the infrastructure terminates the task. The Google providers' default --timeout is 7 days.
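One way to test carefully, as the docs suggest: set a short --timeout on the first private-address run so a misconfigured job fails fast instead of hanging for the 7-day default (project, bucket, and image paths below are placeholders):

```shell
# First test run: fail within 30 minutes rather than hanging for 7 days
dsub \
  --provider google-v2 \
  --project YOUR-PROJECT \
  --logging gs://YOUR-BUCKET/logs/ \
  --use-private-address \
  --timeout '30m' \
  --image gcr.io/YOUR-PROJECT/python:2.7-slim \
  --command 'echo ok' \
  --wait
```

If the job times out, check the dstat --full events and the --logging files to see how far it got before stalling.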