Thanks @eric-czech for testing this out. Yes, we did test with firewall rules opening tcp:8888,8787,22,8786. We were testing from outside of GCP and needed to ping the scheduler externally. Currently, we capture both the internal and external IPs, then use the external address to assess whether the scheduler is up. This could be made configurable. You are connecting to the scheduler from within GCP, correct? That is, your client can ping the internal IP address?
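For illustration, the kind of liveness check being described might look roughly like this (a sketch only -- not the actual dask-cloudprovider code, and the port is assumed to be the default 8786):

import socket

def scheduler_is_up(address, port=8786, timeout=5.0):
    """Return True if a TCP connection to the scheduler port succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. poll scheduler_is_up(external_ip) in a loop until the VM finishes booting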
Yep, I can reach the scheduler on the internal IP and I'm connecting from GCP.
Ultimately, I need a cluster with no public ingress (ideally no external IPs either) since it will operate on sensitive data.
It would be great if use of internal IPs was configurable. Without that, it's difficult to create transient clients and clusters in the same infrastructure. Is that a difficult addition?
Creating, scaling up/down and deleting clusters works for me now with https://github.com/dask/dask-cloudprovider/pull/164. Thanks again @quasiben!
Are you also testing adaptive scaling ?
Are you also testing adaptive scaling ?
I tried that too and it's working nicely -- especially after setting the interval/wait_count parameters to avoid spinning up VMs for each job that's run.
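For reference, a minimal sketch of the adaptive settings being described (parameter values are illustrative, and the import path follows the one used elsewhere in this thread):

from dask_cloudprovider.gcp.instances import GCPCluster

cluster = GCPCluster(n_workers=0)
# interval controls how often the adaptive logic re-evaluates the cluster size;
# wait_count is how many consecutive scale-down recommendations are required
# before a worker is actually removed.
cluster.adapt(minimum=0, maximum=20, interval="30s", wait_count=10)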
On a somewhat related note, is there a way to "reconnect" a VMCluster if the process it was created in dies?
@eric-czech if the scheduler doesn't die when the process exits, then yes, as long as you have the scheduler address written down. I think the auto_shutdown flag should be set to False in this case. @jacobtomlinson any additional thoughts here?
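As a rough sketch of what that would look like (assuming the auto_shutdown flag behaves as described):

from dask.distributed import Client
from dask_cloudprovider.gcp.instances import GCPCluster

# Ask the cluster not to tear itself down when this process exits.
cluster = GCPCluster(auto_shutdown=False)
address = cluster.scheduler_address
print("scheduler:", address)  # write this down for later sessions

# Later, from a different process, reconnect using the recorded address:
client = Client(address)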
Disconnecting and reconnecting isn't a use case that is well supported today. All of the cluster managers in this package are intended to clean up any hanging resources if the main process dies to avoid accidental costs. I think only the EC2Cluster does this well today.
Could you share more about why the main process is being killed? And where it is running from?
Disconnecting and reconnecting isn't a use case that is well supported today. Could you share more about why the main process is being killed? And where it is running from?
I see. The most compelling argument I have for that use case is in pipeline steps/scripts that would create dask clients in a deployment-agnostic manner, most likely through the scheduler address in a config file or env file. I am imagining that I can simply have hooks in my code that create and destroy a cluster for every script (or fairly large unit of work), though that seems less ideal to me than having it be possible to provide a scheduler address to the script.
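A rough sketch of that deployment-agnostic pattern (the environment variable name here is just an example):

import os
from dask.distributed import Client

# The script only knows the scheduler address, supplied via configuration
# rather than by creating the cluster inline.
client = Client(os.environ["DASK_SCHEDULER_ADDRESS"])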
An alternative I'm considering is having a separate script like this that I run interactively when I want a cluster for a while:
from threading import Event

from dask_cloudprovider.gcp.instances import GCPCluster

cluster = GCPCluster(...)
# print scheduler address info to use independently
print(cluster.scheduler_address)
try:
    Event().wait()  # block until interrupted
except KeyboardInterrupt:
    cluster.close()
Is this a type of usage that has been requested before? To phrase it simply, I'm looking for a way to create a cluster w/o needing to embed that creation in my code. I realize there are merits to both approaches, and I'm mostly probing at what the design goals were with Cloud Provider.
This is really useful feedback thanks!
You may find this blog post of interest as it discusses the state of Dask cluster deployments and was written earlier this year. Particularly the section on ephemeral vs fixed clusters.
One large assumption in dask-cloudprovider is that it provides ephemeral clusters. Today we do not have native cloud fixed cluster options. For GCP a fixed cluster setup may look like a cloud deployment manager template.
Comparing that with Kubernetes we have dask-kubernetes for ephemeral deployments and the Dask helm-chart for fixed deployments.
I am imagining that I can simply have hooks in my code that create and destroy a cluster for every script (or fairly large unit of work)
This is typically the intended use case for ephemeral Dask clusters.
though that seems less ideal to me than having it be possible to provide a scheduler address to the script
I would be interested to know why you see this as less ideal? The primary reasons I can think of are that cluster startup is too slow, or that your scripts are small, fast and numerous.
Is this a type of usage that has been requested before?
The type of usage you are referring to is a fixed cluster. There is demand for fixed clusters in the Dask ecosystem, but less so than ephemeral clusters. Typically because fixed clusters can be wasteful in terms of money or credits. I see more fixed clusters running on in-house hardware where the cost is fixed up front.
The most compelling argument I have for that use case is in pipeline steps/scripts that would create dask clients in a deployment-agnostic manner
One concern with this approach is that you may end up with multiple clients sharing a single cluster. This is not recommended. Dask does not differentiate between clients or have any concept of queueing or fair usage. There should always be a one-to-one relationship between clusters and clients.
One large assumption in dask-cloudprovider is that it provides ephemeral clusters. Today we do not have native cloud fixed cluster options. For GCP a fixed cluster setup may look like a cloud deployment manager template. The type of usage you are referring to is a fixed cluster. There is demand for fixed clusters in the Dask ecosystem, but less so than ephemeral clusters. Typically because fixed clusters can be wasteful in terms of money or credits. I see more fixed clusters running on in-house hardware where the cost is fixed up front.
Thinking about those two points a bit and looking at https://blog.dask.org/2020/07/23/current-state-of-distributed-dask-clusters#cluster-types makes me a bit confused because the ephemeral cluster section mentions:
A basic way of doing this would be to create a bash script which calls ssh and sets up the cluster. You would run this script in the background while performing your work and then kill it once you are done. We will cover other implementations of this in the coming sections.
That makes sense to me, and it's what I meant with that little script that would run in the background and kill the cluster once I interrupt it manually or SSH times out the connection. I'm definitely after an ephemeral cluster. I do think though that it is reasonable to decouple the management of an ephemeral cluster from the management of an ephemeral process.
I would be interested to know why you see this as less ideal? The primary reasons I can think of are that cluster startup is too slow, or that your scripts are small, fast and numerous.
I haven't found a great way to work on either scripts or notebooks that doesn't result in me frequently rerunning them from scratch. Keeping notebooks open with too many non-sequential edits certainly becomes problematic, so in general I'd prefer an ephemeral cluster that is not tied to my development process, even if just to work on a single script. I think that's a common experience, and the extra few minutes it would take to restart a kernel and keep iterating would be a problem if the cluster creation was always inline with the code.
Overall, I think having the code that creates the cluster alongside the rest of the code makes sense once it's all written, but it's a difficult model to work under while writing it. Maybe I'm missing something though -- perhaps other features/possibilities?
You could create the cluster in a separate file with a little bit of formalism around passing the scheduler address. Would something like the following work for you?
# start-cluster.py
import json
from time import sleep

from dask_cloudprovider.gcp.instances import GCPCluster


def main():
    cluster = GCPCluster(...)
    with open('sched.json', 'w') as f:
        json.dump(cluster.scheduler.identity(), f, indent=2)
    print("Cluster created: ", cluster)
    while True:
        sleep(1)  # keep the process (and the cluster) alive


if __name__ == "__main__":
    main()
# client-script.py
from dask.distributed import Client
client = Client(scheduler_file="sched.json")
...
I'm definitely after an ephemeral cluster. I do think though that it is reasonable to decouple the management of an ephemeral cluster from the management of an ephemeral process.
Sure. There is nothing stopping you having two Python scripts, one which sets up the cluster and one (or more) which makes use of it.
I haven't found a great way to work on either scripts or notebooks that doesn't result in me frequently rerunning them from scratch.
The Dask Jupyter Lab extension is commonly used for this. Cluster creation is handled in the Jupyter side bar.
You could create the cluster in a separate file with a little bit of formalism around passing the scheduler address. Would something like the following work for you?
Yep @quasiben, that's pretty much what I had in mind. Ultimately I used this, which creates a REPL using Fire that makes it easier to attach docs to the available commands (if I had added any). I was exporting the scheduler info to a file and then loading it as environment variables for other processes.
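For anyone curious, a skeletal version of that kind of control script might look like the following (file name, function names, and the scheduler-info format are mine, not the actual script):

# cluster_ctl.py
import json
from threading import Event

import fire
from dask_cloudprovider.gcp.instances import GCPCluster


def up(n_workers=1, info_path="scheduler.json"):
    """Create a cluster, export its scheduler address, and block until Ctrl-C."""
    cluster = GCPCluster(n_workers=n_workers)
    with open(info_path, "w") as f:
        json.dump({"address": cluster.scheduler_address}, f, indent=2)
    print("Cluster up:", cluster.scheduler_address)
    try:
        Event().wait()  # keep the process (and with it the cluster) alive
    except KeyboardInterrupt:
        cluster.close()


if __name__ == "__main__":
    fire.Fire()  # exposes `up` as a command, e.g. `python cluster_ctl.py up --n_workers 2`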
I haven't found a great way to work on either scripts or notebooks that doesn't result in me frequently rerunning them from scratch.
The Dask Jupyter Lab extension is commonly used for this. Cluster creation is handled in the Jupyter side bar.
👍 I haven't tried the extension in a while, but I'll see how well it works with the Cloud Provider classes. Are there any instructions on how to configure that? It seems a little tricky given that the Cloud Provider cluster types have their own external configuration sources like env vars and cloudprovider.yaml.
I'm still not sure this solves my problem when working on scripts within the context of an orchestration system like Prefect though. More specifically, I'm using snakemake to build dask script components that are fairly small in scope, and these pipelines are difficult to work on with Jupyter in the loop. Independent operation of Cloud Provider is convenient for that.
At a high level I'm arguing for a control plane for Cloud Provider like the script @quasiben mentioned or what I have been using. Feel free to triage this elsewhere @jacobtomlinson or let it go if it doesn't seem important
Are there any instructions on how to configure that?
You can find configuration for the lab extension here. All of the cloudprovider cluster managers should work, but I would be happy to resolve any bugs that arise.
I'm still not sure this solves my problem when working on scripts within the context of an orchestration system like Prefect though.
Absolutely. I guess this conversation is highlighting our bias towards interactive Jupyter driven workloads, and we need to consider pipeline style systems more.
I'm not familiar enough with Prefect to help here though.
At a high level I'm arguing for a control plane for Cloud Provider
The script that @quasiben mentioned is one way I would expect folks to use things here. That code snippet should work today.
Could you expand on what changes you would like to see?
Feel free to triage this elsewhere @jacobtomlinson or let it go if it doesn't seem important
Triaging may be a good idea. I'm really appreciating this feedback so it's definitely important. It's good to hear how folks are using things and how we can improve it.
The script that @quasiben mentioned is one way I would expect folks to use things here. That code snippet should work today.
+1 to a CLI in dask cloud provider that provides sessions for clusters. Perhaps that much is worth a separate issue and I suspect it won't be worth working on until more users ask for it. Until then, it's easy enough for me to make my own scripts or use @quasiben's suggestions.
To riff a little bit more on this: I do ultimately think CP becomes more attractive to the life sciences crowd when the cluster management is decoupled from the code (i.e. is more pipeline friendly) .. OR all infrastructure management is entirely specified by code (e.g. pulumi, terraform). Anything in between is a bit awkward IMO, since something like creating a cluster feels like it should live in an orchestration and/or resource management system so that it can be integrated with a global awareness of infrastructure budgets/resources.
Thanks for the help from both of you on this!
One last thought re: ephemeral clusters tied to a single process -- debugging them is impossible when useful information is in the dask worker logs after a failure.
One last thought re: ephemeral clusters tied to a single process -- debugging them is impossible when useful information is in the dask worker logs after a failure.
This is a good point. It would perhaps be useful to print cluster.get_logs() in the event of a failure before the process exits, but that would only work for certain failure modes.
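A minimal sketch of that idea, assuming the standard cluster.get_logs() API and a hypothetical run_pipeline() unit of work:

from dask.distributed import Client
from dask_cloudprovider.gcp.instances import GCPCluster

cluster = GCPCluster()
client = Client(cluster)
try:
    run_pipeline(client)  # hypothetical unit of work
except Exception:
    # Dump scheduler/worker logs before the ephemeral cluster disappears.
    for name, log in cluster.get_logs().items():
        print(f"===== {name} =====\n{log}")
    raise
finally:
    client.close()
    cluster.close()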
Very excited to see https://github.com/dask/dask-cloudprovider/pull/131 in! I gave this a try again today @quasiben but had an issue. I ran this:
The scheduler connection never occurred despite this being run from a VM in GCP. That would make sense if it's trying to connect with the external IP, and I found that if I try to connect directly that way (via dask.distributed.Client), that also doesn't work. I have no firewall rules set up to allow ingress to 8786 and I'd rather not add them. Is there a way to have GCPCluster or its parent classes use internal IPs instead? I would much prefer that. If I try to connect using the internal IP in this case (again via dask.distributed.Client), everything is fine, so I know the VM is up and running correctly. How did you get around this in your testing? Did you configure firewall rule exceptions for the Dask scheduler port, or perhaps already have them in place?
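For reference, the direct connection described above is just the following (the internal IP is a placeholder):

from dask.distributed import Client

# From inside GCP, connecting straight to the scheduler's internal IP works,
# even though GCPCluster itself waits on the external address.
client = Client("tcp://10.128.0.5:8786")  # placeholder internal IP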