coiled / feedback

A place to provide Coiled feedback

Getting started #169

Closed tomgallagher closed 2 years ago

tomgallagher commented 2 years ago

I'm just trying to get started with your coiled / streamlit example

https://github.com/coiled/coiled-resources/blob/main/streamlit-with-coiled/coiled-streamlit-deploy.py

I'm almost there but I need help with software environments.

This code

cluster = coiled.Cluster(
  n_workers=10,
  name='streamlit-deployed',
  software="coiled-examples/streamlit",
  shutdown_on_close=False, 
)

Refers to a software environment that the example does not show how to create :)

I'm getting this error:

ServerError: Unable to access base docker image '077742499581.dkr.ecr.us-east-1.amazonaws.com/prod/coiled-examples-streamlit:546c2819-681e-47a4-b488-af08970537bb'. Ensure you have a properly configured Container Registry, or get in touch if you need help or think this is a bug. time="2022-06-16T09:00:29Z" level=fatal msg="Error parsing image name \"docker://077742499581.dkr.ecr.us-east-1.amazonaws.com/prod/coiled-examples-streamlit:546c2819-681e-47a4-b488-af08970537bb\": Error reading manifest 546c2819-681e-47a4-b488-af08970537bb in 077742499581.dkr.ecr.us-east-1.amazonaws.com/prod/coiled-examples-streamlit: denied: User: arn:aws:iam::861514179139:user/streamlit_coiled is not authorized to perform: ecr:BatchGetImage on resource: arn:aws:ecr:us-east-1:077742499581:repository/prod/coiled-examples-streamlit because no resource-based policy allows the ecr:BatchGetImage action"

How can I create a new software environment with dependencies that match the requirements of the example?

Really I just want a set of compatible instructions between this

coiled.create_software_environment(
    name='datafile-comparison-v1',
    pip=["dask[complete]", "xarray==0.15.1", "numba"],
)

And your example

Thanks

mrocklin commented 2 years ago

Thank you @tomgallagher for the excellently worded issue.

cc @scharlottej13 . Maybe it makes sense to include the software environment creation step in the documentation. This will slow things down a bit, but it also empowers the user a bit to change things themselves in the future? No strong preference.

ncclementi commented 2 years ago

@tomgallagher Thank you for reporting this. We are investigating what happened with the original software environment and why it's not working. In the meantime, if you look in the same directory as the script you linked, you will see a file, streamlit.yml (https://github.com/coiled/coiled-resources/blob/main/streamlit-with-coiled/streamlit.yml), that contains the dependencies. You can re-create the exact same environment by doing

coiled.create_software_environment(
    name="your-conda-env-name",
    conda="streamlit.yml",
)

If you want to use pip only, there is also a requirements.txt in the repository (https://github.com/coiled/coiled-resources/blob/main/streamlit-with-coiled/requirements.txt), and you would do:

coiled.create_software_environment(
    name="your-pip-env-name",
    pip="requirements.txt",
)

If you want to make this environment compatible with other libraries or update to more recent versions of dask, I would suggest you copy the yaml/txt file, modify it, and update the dependencies you want.
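As a rough illustration of "copy the file and update the dependencies", here is a minimal sketch that rewrites selected lines of a pip requirements list. The helper name `override_pins` and the package versions in the example are hypothetical (they are not the actual contents of the repository's requirements.txt), and the name matching is deliberately simple: it does not normalize `-` versus `_` in package names.

```python
import re

# Matches the package name at the start of a requirement line.
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def override_pins(lines, overrides):
    """Return requirement lines with selected packages replaced.

    `overrides` maps a package name to the full requirement string to use.
    Packages that are not already listed are appended at the end.
    """
    wanted = {name.lower(): spec for name, spec in overrides.items()}
    result = []
    for line in lines:
        stripped = line.strip()
        match = _NAME.match(stripped)
        key = match.group(1).lower() if match else None
        if key in wanted:
            # Replace the whole line with the requested requirement string.
            result.append(wanted.pop(key))
        else:
            # Comments, blank lines and untouched packages pass through.
            result.append(stripped)
    result.extend(wanted.values())
    return result


# Hypothetical file contents, for illustration only.
original = ["dask==2022.5.0", "streamlit==1.9.0", "# plotting", "folium"]
updated = override_pins(original, {"dask": "dask[complete]", "numba": "numba"})
print(updated)
```

The resulting lines could then be written to a new file (say, my-requirements.txt, a made-up name) and passed as pip="my-requirements.txt" to coiled.create_software_environment, in the same way as the calls above.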

Let us know if you have any questions, we are here to help.

tomgallagher commented 2 years ago

Hey thanks for getting back so quickly

I've made some progress but now have this message in my logs

2022-06-16 17:04:48.359 Using existing cluster: 'coiled-streamlit (id: 34931)'
2022-06-16 17:04:48.774 Creating Cluster (name: coiled-streamlit, https://cloud.coiled.io/tomgallagher/clusters/34931/details ). This might take a few minutes...
2022-06-16 17:04:50.135 Scheduler: ready Workers: 10 ready (of 10)
2022-06-16 17:04:50.135 Scheduler: ready Workers: 10 ready (of 10)
2022-06-16 17:04:51.582 error sending AWS credentials to cluster: Could not connect to the endpoint URL: "http://169.254.169.254/latest/api/token"
2022-06-16 17:04:53.831 Uncaught app exception

Do you know what I'm doing wrong here?

hayesgb commented 2 years ago

cc: @ntabris

ntabris commented 2 years ago

Do you know what I'm doing wrong here?

Maybe nothing.

The error sending AWS credentials to cluster message isn't fatal. It means the client wasn't able to create an STS token to send to the cluster for accessing (e.g.) S3 or other data sources that might use a token for AWS authentication. That could be a problem, and there are ways to deal with it, but by itself it wouldn't prevent your cluster from working (though it may cause downstream errors when your cluster tries to, e.g., read from S3).

Do you know if the cluster is otherwise up and working? Or what if anything happened when you tried to run something on it?

I do see the Uncaught app exception in the logs you shared, but don't know where that's coming from... I don't think our coiled client emits that but I could be wrong.

I also see that the cluster was running for about 7 minutes, from 2022-06-16T16:03:19UTC to 2022-06-16T16:10:04UTC.

ncclementi commented 2 years ago

@tomgallagher Are you running the code exactly as provided in the example? The NYC public data was recently modified, and this line won't work anymore. I wonder if that is the problem.

https://github.com/coiled/coiled-resources/blob/287c79cd4b72155f3e649f69a097c471908cd6a3/streamlit-with-coiled/coiled-streamlit-deploy.py#L66

If that is the case, can you try replacing that line with

"s3://nyc-tlc/csv_backup/yellow_tripdata_2015-*.csv"

tomgallagher commented 2 years ago

Perfect! I'm in business. Thanks very much. Just wanted to get the exact copy working before I moved on.

FYI still getting the error

2022-06-16 20:23:03.503 error sending AWS credentials to cluster: Could not connect to the endpoint URL: "http://169.254.169.254/latest/api/token"

ncclementi commented 2 years ago

@tomgallagher Glad to hear things are moving. For the error that you are still seeing, it would be useful to know which line of code triggers it, and whether it could be related to streamlit. I see this line in the deploy code; I'm not sure how it works, but it could have something to do with it.

https://github.com/coiled/coiled-resources/blob/287c79cd4b72155f3e649f69a097c471908cd6a3/streamlit-with-coiled/coiled-streamlit-deploy.py#L42

@rrpelgrim you wrote this blog post. Have you ever seen the error reported in the comment above?

ntabris commented 2 years ago

@ncclementi no, this isn't about the Coiled token. It's having a problem getting the STS token from the Amazon instance metadata service (since the coiled client is presumably running on an EC2 instance or something like that, not running locally). I'm not sure why it's having a problem reaching the Amazon instance metadata service; there are various possibilities.
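The URL in the error message, http://169.254.169.254/latest/api/token, is the IMDSv2 token endpoint of the EC2 instance metadata service. A minimal sketch for checking whether that endpoint is reachable from the machine where the client runs might look like the following. The function name `can_reach_imds` is made up for illustration, and this only tests reachability; it is not how the coiled client itself obtains credentials.

```python
import http.client
import urllib.request

# IMDSv2 token endpoint, as it appears in the error message.
IMDS_TOKEN_URL = "http://169.254.169.254/latest/api/token"


def can_reach_imds(url=IMDS_TOKEN_URL, timeout=1.0):
    """Return True if an IMDSv2 session token can be requested from `url`."""
    request = urllib.request.Request(
        url,
        method="PUT",
        # IMDSv2 requires a TTL header on the token request.
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, unreachable network, timeout, or an HTTP error.
        return False
```

Calling can_reach_imds() from the environment that hosts the Streamlit app should return False whenever the "Could not connect to the endpoint URL" message would appear, which at least confirms where the problem sits.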

tomgallagher commented 2 years ago

One last question and then I'll leave you alone :) Promise

In the code example, you have this line

if st.button('Shutdown Cluster'):
    with st.spinner("Shutting down your cluster..."):
        client.shutdown()

I've been experimenting with the client.shutdown command and, while it may stop the dask.distributed client, the command does not seem to be passed back up to the coiled cluster. Should I be passing an argument or should I also be calling coiled.delete_cluster(name="my-cluster") ?

ncclementi commented 2 years ago

@tomgallagher I'm not able to reproduce the problem. When I run the code below, my cluster closes gracefully. If you run this, does your cluster keep running?

from coiled import Cluster
from dask.distributed import Client

cluster = Cluster(name="test_shutdown")
client = Client(cluster)

client.shutdown()

Can you give us more context or a minimal reproducible example? When you say "the command does not seem to be passed back up to the coiled cluster", what exactly do you see?

tomgallagher commented 2 years ago

Hi

The client reports as closed, but on the dashboard the cluster is not stopped. For the time being, I'm closing the client and then also calling coiled.delete_cluster(name="my-cluster"), which seems to work fine.
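A minimal sketch of that workaround, written so it can be run without a real cluster: the helper `shutdown_and_delete` and the `FakeClient` stand-in are hypothetical names for illustration, and the delete function is passed in as an argument rather than imported.

```python
def shutdown_and_delete(client, cluster_name, delete_cluster):
    """Shut down the Dask client, then always ask for the cluster to be deleted."""
    try:
        client.shutdown()
    finally:
        # Runs even if client.shutdown() raises, so the cluster is not left behind.
        delete_cluster(name=cluster_name)


class FakeClient:
    """Stand-in for dask.distributed.Client so the sketch runs without a cluster."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


deleted = []


def fake_delete_cluster(name):
    # Stand-in for coiled.delete_cluster; just records the requested name.
    deleted.append(name)


client = FakeClient()
shutdown_and_delete(client, "my-cluster", fake_delete_cluster)
print(client.closed, deleted)
```

In the Streamlit app, the real call would presumably be shutdown_and_delete(client, "my-cluster", coiled.delete_cluster) inside the 'Shutdown Cluster' button handler.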

ntabris commented 2 years ago

The client reports as closed but on the dashboard the cluster is not stopped.

Stopping dask should result in Coiled detecting this and cleaning up the cluster infrastructure, but it's faster and more reliable if you tell Coiled to stop the cluster, like you're doing now.

In case it's helpful, another way to do this is...

with coiled.Cluster(...) as cluster:
  client = Client(cluster)
  ...  # your code

# context manager exit handles shutting down the cluster

dchudz commented 2 years ago

should result in Coiled detecting this and cleaning up the cluster infrastructure, but it's faster and more reliable if you tell Coiled to stop the cluster

The weird thing is client.shutdown should actually be telling Coiled to stop the cluster, via cluster.close (the client has a reference to the cluster).

Tom has a workaround and we're not able to reproduce, so it's not crucial to debug I think.