coiled / feedback

A place to provide Coiled feedback

This computation crashed during my live demo. How can I avoid this in the future? #296

Closed rsignell closed 2 months ago

rsignell commented 2 months ago

I ran this successfully before my live demo but it crashed during the actual demo: https://cloud.coiled.io/clusters/570914/account/esip-lab/information?organization=esip-lab&tab=Logs&filterPattern=

I restarted the notebook/kernel and ran again and it worked, but not sure what went wrong.
I would obviously like to figure out how to avoid this in the future (hopefully through different cluster settings).

dchudz commented 2 months ago

Thanks Rich! Taking a look. It might help if you can say more about what the experience was on your side. (What output you saw in the console.)

dchudz commented 2 months ago

Also sorry that happened! Having things break in a demo is not the look we want!

rsignell commented 2 months ago

@dchudz because it was a live demo, I didn't really want to spend time looking at or snapshotting the errors -- I think it was something about lambda disconnecting or something?

I also have tried several times to recreate the problem so I could get more info, but have not been able to reproduce! Was hoping there was something captured in the info Coiled collects, but it sounds like not?

phofl commented 2 months ago

This sounds like your machine briefly disconnected from the cluster. Is this something that sounds plausible to you?

rsignell commented 2 months ago

Like dropping the internet connection? Yes, it's possible -- it was at the Bigelow Marine Center and during Ocean Hack Week -- so perhaps the network glitched. But I was interacting with Coiled from a 2i2c JupyterHub (not sure where that was running).

I'm including Alex Kerney @abkfenris here in case he has thoughts, as he has more knowledge of the infrastructure we were using.

phofl commented 2 months ago

Yeah, a dropped internet connection would be possible, but if your client was running on a different server then this is unlikely.

abkfenris commented 2 months ago

We had about 30 folks in the room, but everyone was connected to the same JupyterHub (except maybe for Rich), and most of them were spinning up clusters using Rich's credentials at the same time, and I don't think we had other failures.

rsignell commented 2 months ago

I was on the same JupyterHub as everyone else. I gave everybody a coiled token so they could try the notebook without creating a coiled account, but when I had the problem, I asked if anyone was trying it in real time (while I was presenting) and the answer was no.

The notebook we were running is: https://github.com/fs-jbzambon/opendata-coawst/blob/main/COAWST_explore.ipynb

fjetter commented 2 months ago

The logs tell us that there were eight clients connected. Was this intended? Were there multiple people connected to that cluster? If you shared your credentials and people tried to start a cluster with the same cluster name you were using, this would cause them to connect to that cluster as well. This should work but it is a little uncommon and it's hard to tell what every client did. I don't really know what caused your demo to fail but it is possible that one of those other clients did something that interfered with it. If that's the case, I recommend telling people to use a different name.

One of those clients intentionally disconnected before or while you initiated the cluster shutdown. It was the very first client that connected, so it's reasonable to assume that this was yours.

(scheduler)     2024-08-27 17:49:30.626000 distributed.scheduler - INFO - Remove client Client-60f882cc-649c-11ef-9310-82dd876de329
(scheduler)     2024-08-27 17:49:30.626000 distributed.core - INFO - Received 'close-stream' from tls://127.0.0.1:51602; closing.
(scheduler)     2024-08-27 17:49:30.627000 distributed.scheduler - INFO - Remove client Client-60f882cc-649c-11ef-9310-82dd876de329
(scheduler)     2024-08-27 17:49:30.628000 distributed.scheduler - INFO - Close client connection: Client-60f882cc-649c-11ef-9310-82dd876de329

The Received 'close-stream' log message is only logged under three circumstances

so I suspect you turned off the cluster using shutdown after something bad happened?

So contrary to what @phofl suspected earlier, I think this was not a "bad connection" after all but something else.
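For anyone who wants to check a scheduler log excerpt like the one above for how many distinct clients appear in it, here is a minimal sketch. The `distinct_clients` and `close_stream_lines` helpers and the regular expression are my own illustration, not part of Coiled or distributed; they only assume the `Client-<uuid>` format visible in the log lines quoted in this thread.

```python
import re

# Log excerpt copied from the scheduler output quoted above.
LOG = """\
(scheduler)     2024-08-27 17:49:30.626000 distributed.scheduler - INFO - Remove client Client-60f882cc-649c-11ef-9310-82dd876de329
(scheduler)     2024-08-27 17:49:30.626000 distributed.core - INFO - Received 'close-stream' from tls://127.0.0.1:51602; closing.
(scheduler)     2024-08-27 17:49:30.627000 distributed.scheduler - INFO - Remove client Client-60f882cc-649c-11ef-9310-82dd876de329
(scheduler)     2024-08-27 17:49:30.628000 distributed.scheduler - INFO - Close client connection: Client-60f882cc-649c-11ef-9310-82dd876de329
"""

# Assumption: client IDs look like "Client-" followed by a UUID (8-4-4-4-12 hex).
CLIENT_ID = re.compile(r"Client-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def distinct_clients(log_text: str) -> set[str]:
    """Return the set of distinct client IDs mentioned in the log text."""
    return set(CLIENT_ID.findall(log_text))


def close_stream_lines(log_text: str) -> list[str]:
    """Return the log lines that mention a 'close-stream' message."""
    return [line for line in log_text.splitlines() if "'close-stream'" in line]


print(len(distinct_clients(LOG)), "distinct client(s) in this excerpt")
print(len(close_stream_lines(LOG)), "close-stream line(s) in this excerpt")
```

Run against the full cluster logs rather than this short excerpt, the same count would show whether more clients than expected were attached to the cluster.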

abkfenris commented 2 months ago

I bet we had someone explore ahead, get to the end of the notebook and issue cluster.shutdown() not realizing that everyone was using the same cluster.

I happened to do a very similar thing once and took out my entire high school's computing infrastructure...

Maybe for demos, use name=f'coawst-{uuid()}' to make sure each user gets their own cluster when sharing accounts.
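A minimal sketch of that naming idea, assuming Python's standard `uuid` module (a bare `uuid()` call would need to be something like `uuid.uuid4()`). The `unique_cluster_name` helper and the `n_workers=10` value are hypothetical, and the Coiled call is left commented out because it needs the `coiled` package and an account.

```python
import uuid


def unique_cluster_name(prefix: str = "coawst") -> str:
    """Return a cluster name with a short random suffix.

    uuid4() is random, so two participants running this cell should
    get different names (the 8-character suffix makes a collision
    very unlikely, though not impossible).
    """
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


name = unique_cluster_name()
print(name)

# Hypothetical usage in the demo notebook (not run here):
# import coiled
# cluster = coiled.Cluster(name=name, n_workers=10)
```

With a per-user name, a `cluster.shutdown()` at the end of one person's notebook should only affect the cluster that person created.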

Rich, when you looked in Coiled, I think we only saw a single cluster, and we were confused why we didn't see a bunch of clusters.

rsignell commented 2 months ago

@fjetter @abkfenris, yep, I think you nailed it. I forgot that by using the same name we would be using the same cluster. Great idea to use name=f'coawst-{uuid()}'!

And I indeed did have a cluster shutdown at the end of the notebook. GRRRR.... User error on my part as instructor! Oh wait, I mean, a "learning experience!" 😄