coiled / feedback

A place to provide Coiled feedback

Cluster stalling without error message #172

Closed: ncclementi closed this 2 years ago

ncclementi commented 2 years ago

From Kathryn Berger in slack https://coiled-users.slack.com/archives/C0195GJKQ1G/p1657101903467519

Hello Coiled Team, I'm having issues with Coiled clusters that appear to start without issue but stall over time without issuing an error message. I've encountered this for the past two days, whereas previously everything had been running smoothly with the same configuration. With this cluster (drewbo-38c7fef1-b) the worker logs suggested a memory leak, so I changed the instance type and number of workers (see this cluster: drewbo-191f5c4a-b). While waiting for the latter cluster to initiate, the GUI itself was slow to start and revealed a cryptic error message stating, among other things…

WARNING: App 'org.gnome.Shell.desktop' respawning too quickly
Jul  6 08:53:19 ip-10-1-4-158 gnome-session[2322]: gnome-session-binary[2322]: CRITICAL: We failed, but the fail whale is dead. Sorry....

The second cluster, despite the changed instance size and number of workers, appears to run into the same stalling issue. Is it possible to shed some light on what might be going on here? It seems similar to Julian's issue above, except that ours progresses through 80-90% of the task. In general, both the Coiled logs and the AWS CloudWatch logs have revealed nothing out of the ordinary. Thanks in advance for your time!

ncclementi commented 2 years ago

The logs show something happening in zict:

Jul  6 09:39:21 ip-10-1-4-158 cloud-init[2465]: Traceback (most recent call last):
Jul  6 09:39:21 ip-10-1-4-158 cloud-init[2465]:   File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/worker.py", line 3558, in _prepare_args_for_execution
Jul  6 09:39:21 ip-10-1-4-158 cloud-init[2465]:     data[k] = self.data[k]
Jul  6 09:39:21 ip-10-1-4-158 cloud-init[2465]:   File "/opt/conda/envs/coiled/lib/python3.8/site-packages/zict/buffer.py", line 108, in __getitem__
Jul  6 09:39:21 ip-10-1-4-158 cloud-init[2465]:     raise KeyError(key)
Jul  6 09:39:21 ip-10-1-4-158 cloud-init[2465]: KeyError: 'save_metadata-0c54af58-9c1a-4e43-a1fa-4f59a90f240e'

In these lines https://github.com/dask/zict/blob/d1bf75e951ec1d1f6420bab1e99fb2c990adc9ce/zict/buffer.py#L102-L108

It seems it can't find the key. My guess is that it's running out of disk space and is trying to keep the key in RAM, but there is no room in RAM either, although I'm not entirely sure.
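
For context, a minimal sketch of how zict.Buffer behaves (hypothetical values; distributed's spill-to-disk storage is built on this class). A KeyError like the one in the traceback means the key is in neither the fast (memory) mapping nor the slow (disk) mapping:

from zict import Buffer

# Keys live in `fast` until the total weight exceeds n, then spill to `slow`.
fast, slow = {}, {}
data = Buffer(fast, slow, n=2, weight=lambda k, v: 1)

data["x"] = 1
data["y"] = 2
data["z"] = 3  # one key is moved from `fast` into `slow`

data["x"]  # fine: the key is found in one of the two mappings
try:
    data["never-stored"]
except KeyError:
    print("in neither mapping -> KeyError, as in the traceback above")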

@crusaderky would you mind commenting here?

Here is the analytics page that shows their code: https://cloud.coiled.io/drewbo/analytics/clusters/8922?version=2 and here is the details page of the cluster with the logs: https://cloud.coiled.io/drewbo/clusters/38596/details

crusaderky commented 2 years ago

My guess is that it's running out of disk space and is trying to keep the key in RAM, but there is no room in RAM either.

No. We handle out-of-disk-space errors gracefully; they would not cause a generic KeyError. The error message is saying that the worker state machine is corrupted. This is not an OS/hardware error and it should never happen. If you run Worker.validate_state(), it will trip.

We need a cluster dump to debug this issue. Before we do that,

  1. is the issue easy to reproduce?

  2. is the client using the latest version of distributed?

  3. if not, does the issue disappear after upgrading to the latest version? It is not strictly necessary to have a dump from the latest distributed version, but it helps.

  4. does setting distributed.worker.validate: True cause the worker to trip faster and with more explicit info?
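
For reference, a minimal sketch of one way to turn this validation on (the setting has to be in effect in the worker processes, e.g. via dask config that gets shipped to the cluster):

import dask

# Enable the worker state machine's internal consistency checks so that a
# corrupted state fails loudly instead of silently deadlocking. This is
# expensive, so only use it while debugging.
dask.config.set({"distributed.worker.validate": True})

The same setting can also go in a dask YAML config file, or in the environment variable DASK_DISTRIBUTED__WORKER__VALIDATE=True on the workers.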

ncclementi commented 2 years ago

cc: @kathrynberger for visibility.

Here is some documentation to get a cluster dump https://distributed.dask.org/en/stable/api.html#distributed.Client.dump_cluster_state

client.dump_cluster_state(dump_path)

Note that this could be written to an S3 bucket, in which case you'll do something like

client.dump_cluster_state(dump_uri, **storage_options)
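
For example, a minimal sketch with a hypothetical bucket and prefix (writing to S3 requires s3fs in the environment; extra keyword arguments are passed through to the filesystem as storage options):

# Produces stalled-run.msgpack.gz under the given prefix.
client.dump_cluster_state(
    "s3://my-bucket/cluster-dumps/stalled-run",
    format="msgpack",
)
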
kathrynberger commented 2 years ago

We need a cluster dump to debug this issue. Before we do that,

  1. is the issue easy to reproduce?

For me, yes - this issue has reproduced 4x since Monday, after I returned from a brief time away from the office. Previously, using the same configuration, everything had run relatively smoothly.

  2. is the client using the latest version of distributed?

We had been using distributed==2022.01.0; I've just now upgraded it.

  3. if not, does the issue disappear after upgrading to the latest version?

I'm still working to determine this, as I'm trying to test this locally first without interfering with what is currently in production. Unfortunately, I've not been able to complete this today and will not be back online until next Tuesday when I will pick this back up again.

However, a question about providing a dump_cluster_state that I cannot answer from the documentation: does it matter where in the script the call is placed?

It is not strictly necessary to have a dump from the latest distributed version, but it helps.

  4. does setting distributed.worker.validate: True cause the worker to trip faster and with more explicit info?

kathrynberger commented 2 years ago

Hi @ncclementi and @crusaderky, thanks very much for your time and effort on troubleshooting this one. I've updated distributed, but that does not seem to solve the problem; I can now confirm that the issue repeats every time I run Coiled. As described above, I've had to test this locally first so as not to interfere with what is currently in production, so it has taken a bit longer to explore.

Would it be possible to get clarification on where the dump_cluster_state call should be placed (at the start or the end of the script)? I cannot tell from the documentation, and my current implementation isn't writing any output to the S3 bucket.

Additional things I have tried:

Any thoughts? Advice?

crusaderky commented 2 years ago

@kathrynberger you should call dump_cluster_state after the cluster has deadlocked.
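
Concretely, something like this (a sketch with hypothetical names: process_chunk and chunks stand in for the real workload, and client is the distributed Client connected to the Coiled cluster):

futures = client.map(process_chunk, chunks)  # the work that eventually stalls
try:
    client.gather(futures)  # hangs once the cluster deadlocks
except KeyboardInterrupt:
    # Ctrl-C out of the hung gather, then capture the state while the
    # stalled workers are still alive.
    client.dump_cluster_state("s3://my-bucket/cluster-dumps/deadlock")
    raise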

  4. does setting distributed.worker.validate: True cause the worker to trip faster and with more explicit info?

Any feedback on this?

kathrynberger commented 2 years ago

An update: using distributed.worker.validate: True did cause the worker to trip faster, but only while I had an error in how dump_cluster_state was being written to S3. After resolving that, the worker did not trip (though the run still hung before completion), and I was able to successfully retrieve a dump_cluster_state output.

Conclusion for now is that our outdated version of distributed was the culprit that caused things to tip over, and an issue with setting up our environment locally hid that fact earlier this week and escalated the investigation further. After a successful run locally, we will deploy these changes to production. I will be sure to follow up here to confirm it worked in production as well. Thanks again for your time and investigation on this!

@crusaderky @ncclementi and cc @phobson

ncclementi commented 2 years ago

@kathrynberger would it be possible to coordinate with @phobson to get access to the cluster dump file, so we can take a deeper look at this?

EDIT: I might have misunderstood: is this still a problem, or will you know once things go into production? Could you tell us more about how you discovered the issue?

Conclusion for now is that our outdated version of distributed was the culprit that caused things to tip over, and an issue with setting up our environment locally hid that fact earlier this week and escalated the investigation further.

kathrynberger commented 2 years ago

@kathrynberger would it be possible to coordinate with @phobson to get access to the cluster dump file, so we can take a deeper look at this?

Of course, @phobson what is the best way to get this to you? Shall I upload it to the issue or get it to you in some other way? And are you alright with the msgpack.gz format?

EDIT: I might have misunderstood: is this still a problem, or will you know once things go into production? Could you tell us more about how you discovered the issue?

To clarify, it appears this is no longer a problem. We had some issues testing this locally that hadn’t convinced us that an upgrade to distributed was the solution. After we were able to fix the local issues, we could successfully run an ingest of monthly data on Coiled. We’re now going to test this on staging before deploying into production, but do not anticipate this to be a problem. I was only going to keep you updated here to confirm it also worked in production. 👍

phobson commented 2 years ago

Ahh cool. Uploading to the issue might not be the best. Care to shoot me an email? paul@coiled.io

kathrynberger commented 2 years ago

Cheers, I had those same concerns about doing so - sending an email over to you shortly 👍

ncclementi commented 2 years ago

@kathrynberger was the problem resolved? Would you mind giving us an update? If there are no more problems, can we close this issue?

kathrynberger commented 2 years ago

Hi @ncclementi, I can confirm that the Coiled problem appears to be solved locally. However, the upgrade of distributed, which seemed to be the solution, has caused problems for our deployment in production, which we haven't yet been able to fix.

phobson commented 2 years ago

Hey @kathrynberger

While I'm glad you were able to solve the initial problem locally, can I ask about the issues with deploying? Would it be helpful to have a synchronous conversation? Let me know and I'll follow up via email.

Secondly, the cluster dump file you created was mostly empty, but I think that's okay since the original issue is resolved.

kathrynberger commented 2 years ago

Hey @kathrynberger

While I'm glad you were able to solve the initial problem locally, can I ask about the issues with deploying? Would it be helpful to have a synchronous conversation? Let me know and I'll follow up via email.

Secondly, the cluster dump file you created was mostly empty, but I think that's okay since the original issue is resolved.

Thanks @phobson for the follow-up. You're right, the mostly empty cluster dump file makes sense, as the original issue had been resolved at that point. From this side, it looks to be an issue with our CDK code, so I don't think it is Coiled-related. We can close this issue for now.

Thanks for the offer to follow up synchronously - I think we are good for now. We'll sort things within our CDK, but may reach out down the road for insight on how to improve our Coiled workflow. Thanks again