brimdata / zed

A novel data lake based on super-structured data
https://zed.brimdata.io/

"pthread_create failed: Resource temporarily unavailable" during Grafana dashboard load #5165

Open · philrz opened 1 month ago

philrz commented 1 month ago

tl;dr

The Zed queries that populate a Grafana dashboard backed by an S3-based Zed lake service can crash the zed serve process. The opening text of the error dump:

runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort

Details

Repro is with Zed commit f497079.

This one is kind of complex to reproduce since it's effectively a production use case with many layers involved:

  1. Grafana
  2. The grafana-zed-datasource
  3. The zed-js client
  4. The Zed lake service

However, since it all ends in a zed serve crash that I wasn't trying to trigger, I'll start by describing the repro and sharing the dump details. Then, if necessary, I can try to craft a simpler test that exhibits the behavior that triggers the crash.

The repro involves our "Autoperf" metrics for tracking performance. The metric data is ultimately stored in an S3 bucket, and I run a Zed service locally on my MacBook with the bucket as its lake directory.

$ zed -version
Version: v1.16.0-6-gf4970799

$ zed -lake s3://brimdataio-lakes-us-east-2/shasta.lake.brimdata.io serve

The config for my Zed datasource plugin for Grafana remains at its default setting of http://localhost:9867.
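
For anyone wanting to poke at the same setup from the command line, the -lake flag also accepts the service URL, so a single query can be issued by hand against the running service, e.g. (the pool name and query here are just illustrative, not what Grafana runs):

$ zed -lake http://localhost:9867 query "from Autoperf | count()"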

As shown in the attached video, when I access the dashboard, Grafana begins filling in the panels by executing the Zed queries that populate them. Grafana is in some ways a black box, but I've found that scrolling down in the large dashboard seems to trigger loading many panels at once (which is to say, if I stayed near the top it might be a gentler test that loads only the first few panels, but now that I've found the bug I'm intentionally being more torturous). Because of the WAN latency from my SF location to the S3 bucket in Ohio, I'm accustomed to the panels taking a long time to load. What I've observed is that sometimes many of the later panels fail to load and are shown with a red "X". As I expect a user would likely do, at this point I refresh to try loading the dashboard again, hoping it will work on the second try. However, as shown here, after another long delay a large error dump appears on the zed serve console as it exits.

https://github.com/brimdata/zed/assets/5934157/669d6c6d-1416-4855-932f-9d4f6d02b46c

The complete error dump is large, so I've attached it here: zed-serve-log.txt

This issue does not seem to repro if I copy the lake contents to my MacBook and serve the lake off my local SSD instead of S3.

While I can't make sense of the error dump, given what I know of the setup, I have some theories. My first time seeing this crash happened to be right after I added three additional panels to plot metrics for a newly added test. Given the high latency between my MacBook and the S3 bucket, I would guess that perhaps there are too many query requests "in flight" and they're exhausting some resource, which would fit the dump starting with a pthread_create failure, i.e., running out of threads in particular.

I respect that this setup is arguably abnormal. That said, if I'm right that the problem is too much concurrent work, perhaps there's still something legitimate to be learned here, e.g., maybe the same kind of crash could happen if many users converged on a lake service and each simultaneously initiated a smaller number of queries in parallel?
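
Since the dump opens with pthread_create failing, I suspect the relevant ceilings are the stock macOS thread limits rather than anything Zed-specific; for reference, these are the knobs I'd check on my machine (just standard macOS settings, nothing from Zed):

$ sysctl kern.num_threads kern.num_taskthreads   # system-wide and per-process thread limits
$ ulimit -u                                      # shell's max user processes limit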

philrz commented 1 month ago

tl;dr

I've attached a shell script repro.sh.gz that reliably reproduces the crash for me with GA Zed tagged v1.17.0.

Before running, start:

zed -lake s3://brimdataio-lakes-us-east-2/shasta.lake.brimdata.io serve

Details

I made this script by using Wireshark to sniff the queries that Grafana issues to the Zed lake API when loading the dashboard. There are actually more queries in the pcap than I transcribed (i.e., I only made the script as big as it needed to be to trigger the problem), so if it doesn't work for others on the first try (locations with lower latency to the S3 bucket in Ohio may have a harder time reproducing than I do), just say the word and I can add more.
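
For anyone who'd rather not open the attachment, the general shape of the script is just many concurrent query requests against the local service. A minimal sketch (the pool name and query below are placeholders; the real script replays the exact queries captured from Grafana):

#!/bin/sh
# Launch a batch of queries against the local Zed lake service in parallel,
# roughly mimicking Grafana loading many dashboard panels at once.
# Placeholder query; repro.sh uses the queries sniffed with Wireshark.
for i in $(seq 1 30); do
  zed -lake http://localhost:9867 query "from Autoperf | count()" >/dev/null &
done
wait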

philrz commented 3 weeks ago

I happened to just now notice some slow performance when accessing that Shasta S3 bucket. On closer inspection, it appears compaction has not run successfully on the Autoperf pool since April, even though the zed manage process has been running this whole time. That makes me think the extra-slow performance of pulling so many tiny objects may go a long way toward aggravating this problem. If we end up fixing whatever's wrong there such that it compacts again, this crash might magically go away, but if so, I could probably recreate the repro conditions by manually creating a fragmented, non-compacted pool.
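
If it comes to that, the rough idea would be to split the data into many small slices and load each one separately, since every load lands as its own commit and the resulting small objects stay fragmented until compaction runs. Something like the following, with pool and file names hypothetical and flag spellings from memory:

$ zed -lake http://localhost:9867 create AutoperfFragmented
$ for f in slice-*.zng; do
    # each load is a separate commit, leaving many tiny objects behind
    zed -lake http://localhost:9867 load -use AutoperfFragmented "$f"
  done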

philrz commented 2 weeks ago

Per the previous comment, the Zed services on Shasta have been updated to GA v1.17.0, compaction ran successfully, and indeed the problem no longer reproduces when the repro script is run against the pool data in this compacted form. However, anticipating this, I saved off the pool beforehand in its fragmented state, so to reproduce the problem now, run:

zed -lake s3://brimdataio-lakes-us-east-2/shasta-fragmented/shasta.lake.brimdata.io serve

...and the repro script still triggers the crash just as before. I just confirmed this with Zed commit 54a5a15.