brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.38k stars 64 forks source link

Surface zqd startup problems #2022

Open philrz opened 3 years ago

philrz commented 3 years ago

Repro is with GA Brim tagged v0.23.0 and GA zq tagged v0.28.0. This issue probably bleeds over between the worlds of Brim and zq, but since triggering the symptom involves zqd, I figure I'll start on this side.

I'm currently helping a community user debug a Brim problem on Windows where they report having been greeted with the "Connection Error: The service at localhost:9867 could not be reached error, similar to this:

image

While we have yet to get to the bottom of precisely why it's failing in their environment, I've been trying to reproduce this class of error so I can get a feel for ways it might be triggered and what the logs look like when it's triggered.

One thought that came to mind is that there could perhaps be some other process listening on TCP port 9867 that would prevent zqd from successfully being able to listen. Indeed, in acquiring the screenshot shown above, I did precisely this. I installed socat (link) and started it with this command line:

socat TCP4-LISTEN:9867,reuseaddr,fork,bind=127.0.0.1 OPEN:\tmp\foo.txt,creat,append

At this point, the contents of Brim's C:\Users\Phil\AppData\Roaming\Brim\logs\main.log shows that it launched zqd and gives no sign that it was less than successful, which is somewhat concerning:

[2021-01-27 12:58:56.144] [info] app paths: getAppPath=C:\Users\Phil\AppData\Local\Brim\app-0.23.0\resources\app userData=C:\Users\Phil\AppData\Roaming\Brim logs=C:\Users\Phil\AppData\Roaming\Brim\logs
[2021-01-27 12:58:56.401] [info] migrations: currentVersion=202101210823 pending=0
[2021-01-27 12:58:56.410] [info] zqd core log C:\Users\Phil\AppData\Roaming\Brim\logs\zqd.log
[2021-01-27 12:58:56.411] [info] zqd config C:\Users\Phil\AppData\Roaming\Brim\zqd-config.yaml
[2021-01-27 12:58:56.413] [info] spawning zqd: C:\Users\Phil\AppData\Local\Brim\app-0.23.0\resources\app\zdeps\zqd.exe listen -l localhost:9867 -data C:\Users\Phil\AppData\Roaming\Brim\data\spaces -config C:\Users\Phil\AppData\Roaming\Brim\zqd-config.yaml -suricatarunner C:\Users\Phil\AppData\Local\Brim\app-0.23.0\resources\app\zdeps\suricata\suricatarunner.exe -suricataupdater C:\Users\Phil\AppData\Local\Brim\app-0.23.0\resources\app\zdeps\suricata\suricataupdater.exe -zeekrunner C:\Users\Phil\AppData\Local\Brim\app-0.23.0\resources\app\zdeps\zeek\zeekrunner.exe

Meanwhile, C:\Users\Phil\AppData\Roaming\Brim\logs\zqd.log does show the failure for zqd to start successfully.

{"level":"info","ts":1611782351.78618,"msg":"Open files limit raised","limit":0}
{"level":"info","ts":1611782351.7951748,"logger":"database","msg":"Loaded","kind":"file","uri":"file:///C:/Users/Phil/AppData/Roaming/Brim/data/spaces/zqd.json"}
{"level":"info","ts":1611782351.7981744,"logger":"manager","msg":"Loaded","root":"file:///C:/Users/Phil/AppData/Roaming/Brim/data/spaces"}
{"level":"info","ts":1611782351.799173,"logger":"core","msg":"Started","personality":"all","suricata_supported":true,"zeek_supported":true}
{"level":"info","ts":1611782351.799173,"msg":"Launching suricata updater"}
{"level":"error","ts":1611782351.8111649,"logger":"httpd","msg":"Listen error","error":"listen tcp 127.0.0.1:9867: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.","stacktrace":"github.com/brimsec/zq/pkg/httpd.(*Server).Start\n\t/home/runner/work/zq/zq/pkg/httpd/server.go:42\ngithub.com/brimsec/zq/ppl/cmd/zqd/listen.(*Command).Run\n\t/home/runner/work/zq/zq/ppl/cmd/zqd/listen/command.go:134\ngithub.com/mccanne/charm.(*instance).run\n\t/home/runner/go/pkg/mod/github.com/mccanne/charm@v0.0.3-0.20191224190439-b05e1b7b1be3/instance.go:38\ngithub.com/mccanne/charm.(*Spec).Exec\n\t/home/runner/go/pkg/mod/github.com/mccanne/charm@v0.0.3-0.20191224190439-b05e1b7b1be3/charm.go:67\ngithub.com/mccanne/charm.(*instance).run\n\t/home/runner/go/pkg/mod/github.com/mccanne/charm@v0.0.3-0.20191224190439-b05e1b7b1be3/instance.go:61\ngithub.com/mccanne/charm.(*Spec).ExecRoot\n\t/home/runner/go/pkg/mod/github.com/mccanne/charm@v0.0.3-0.20191224190439-b05e1b7b1be3/charm.go:77\nmain.main\n\t/home/runner/work/zq/zq/ppl/cmd/zqd/main.go:13\nruntime.main\n\t/opt/hostedtoolcache/go/1.15.6/x64/src/runtime/proc.go:204"}
{"level":"info","ts":1611782351.812165,"logger":"core","msg":"Shutdown","personality":"all"}

My Windows Task Manager does not show zqd.exe running as it would in a normal, healthy launch, so this indicates the process did not start.

image

Normally we require this same :9867 connection for zqd to communicate errors back to Brim, but obviously that's not available in this failure scenario. Perhaps the first question to ask would be: Is there some way for zqd to communicate the nature of the failure directly back to Brim at that spawning step, such that we might be able to provide more detail to the user?

philrz commented 3 years ago

I realize a day later that this is another instance of something we first considered when creating #534. I'll hold it open though since it serves as a specific motivating example.