Repro is with GA Brim tagged v0.23.0 and GA zq tagged v0.28.0. This issue probably bleeds over between the worlds of Brim and zq, but since triggering the symptom involves zqd, I figure I'll start on this side.
I'm currently helping a community user debug a Brim problem on Windows where they report having been greeted with the "Connection Error: The service at localhost:9867 could not be reached error, similar to this:
While we have yet to get to the bottom of precisely why it's failing in their environment, I've been trying to reproduce this class of error so I can get a feel for ways it might be triggered and what the logs look like when it's triggered.
One thought that came to mind is that there could perhaps be some other process listening on TCP port 9867 that would prevent zqd from successfully being able to listen. Indeed, in acquiring the screenshot shown above, I did precisely this. I installed socat (link) and started it with this command line:
At this point, the contents of Brim's C:\Users\Phil\AppData\Roaming\Brim\logs\main.log shows that it launched zqd and gives no sign that it was less than successful, which is somewhat concerning:
Meanwhile, C:\Users\Phil\AppData\Roaming\Brim\logs\zqd.logdoes show the failure for zqd to start successfully.
{"level":"info","ts":1611782351.78618,"msg":"Open files limit raised","limit":0}
{"level":"info","ts":1611782351.7951748,"logger":"database","msg":"Loaded","kind":"file","uri":"file:///C:/Users/Phil/AppData/Roaming/Brim/data/spaces/zqd.json"}
{"level":"info","ts":1611782351.7981744,"logger":"manager","msg":"Loaded","root":"file:///C:/Users/Phil/AppData/Roaming/Brim/data/spaces"}
{"level":"info","ts":1611782351.799173,"logger":"core","msg":"Started","personality":"all","suricata_supported":true,"zeek_supported":true}
{"level":"info","ts":1611782351.799173,"msg":"Launching suricata updater"}
{"level":"error","ts":1611782351.8111649,"logger":"httpd","msg":"Listen error","error":"listen tcp 127.0.0.1:9867: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.","stacktrace":"github.com/brimsec/zq/pkg/httpd.(*Server).Start\n\t/home/runner/work/zq/zq/pkg/httpd/server.go:42\ngithub.com/brimsec/zq/ppl/cmd/zqd/listen.(*Command).Run\n\t/home/runner/work/zq/zq/ppl/cmd/zqd/listen/command.go:134\ngithub.com/mccanne/charm.(*instance).run\n\t/home/runner/go/pkg/mod/github.com/mccanne/charm@v0.0.3-0.20191224190439-b05e1b7b1be3/instance.go:38\ngithub.com/mccanne/charm.(*Spec).Exec\n\t/home/runner/go/pkg/mod/github.com/mccanne/charm@v0.0.3-0.20191224190439-b05e1b7b1be3/charm.go:67\ngithub.com/mccanne/charm.(*instance).run\n\t/home/runner/go/pkg/mod/github.com/mccanne/charm@v0.0.3-0.20191224190439-b05e1b7b1be3/instance.go:61\ngithub.com/mccanne/charm.(*Spec).ExecRoot\n\t/home/runner/go/pkg/mod/github.com/mccanne/charm@v0.0.3-0.20191224190439-b05e1b7b1be3/charm.go:77\nmain.main\n\t/home/runner/work/zq/zq/ppl/cmd/zqd/main.go:13\nruntime.main\n\t/opt/hostedtoolcache/go/1.15.6/x64/src/runtime/proc.go:204"}
{"level":"info","ts":1611782351.812165,"logger":"core","msg":"Shutdown","personality":"all"}
My Windows Task Manager does not show zqd.exe running as it would in a normal, healthy launch, so this indicates the process did not start.
Normally we require this same :9867 connection for zqd to communicate errors back to Brim, but obviously that's not available in this failure scenario. Perhaps the first question to ask would be: Is there some way for zqd to communicate the nature of the failure directly back to Brim at that spawning step, such that we might be able to provide more detail to the user?
I realize a day later that this is another instance of something we first considered when creating #534. I'll hold it open though since it serves as a specific motivating example.
Repro is with GA Brim tagged
v0.23.0
and GAzq
taggedv0.28.0
. This issue probably bleeds over between the worlds of Brim andzq
, but since triggering the symptom involveszqd
, I figure I'll start on this side.I'm currently helping a community user debug a Brim problem on Windows where they report having been greeted with the "Connection Error: The service at localhost:9867 could not be reached error, similar to this:
While we have yet to get to the bottom of precisely why it's failing in their environment, I've been trying to reproduce this class of error so I can get a feel for ways it might be triggered and what the logs look like when it's triggered.
One thought that came to mind is that there could perhaps be some other process listening on TCP port
9867
that would preventzqd
from successfully being able to listen. Indeed, in acquiring the screenshot shown above, I did precisely this. I installedsocat
(link) and started it with this command line:At this point, the contents of Brim's
C:\Users\Phil\AppData\Roaming\Brim\logs\main.log
shows that it launchedzqd
and gives no sign that it was less than successful, which is somewhat concerning:Meanwhile,
C:\Users\Phil\AppData\Roaming\Brim\logs\zqd.log
does show the failure forzqd
to start successfully.My Windows Task Manager does not show
zqd.exe
running as it would in a normal, healthy launch, so this indicates the process did not start.Normally we require this same
:9867
connection forzqd
to communicate errors back to Brim, but obviously that's not available in this failure scenario. Perhaps the first question to ask would be: Is there some way forzqd
to communicate the nature of the failure directly back to Brim at thatspawning
step, such that we might be able to provide more detail to the user?