Closed michaelmcguinness closed 7 years ago
Thanks for the detailed description. My suspicion is that this isn't Fabio related but you've done testing around this. I'm currently on vacation and will start picking up Fabio issues as of next week again.
@michaelmcguinness Did you make any progress with this? What do the fabio logs say?
No. Running it as raw_exec to work around the issue. Would I be right in saying that you mean the logs exposed by
$ nomad logs -verbose -stderr <allocid>
That being the case there is nothing interesting there...
2016/12/17 15:52:58 [INFO] Version 1.3.4 starting
2016/12/17 15:52:58 [INFO] Go runtime is go1.7.3
2016/12/17 15:52:58 [INFO] Using routing strategy "rnd"
2016/12/17 15:52:58 [INFO] Using routing matching "prefix"
2016/12/17 15:52:58 [INFO] Setting GOGC=800
2016/12/17 15:52:58 [INFO] Setting GOMAXPROCS=1
2016/12/17 15:52:58 [INFO] Metrics disabled
2016/12/17 15:52:58 [INFO] consul: Connecting to "localhost:8500" in datacenter "vpc-poc"
2016/12/17 15:52:58 [INFO] Admin server listening on ":9998"
2016/12/17 15:52:58 [INFO] HTTP proxy listening on :9999
2016/12/17 15:52:58 [INFO] consul: Using dynamic routes
2016/12/17 15:52:58 [INFO] consul: Using tag prefix "urlprefix-"
2016/12/17 15:52:58 [INFO] consul: Watching KV path "/fabio/config"
2016/12/17 15:52:58 [INFO] consul: Health changed to #2443304
2016/12/17 15:52:58 [INFO] consul: Skipping service "_prometheus-node-exporter-http" since agent on node "dns_slave" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_prometheus-node-exporter-process" since agent on node "dns_slave" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_nomad-server-nomad-serf" since agent on node "nomad_server1" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_nomad-server-nomad-rpc" since agent on node "nomad_server1" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_nomad-server-nomad-serf" since agent on node "nomad_server2" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_nomad-server-nomad-rpc" since agent on node "nomad_server2" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_nomad-server-nomad-http" since agent on node "nomad_server2" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_nomad-server-nomad-serf" since agent on node "nomad_server3" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_prometheus-node-exporter-http" since agent on node "nomad_server3" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_prometheus-node-exporter-process" since agent on node "nomad_server3" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Manual config changed to #2409046
2016/12/17 15:52:58 [INFO] Updated config to
2016/12/17 15:52:58 [INFO] consul: Registered fabio with id "fabio-ip-10-75-70-74-9998"
2016/12/17 15:52:58 [INFO] consul: Registered fabio with address "10.75.70.74"
2016/12/17 15:52:58 [INFO] consul: Registered fabio with tags ""
2016/12/17 15:52:58 [INFO] consul: Registered fabio with health check to "http://[10.75.70.74]:9998/health"
2016/12/17 15:52:58 [INFO] consul: Health changed to #2443305
and then just repeated messages about Consul.
If there is nothing in the fabio logs then that supports my suspicion that this is a docker and/or nomad issue. Maybe the way fabio interacts with docker or the way it shuts down triggers this. However, since you're running this with a single listener you could try to simulate this with a simple go program that runs a web server, then a reverse proxy, and then a reverse proxy that makes long polling http requests.
Below is a simple reverse proxy for testing. Store it in ~/gopath/src/fabiotest/main.go and then build with go build in that directory. Make sure you have set export GOPATH=~/gopath. When running you can test both endpoints with curl localhost:9998 and curl localhost:9999. Except for the long-polling outgoing connection to consul this is in essence the core of fabio. :)
You can use the following Dockerfile:
FROM scratch
ADD fabiotest /fabiotest
EXPOSE 9998 9999
CMD ["/fabiotest"]
See how far you get with this.
package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	var proxyAddr, uiAddr, proxyURL string
	flag.StringVar(&proxyAddr, "proxy", ":9999", "host:port of the proxy")
	flag.StringVar(&uiAddr, "ui", ":9998", "host:port of the ui")
	flag.StringVar(&proxyURL, "proxyURL", "https://www.google.com/", "proxy url")
	flag.Parse()

	log.Println("fabiotest starting")

	go func() {
		u, err := url.Parse(proxyURL)
		if err != nil {
			log.Fatal("proxyURL:", err)
		}
		log.Println("proxy listening on", proxyAddr, "proxying", u)
		rp := httputil.NewSingleHostReverseProxy(u)
		if err := http.ListenAndServe(proxyAddr, rp); err != nil {
			log.Fatal("proxy:", err)
		}
	}()

	go func() {
		log.Println("UI listening on", uiAddr)
		http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintln(w, "UI is OK")
		})
		if err := http.ListenAndServe(uiAddr, nil); err != nil {
			log.Fatal("ui:", err)
		}
	}()

	log.Println("Press CTRL-C to stop")
	select {}
}
Thanks I'll give this a go and let you know the result. Actually I upgraded to Nomad 0.5.1 just in case but no change.
This issue remains with this test code so as you suspected it is a more generic issue. I actually also tested it with the Consul binary and got the same thing. Thank you for your time on this. I'll head over to the Nomad issues page with it.
As a side note, what should I have been expecting from 'curl localhost:9999'? I am getting a 404. Also I noticed that you posted a Dockerfile. As the issue was specifically for the exec driver I was not sure what you wanted me to do with it.
You may (or may not :)) be interested to know that this issue is something to do with the kernel version and the LVM storage driver implementation. Haven't quite figured it out but switching to AUFS makes the issue go away.
Thanks again for your attention and a great utility.
I am interested and I'm glad you've figured it out. If you have a reference issue for nomad feel free to link it.
Thanks and merry christmas. Enjoy the holidays.
I raised this up with Kelsey Hightower as it was his demo that made me look at it. Not for a fix but just for some info about his env. I'm not sure I am going to burn time raising it with Nomad as we are way behind here with our kernel revision (one of the many things on my to-do list). It seems to me that LVM may not be the strategic storage option for Docker so I think fixing forward by upgrading is the way to go.
I realise how unlikely the title of this issue seems but if there is an obvious error in my set up I can't spot it. I want to run Fabio as a Nomad managed service using the Nomad system scheduler (type = "system"). When I do, any subsequent pull from our private Docker registry fails with the error
failed to register layer: open /dev/mapper/docker-202:32-786433-35e363b33db58a87d6a55b19f3297715b9978052e70edec86f03b51af3e44455: no such file or directory
From that point on I am not able to recover Docker.
Some details about our set up:
Ubuntu 14.04
Kernel = 3.13.0-53-generic
Docker = 1.12.2
Nomad = 0.5.0
Fabio = 1.3.4
I have 3 servers and 2 clients. I am trying to run Fabio using the exec driver and the system scheduler. I am running Nomad as the root user, which I believe is required for the exec driver.
I do not see the issue if I run Fabio using the service scheduler.
I do not see the issue if I run a Docker container using the system scheduler.
I do not see the issue if I run another job (sleep binary) using the system scheduler.
I do not see the issue if I run Fabio using the system scheduler but using the raw_exec driver.
Docker is using the LVM storage option but I see the same issue if I drop back to the devicemapper storage option.
Below is a repeatable test case. After that are copies of the job specs used in the test case.
Go to Nomad user
ubuntu@ip-10-75-70-27:~$ sudo su - nomad
Software versions
Nomad running as root with no running jobs
Demonstrate Docker pull
Remove pulled image
Run 'sleep' test job
Pull Docker image
Remove pulled image
Stop 'sleep' job
Start Fabio job
Pull docker image
Fabio job dies (10 minutes later), from syslog
From Docker log
Fabio Job Spec
Sleep Job Spec