Open t-lin opened 4 years ago
Interesting... seems like the proxy processes are dying and becoming zombie processes.
$ ps faux | grep -i proxy
root 10455 0.0 0.0 0 0 pts/0 Z+ 18:03 0:00 | \_ [proxy] <defunct>
root 11249 0.7 0.0 0 0 pts/0 Z+ 21:30 0:00 | \_ [proxy] <defunct>
root 11331 0.5 0.0 0 0 pts/0 Z+ 21:31 0:00 | \_ [proxy] <defunct>
root 11401 2.1 0.0 0 0 pts/0 Z+ 21:31 0:00 \_ [proxy] <defunct>
ubuntu 11431 0.0 0.0 14856 1024 pts/2 S+ 21:31 0:00 | \_ grep --color=auto -i proxy
This explains another oddity I've observed... the ping-monitor does not have metrics for any of the proxies.
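For what it's worth, a `<defunct>` entry means the child process exited but its parent never reaped it with a wait() call. If the proxies are spawned via `os/exec`, the reaping would look roughly like the sketch below (the `startProxy` helper and the binary path are hypothetical, not the actual launcher code):

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// startProxy launches a child process and reaps it once it exits, so it
// never lingers in the process table as a zombie. Hypothetical helper for
// illustration only.
func startProxy(path string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(path, args...)
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	// Reap the child in the background; without this Wait(), an exited
	// child shows up in `ps` as [proxy] <defunct>.
	go func() {
		if err := cmd.Wait(); err != nil {
			log.Printf("proxy (pid %d) exited: %v", cmd.Process.Pid, err)
		}
	}()
	return cmd, nil
}

func main() {
	if _, err := startProxy("/usr/local/bin/proxy"); err != nil { // placeholder path
		log.Fatal(err)
	}
	time.Sleep(5 * time.Second) // keep the parent alive for the demo
}
```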
Proxy does not seem able to connect to peers:
2020/06/05 19:53:18.873145 p2pnode.go:186: Creating new p2p host
2020/06/05 19:53:19.235917 p2pnode.go:196: Setting stream handlers
2020/06/05 19:53:19.236008 p2pnode.go:206: Creating DHT
2020/06/05 19:53:19.236191 p2pnode.go:269: No bootstraps provided, not connecting to any peers
2020/06/05 19:53:19.236205 p2pnode.go:286: Creating Routing Discovery
2020/06/05 19:53:19.236555 p2pnode.go:297: Finished setting up libp2p Node with PID QmNMsU6e7s9AkUWR1PjnFBLcEUcEAJNKS6T6XNQCYLH8Tn and Multiaddresses [/ip4/127.0.0.1/tcp/44319 /ip4/10.11.69.11/tcp/44319 /ip4/172.17.0.1/tcp/44319 /ip6/::1/tcp/39187]
Unable to connect to any peers, retrying in 2 seconds...
Unable to connect to any peers, retrying in 1 seconds...
2020/06/05 19:53:21.237284 hl-common.go:84:
Unable to connect to any peers, retrying in 4 seconds...
...
Unable to connect to any peers, retrying in 1 seconds...
2020/06/05 19:53:25.238479 hl-common.go:84:
Unable to connect to any peers, retrying in 8 seconds...
...
Unable to connect to any peers, retrying in 1 seconds...
2020/06/05 19:53:33.240626 hl-common.go:84:
Unable to connect to any peers, retrying in 16 seconds...
...
Unable to connect to any peers, retrying in 1 seconds...
2020/06/05 19:53:49.244927 hl-common.go:84:
2020/06/05 19:53:49.246072 proxy.go:307: ERROR: Unable to create LCA Manager
hl-common: Failed to connect to any hash-lookup peers
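For context, the delays in the log (2s, 4s, 8s, 16s) suggest a retry loop with exponential backoff before the LCA Manager creation finally fails. A rough sketch of that pattern is below; this is only an illustration of the behaviour seen in the log, not the actual hl-common code, and `connectWithBackoff` is an assumed name:

```go
package hlretry

import (
	"errors"
	"log"
	"time"
)

// connectWithBackoff retries connect() with exponentially growing delays
// (2s, 4s, 8s, 16s, ...) until it succeeds or maxWait is exhausted.
// Rough sketch of the pattern visible in the log above; not the actual
// hl-common implementation.
func connectWithBackoff(connect func() error, maxWait time.Duration) error {
	delay := 2 * time.Second
	deadline := time.Now().Add(maxWait)
	for {
		if err := connect(); err == nil {
			return nil
		}
		if time.Now().Add(delay).After(deadline) {
			return errors.New("hl-common: Failed to connect to any hash-lookup peers")
		}
		log.Printf("Unable to connect to any peers, retrying in %v...", delay)
		time.Sleep(delay)
		delay *= 2
	}
}
```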
The currently observed issue is caused by the regression reported in #39.
However, others have reported that this issue existed before the regression, so #39 will need to be fixed first; this issue will then be re-assessed to see whether it still persists.
The proxy issues in #39 have been resolved. With small images (e.g. the `hello-world-server` tests), the multiple-container issue no longer exists.
However, multiple containers still get created when the container takes a while to fully start. This has been simulated by simply adding a `sleep 60` in front of the container's `CMD`. Keeping this issue open for now (will rename to remove proxy as the root cause).
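For reference, the slow-start simulation is just a delay in front of the service command; a hypothetical version of such an image (base image and binary name are placeholders, not the actual test image):

```dockerfile
# Simulates a container that is "running" long before the service actually
# listens: the proxy's connection attempts will fail for the first minute.
FROM alpine:3.12
COPY hello-world-server /hello-world-server
CMD ["sh", "-c", "sleep 60 && exec /hello-world-server"]
```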
Recent bug fixes (#39, then #40) seem to have solved this issue.
Even with the `sleep 60`, the `allocator` returns the appropriate endpoint info back to the `proxy`. When it attempts to connect to the endpoint, it will fail (due to the service not being up yet) as expected, but it will not create extra duplicate containers. Thus, closing this issue.
This issue appears to be back with the HTTP-over-P2P proxy. The likely root cause is that we currently have no way to tell when a container is "ready", so we fail, keep retrying, and hence keep booting new containers.
Will need to investigate and figure out a fix (or a temporary work-around until we have a "ready" mechanism similar to what K8s offers).
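One possible stop-gap, sketched below purely as an illustration (the `waitUntilReady` name, the TCP-level check, and the poll interval are assumptions, not the planned design), would be for the proxy to poll the freshly allocated endpoint until it accepts a connection, rather than treating every failed dial as grounds for allocating yet another container:

```go
package ready

import (
	"context"
	"net"
	"time"
)

// waitUntilReady polls an allocated endpoint (host:port) until a TCP
// connection succeeds or ctx expires. Sketch of a temporary work-around,
// loosely modelled on a K8s-style readiness probe; not the actual proxy code.
func waitUntilReady(ctx context.Context, endpoint string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		d := net.Dialer{Timeout: interval}
		if conn, err := d.DialContext(ctx, "tcp", endpoint); err == nil {
			conn.Close()
			return nil // service is accepting connections
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // give up; caller decides whether to re-allocate
		case <-ticker.C:
			// container not ready yet; poll again instead of booting a new one
		}
	}
}
```

A TCP accept doesn't guarantee the application is actually serving, so a real readiness mechanism would probably still need the service (or the allocator) to signal readiness explicitly.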
Not sure if this is a timing issue or not (e.g. the proxy hasn't had time to come up yet).
When trying the hello-world example, I noticed the `allocator` properly created a container, but the `proxy` (source `proxy`) was unable to contact the destination `proxy`, so it fell back to creating another container, and this cycle repeated.

Log from source proxy:

The result is that multiple containers were created within a short timespan (in each node with an `allocator`):

A subsequent attempt seemed to have succeeded, but only after the second container was created: