Multiple containers spawned

t-lin commented 4 years ago

Not sure if this is a timing issue or not (e.g. proxy hasn't had time to come up yet).

When trying the hello-world example, I noticed the allocator properly created a container, but the proxy (source proxy) was unable to contact the destination proxy, so it fell-back to creating another container, and this cycle repeated.

Log from source proxy:

2020/06/04 18:02:52.959139 proxy.go:46: Got request: hello-world-server
2020/06/04 18:02:52.959310 proxy.go:51: [ hello-world-server]
2020/06/04 18:02:52.959333 proxy.go:54: Looking for service with name hello-world-server in hash-lookup
2020/06/04 18:02:52.961129 hl-common.go:99: Connecting to: {Qmaq76Lt4oEiYEbkxwCb6CgKbbp9qw5eWTexsrm84D2hJW: [/ip4/127.0.0.1/tcp/40863 /ip4/10.11.69.20/tcp/40863 /ip4/172.17.0.1/tcp/40863 /i
p6/::1/tcp/42285]}
2020/06/04 18:02:52.992751 proxy.go:84: Get peer returned id , serviceAddr , and err No suitable peer found in cache
2020/06/04 18:02:52.992856 proxy.go:87: Finding best existing service instance
2020/06/04 18:02:52.993179 lca-manager.go:31: Finding providers for: QmeASHQ2gfNQvRWJXTeL87ft8Yu8QH14wCj6M4Ze9MawhS
2020/06/04 18:02:52.994745 proxy.go:93: Could not find, creating new service instance
2020/06/04 18:02:52.999095 lca-manager.go:135: Attempting to contact peer with pid: QmXYSnzsV6Tj2CmkEMpYN1bfGqScvLRRyf1MxZoq83GDk7
2020/06/04 18:02:54.492842 lca-manager.go:93: Error reading from buffer
2020/06/04 18:02:54.493571 lca-manager.go:135: Attempting to contact peer with pid: QmPqv37ukZLuVKfz5vBaH5KyMR9FCo8FuaRpXg7aKwcsgN
2020/06/04 18:02:57.989388 lca-manager.go:93: Error reading from buffer
2020/06/04 18:02:57.989459 proxy.go:96: No services able to be found or created
 Could not find peer to allocate service
2020/06/04 18:03:02.989634 proxy.go:84: Get peer returned id , serviceAddr , and err No suitable peer found in cache
2020/06/04 18:03:02.989679 proxy.go:87: Finding best existing service instance
2020/06/04 18:03:02.989695 lca-manager.go:31: Finding providers for: QmeASHQ2gfNQvRWJXTeL87ft8Yu8QH14wCj6M4Ze9MawhS
2020/06/04 18:03:02.992347 proxy.go:93: Could not find, creating new service instance
2020/06/04 18:03:02.995635 lca-manager.go:135: Attempting to contact peer with pid: QmXYSnzsV6Tj2CmkEMpYN1bfGqScvLRRyf1MxZoq83GDk7
2020/06/04 18:03:04.350815 lca-manager.go:93: Error reading from buffer
2020/06/04 18:03:04.350874 lca-manager.go:135: Attempting to contact peer with pid: QmPqv37ukZLuVKfz5vBaH5KyMR9FCo8FuaRpXg7aKwcsgN
2020/06/04 18:03:06.006191 lca-manager.go:93: Error reading from buffer
2020/06/04 18:03:06.006258 proxy.go:96: No services able to be found or created
 Could not find peer to allocate service
2020/06/04 18:03:11.006428 proxy.go:84: Get peer returned id , serviceAddr , and err No suitable peer found in cache
2020/06/04 18:03:11.006471 proxy.go:87: Finding best existing service instance
2020/06/04 18:03:11.006486 lca-manager.go:31: Finding providers for: QmeASHQ2gfNQvRWJXTeL87ft8Yu8QH14wCj6M4Ze9MawhS
2020/06/04 18:03:11.008485 proxy.go:93: Could not find, creating new service instance
2020/06/04 18:03:11.011418 lca-manager.go:135: Attempting to contact peer with pid: QmXYSnzsV6Tj2CmkEMpYN1bfGqScvLRRyf1MxZoq83GDk7
2020/06/04 18:03:12.294791 lca-manager.go:93: Error reading from buffer
2020/06/04 18:03:12.294845 lca-manager.go:135: Attempting to contact peer with pid: QmPqv37ukZLuVKfz5vBaH5KyMR9FCo8FuaRpXg7aKwcsgN
2020/06/04 18:03:13.980985 lca-manager.go:93: Error reading from buffer
2020/06/04 18:03:13.981364 proxy.go:96: No services able to be found or created
 Could not find peer to allocate service

The result is that multiple containers was created within a short timespan (in each node with an allocator):

8dbc4fbc883e        tlin/tlintest       "/bin/sh -c './proxy…"   31 seconds ago       Up 29 seconds                           epic_murdock
06ccdb5cd3fd        tlin/tlintest       "/bin/sh -c './proxy…"   51 seconds ago       Up 49 seconds                           strange_bell
73a7ddd02bed        tlin/tlintest       "/bin/sh -c './proxy…"   59 seconds ago       Up 57 seconds                           dazzling_mendeleev

A subsequent attempt seemed to have succeeded, but only after the second container was created:

2020/06/04 18:03:31.171961 proxy.go:46: Got request: hello-world-server
2020/06/04 18:03:31.172618 proxy.go:51: [ hello-world-server]
2020/06/04 18:03:31.172794 proxy.go:54: Looking for service with name hello-world-server in hash-lookup
2020/06/04 18:03:31.174696 hl-common.go:99: Connecting to: {Qmaq76Lt4oEiYEbkxwCb6CgKbbp9qw5eWTexsrm84D2hJW: [/ip4/127.0.0.1/tcp/40863 /ip4/10.11.69.20/tcp/40863 /ip4/172.17.0.1/tcp/40863 /ip6/::1/tcp/42285]}
2020/06/04 18:03:31.176715 proxy.go:84: Get peer returned id , serviceAddr , and err No suitable peer found in cache
2020/06/04 18:03:31.176741 proxy.go:87: Finding best existing service instance
2020/06/04 18:03:31.176753 lca-manager.go:31: Finding providers for: QmeASHQ2gfNQvRWJXTeL87ft8Yu8QH14wCj6M4Ze9MawhS
2020/06/04 18:03:31.178444 proxy.go:93: Could not find, creating new service instance
2020/06/04 18:03:31.181022 lca-manager.go:135: Attempting to contact peer with pid: QmXYSnzsV6Tj2CmkEMpYN1bfGqScvLRRyf1MxZoq83GDk7
2020/06/04 18:03:32.531505 lca-manager.go:93: Error reading from buffer
2020/06/04 18:03:32.531588 lca-manager.go:135: Attempting to contact peer with pid: QmPqv37ukZLuVKfz5vBaH5KyMR9FCo8FuaRpXg7aKwcsgN
2020/06/04 18:03:34.035554 lca-manager.go:99: New instance: 10.11.69.5:42407
2020/06/04 18:03:34.036030 proxy.go:110: Find/alloc service took: 2.859275791s
2020/06/04 18:03:34.036162 pcache.go:95: Adding new peer with ID QmPqv37ukZLuVKfz5vBaH5KyMR9FCo8FuaRpXg7aKwcsgN
2020/06/04 18:03:39.036467 pcache.go:142: Getting peer with ID QmPqv37ukZLuVKfz5vBaH5KyMR9FCo8FuaRpXg7aKwcsgN from pcache
2020/06/04 18:03:39.036531 proxy.go:84: Get peer returned id QmPqv37ukZLuVKfz5vBaH5KyMR9FCo8FuaRpXg7aKwcsgN, serviceAddr 10.11.69.5:42407, and err <nil>
2020/06/04 18:03:39.036547 proxy.go:125: Running request: http://10.11.69.5:42407/
2020/06/04 18:03:39.039378 proxy.go:141: Sending response back to requester

t-lin commented 4 years ago

Interesting... seems like the proxy processes are dying and becoming zombie processes.

$ ps faux | grep -i proxy
root     10455  0.0  0.0      0     0 pts/0    Z+   18:03   0:00  |       \_ [proxy] <defunct>
root     11249  0.7  0.0      0     0 pts/0    Z+   21:30   0:00  |       \_ [proxy] <defunct>
root     11331  0.5  0.0      0     0 pts/0    Z+   21:31   0:00  |       \_ [proxy] <defunct>
root     11401  2.1  0.0      0     0 pts/0    Z+   21:31   0:00          \_ [proxy] <defunct>
ubuntu   11431  0.0  0.0  14856  1024 pts/2    S+   21:31   0:00  |   \_ grep --color=auto -i proxy

This explains another oddity I've observed... the ping-monitor does not have metrics for any of the proxies.

t-lin commented 4 years ago

Proxy does not seem able to connect to peers:

2020/06/05 19:53:18.873145 p2pnode.go:186: Creating new p2p host
2020/06/05 19:53:19.235917 p2pnode.go:196: Setting stream handlers
2020/06/05 19:53:19.236008 p2pnode.go:206: Creating DHT
2020/06/05 19:53:19.236191 p2pnode.go:269: No bootstraps provided, not connecting to any peers
2020/06/05 19:53:19.236205 p2pnode.go:286: Creating Routing Discovery
2020/06/05 19:53:19.236555 p2pnode.go:297: Finished setting up libp2p Node with PID QmNMsU6e7s9AkUWR1PjnFBLcEUcEAJNKS6T6XNQCYLH8Tn and Multiaddresses [/ip4/127.0.0.1/tcp/44319 /ip4/10.11.69.11/tcp/44319 /ip4/172.17.0.1/tcp/44319 /ip6/::1/tcp/39187]
Unable to connect to any peers, retrying in 2 seconds...
Unable to connect to any peers, retrying in 1 seconds...
2020/06/05 19:53:21.237284 hl-common.go:84:
Unable to connect to any peers, retrying in 4 seconds...
...
Unable to connect to any peers, retrying in 1 seconds...
2020/06/05 19:53:25.238479 hl-common.go:84:
Unable to connect to any peers, retrying in 8 seconds...
...
Unable to connect to any peers, retrying in 1 seconds...
2020/06/05 19:53:33.240626 hl-common.go:84:
Unable to connect to any peers, retrying in 16 seconds...
...
Unable to connect to any peers, retrying in 1 seconds...
2020/06/05 19:53:49.244927 hl-common.go:84:
2020/06/05 19:53:49.246072 proxy.go:307: ERROR: Unable to create LCA Manager
hl-common: Failed to connect to any hash-lookup peers

t-lin commented 4 years ago

The current issue observed is caused by the regression reported in #39.

Others have reported this issue existed before the regression, so #39 will need to be fixed first, then this issue will be re-assessed to see if it still persists.

t-lin commented 4 years ago

The proxy issues in #39 have been resolved. With small images (e.g. the hello-world-server tests), the multiple container issue no longer exists.

However, multiple containers still do get created in the event that the container takes a while to fully start. This has been simulated by simply adding a sleep 60 in front of the container's CMD. Keeping this issue open for now (will rename to remove proxy as root-cause).

t-lin commented 4 years ago

Recent bug fixes (#39 then #40) seems to have solved this issue.

Even with the sleep 60, the allocator returns the appropriate endpoint info back to the proxy. When it attempts to connect to the endpoint, it'll fail (due to the service not being up yet) as expected, but will not create extra duplicate containers. Thus, closing this issue.

t-lin commented 4 years ago

This issue appears to be back with the HTTP-over-P2P proxy. Likely root-cause is the fact that we currently have no way to tell when a container is "ready", thus we fail and keep re-trying, and hence keep booting new containers.

Will need to investigate and figure out a fix (or temporary work-around until we have a "ready" mechanism similar to what K8s offers).

PhysarumSM / service-manager

Multiple containers spawned #34