fnproject / flow

Fn Flow Server
Apache License 2.0

Need to limit DB connections at the persistence layer, especially during reload/replay #104

Open jan-g opened 6 years ago

jan-g commented 6 years ago

I have a DB with a few thousand active graphs in it.

During replay at a shard restart, I'm seeing the following:

time="2017-11-08T15:57:39Z" level=info msg="Created graph actor supervisor-0/0246f0d7-da44-4de1-97e9-30afd2d6d998" graph_id=0246f0d7-da44-4de1-97e9-30afd2d6d998 logger=supervisor-0
time="2017-11-08T15:57:39Z" level=info msg="Created graph actor supervisor-0/35203cfc-7c0e-4216-847e-7bcc1869fa63" graph_id=35203cfc-7c0e-4216-847e-7bcc1869fa63 logger=supervisor-0
time="2017-11-08T15:57:39Z" level=error msg="Error getting snapshot value from DB Error 1040: Too many connections" actor_name=supervisor-0/7a9f6777-ce90-4dc8-89c3-3846e83b0fac logger=persistence
time="2017-11-08T15:57:39Z" level=error msg="Error getting snapshot value from DB Error 1040: Too many connections" actor_name=supervisor-0/71c3eba2-43f0-41e6-a9a5-1ef836b74513 logger=persistence
time="2017-11-08T15:57:39Z" level=error msg="Error getting snapshot value from DB Error 1040: Too many connections" actor_name=supervisor-0/f4602309-139e-4fc5-b226-6f33a7528346 logger=persistence

Error 1040 indicates too many concurrent connections to the DB. It looks like we're opening these connections in parallel during replay.
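
One likely mitigation is to cap the pool at the database/sql level, so that concurrent snapshot/event reads queue up inside the driver instead of exhausting the server's max_connections. A minimal sketch, assuming the persistence provider holds a *sql.DB and uses the MySQL driver (the function name, DSN and limits below are illustrative, not the actual flow code):

package persistence

import (
	"database/sql"
	"time"

	_ "github.com/go-sql-driver/mysql" // registers the "mysql" driver
)

// openLimitedDB opens the backing database with a hard cap on open
// connections, so parallel snapshot/event reads during replay queue up
// inside database/sql instead of tripping the server's connection limit.
func openLimitedDB(dsn string, maxConns int) (*sql.DB, error) {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(maxConns)           // upper bound on concurrent connections
	db.SetMaxIdleConns(maxConns / 2)       // keep some connections warm for steady state
	db.SetConnMaxLifetime(5 * time.Minute) // recycle connections periodically
	return db, nil
}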

jan-g commented 6 years ago

That was followed shortly afterwards (immediately after the initial replay) by:

2017/11/08 15:57:39 [MAILBOX] [ACTOR] Recovering actor="nonhost/supervisor-0/7a9f6777-ce90-4dc8-89c3-3846e83b0fac" reason="Error 1040: Too many connections" stacktrace="github.com/fnproject/flow/persistence.(*sqlProvider).GetEvents:110"
time="2017-11-08T15:57:39Z" level=warning msg="Stopping failed graph actor due to error: Error 1040: Too many connections" logger=graphmanager_actor
2017/11/08 15:57:39 [ACTOR] [SUPERVISION] time="2017-11-08T15:57:39Z" level=error msg="Error getting events value from DB " actor_name=supervisor-0/2a3dada2-acf3-40cb-94dc-736d02bfb624 error="Error 1040: Too many connections" logger=persistence
actor="nonhost/supervisor-0/7a9f6777-ce90-4dc8-89c3-3846e83b0fac" time="2017-11-08T15:57:39Z" level=error msg="Error getting events value from DB " actor_name=supervisor-0/a91835ca-51fe-4142-b519-7d3ae3f341d3 error="Error 1040: Too many connections" logger=persistence 
directive="StopDirective" reason="Error 1040: Too many connections"2017/11/08 15:57:39 [MAILBOX]  [ACTOR] Recovering
 actor="nonhost/supervisor-0/2a3dada2-acf3-40cb-94dc-736d02bfb624" reason="Error 1040: Too many connections" stacktrace="github.com/fnproject/flow/persistence.(*sqlProvider).GetEvents:110"
time="2017-11-08T15:57:39Z" level=error msg="Error getting events value from DB " actor_name=supervisor-0/088b4b09-6f1b-4bb0-a543-1a52c67920ca error="Error 1040: Too many connections" logger=persistence
2017/11/08 15:57:39 [MAILBOX] [ACTOR] Recovering time="2017-11-08T15:57:39Z" level=error msg="Error getting events value from DB " actor_name=supervisor-0/71c3eba2-43f0-41e6-a9a5-1ef836b74513 error="Error 1040: Too many connections" logger=persistence
actor="nonhost/supervisor-0/a91835ca-51fe-4142-b519-7d3ae3f341d3" reason="Error 1040: Too many connections" time="2017-11-08T15:57:39Z" level=warning msg="Stopping failed graph actor due to error: Error 1040: Too many connections" logger=graphmanager_actor
2017/11/08 15:57:39 [ACTOR] [SUPERVISION] stacktrace="github.com/fnproject/flow/persistence.(*sqlProvider).GetEvents:110" 2017/11/08 15:57:39 [MAILBOX]

After startup the flow service stabilises and normal operation is possible.

jan-g commented 6 years ago

Complete trace from one graph. (The actor never ends up being materialised in this case.)

time="2017-11-08T15:57:39Z" level=info msg="Created graph actor supervisor-0/8985ff69-d5b8-4a47-a288-127e07847ef0" graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=supervisor-0 
time="2017-11-08T15:57:39Z" level=error msg="Error getting snapshot value from DB Error 1040: Too many connections" actor_name=supervisor-0/8985ff69-d5b8-4a47-a288-127e07847ef0 logger=persistence 
 2017/11/08 15:57:39 [ACTOR] [SUPERVISION]reason="Error 1040: Too many connections"  time="2017-11-08T15:57:39Z" level=error msg="Error getting events value from DB " actor_name=supervisor-0/8985ff69-d5b8-4a47-a288-127e07847ef0 error="Error 1040: Too many connections" logger=persistence 
actor="nonhost/supervisor-0/8985ff69-d5b8-4a47-a288-127e07847ef0" reason="Error 1040: Too many connections" stacktrace="github.com/fnproject/flow/persistence.(*sqlProvider).GetEvents:110" 
 2017/11/08 15:57:39 [ACTOR] [SUPERVISION] reason="Error 1040: Too many connections" actor="nonhost/supervisor-0/8985ff69-d5b8-4a47-a288-127e07847ef0" stacktrace="github.com/fnproject/flow/persistence.(*sqlProvider).GetEvents:110" directive="StopDirective"
time="2017-11-08T15:57:39Z" level=info msg="Graph actor terminated" graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=supervisor-0 
time="2017-11-08T15:57:39Z" level=warning msg="Graph actor crashed" graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=supervisor-0

jan-g commented 6 years ago

Curling for that graph's details after startup pulls it back in later on, as you might expect:

curl http://10.0.20.4:8081/graph/8985ff69-d5b8-4a47-a288-127e07847ef0
{"stages":{"0":{"type":"terminationHook","status":"pending","dependencies":["_termination"]},"1":{"type":"completedValue","status":"successful"}},"function_id":"t/test","graph_id":"8985ff69-d5b8-4a47-a288-127e07847ef0"}

(Note: this graph was only partially constructed and not committed earlier.)

The kubectl logs from the flow service:

time="2017-11-08T16:35:30Z" level=debug msg="Got shard 0 for graph 8985ff69-d5b8-4a47-a288-127e07847ef0" logger=sharding 
time="2017-11-08T16:35:30Z" level=info msg="Resolved shard" cluster_shard=0 graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=cluster 
time="2017-11-08T16:35:30Z" level=info msg="Resolved node" cluster_node=flow-service-0 graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=cluster 
time="2017-11-08T16:35:30Z" level=info msg="Processing request locally" logger=cluster 
time="2017-11-08T16:35:30Z" level=debug msg="Getting graph stage" logger=graphmanager_actor 
time="2017-11-08T16:35:30Z" level=debug msg="Got shard 0 for graph 8985ff69-d5b8-4a47-a288-127e07847ef0" logger=sharding 
time="2017-11-08T16:35:30Z" level=info msg="Created graph actor supervisor-0/8985ff69-d5b8-4a47-a288-127e07847ef0" graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=supervisor-0 
time="2017-11-08T16:35:30Z" level=info msg="Adding termination hook" logger=graph stage_id=0 
time="2017-11-08T16:35:30Z" level=info msg="Adding stage to graph" function_id=t/test graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=graph op=completedValue stage=1 
time="2017-11-08T16:35:30Z" level=info msg="Completing node" function_id=t/test graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=graph stage_id=1 success=true 
time="2017-11-08T16:35:30Z" level=info msg="Retrieved stages from storage" function_id=t/test graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=graph 
[GIN] 2017/11/08 - 16:35:30 | 200 |  543.781069ms |      10.99.24.0 | GET      /graph/8985ff69-d5b8-4a47-a288-127e07847ef0

jan-g commented 6 years ago

Here's a graph that was created during a flow load-test run. Again, the persistence layer is operating in parallel and we're bumping up against the DB connection limit (set to 500 in this deployment). A sketch of one way to bound that parallelism follows the log below.

time="2017-11-08T16:39:50Z" level=info msg="Generated new graph ID 64df9229-de84-4416-b60e-808a36141c0c" logger=server 
time="2017-11-08T16:39:50Z" level=debug msg="Got shard 0 for graph 64df9229-de84-4416-b60e-808a36141c0c" logger=sharding 
time="2017-11-08T16:39:50Z" level=info msg="Resolved shard" cluster_shard=0 graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=cluster 
time="2017-11-08T16:39:50Z" level=info msg="Resolved node" cluster_node=flow-service-0 graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=cluster 
time="2017-11-08T16:39:50Z" level=debug msg="Creating graph" graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=graphmanager_actor 
time="2017-11-08T16:39:50Z" level=debug msg="Got shard 0 for graph 64df9229-de84-4416-b60e-808a36141c0c" logger=sharding 
time="2017-11-08T16:39:51Z" level=info msg="Created graph actor supervisor-0/64df9229-de84-4416-b60e-808a36141c0c" graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=supervisor-0 
time="2017-11-08T16:40:30Z" level=debug msg="Got shard 0 for graph 64df9229-de84-4416-b60e-808a36141c0c" logger=sharding 
time="2017-11-08T16:40:30Z" level=info msg="Resolved shard" cluster_shard=0 graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=cluster 
time="2017-11-08T16:40:30Z" level=info msg="Resolved node" cluster_node=flow-service-0 graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=cluster 
time="2017-11-08T16:40:30Z" level=debug msg="Got shard 0 for graph 64df9229-de84-4416-b60e-808a36141c0c" logger=sharding 
[GIN] 2017/11/08 - 16:40:34 | 200 |  3.647872425s |      10.99.24.0 | POST     /graph/64df9229-de84-4416-b60e-808a36141c0c/terminationHook
time="2017-11-08T16:40:34Z" level=debug msg="Got shard 0 for graph 64df9229-de84-4416-b60e-808a36141c0c" logger=sharding 
time="2017-11-08T16:40:34Z" level=info msg="Resolved shard" cluster_shard=0 graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=cluster 
time="2017-11-08T16:40:34Z" level=info msg="Resolved node" cluster_node=flow-service-0 graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=cluster 
time="2017-11-08T16:40:34Z" level=debug msg="Got shard 0 for graph 64df9229-de84-4416-b60e-808a36141c0c" logger=sharding 
time="2017-11-08T16:40:40Z" level=info msg="Adding stage to graph" function_id=t/test graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=graph op=completedValue stage=1 
time="2017-11-08T16:40:40Z" level=info msg="Completing node" function_id=t/test graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=graph stage_id=1 success=true 
[GIN] 2017/11/08 - 16:40:40 | 200 |  5.511962778s |      10.99.24.0 | POST     /graph/64df9229-de84-4416-b60e-808a36141c0c/completedValue
time="2017-11-08T16:40:40Z" level=debug msg="Got shard 0 for graph 64df9229-de84-4416-b60e-808a36141c0c" logger=sharding 
time="2017-11-08T16:40:40Z" level=info msg="Resolved shard" cluster_shard=0 graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=cluster 
time="2017-11-08T16:40:40Z" level=info msg="Resolved node" cluster_node=flow-service-0 graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=cluster 
time="2017-11-08T16:40:40Z" level=debug msg="Got shard 0 for graph 64df9229-de84-4416-b60e-808a36141c0c" logger=sharding 
2017/11/08 16:40:40 [ACTOR] [SUPERVISION] 2017/11/08 16:40:40 [MAILBOX] [ACTOR] Recovering actor="nonhost/supervisor-0/64df9229-de84-4416-b60e-808a36141c0c" reason="Error 1040: Too many connections" actor="nonhost/supervisor-0/27b6e098-aa7d-43b8-882a-8768af8461c0" stacktrace="github.com/fnproject/flow/persistence.(*sqlProvider).PersistEvent:145" 
2017/11/08 16:40:40 [ACTOR] [SUPERVISION] actor="nonhost/supervisor-0/64df9229-de84-4416-b60e-808a36141c0c" directive="StopDirective" reason="Error 1040: Too many connections" 
time="2017-11-08T16:40:40Z" level=info msg="Graph actor terminated" graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=supervisor-0 
time="2017-11-08T16:40:40Z" level=warning msg="Graph actor crashed" graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=supervisor-0 
[GIN] 2017/11/08 - 16:41:40 | 500 | 1m0.002378519s |      10.99.24.0 | POST     /graph/64df9229-de84-4416-b60e-808a36141c0c/invokeFunction?functionId=.%2Fsleep
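
Capping the sql pool alone still leaves every replaying actor blocked on the same pool, so another option is to bound how many graph actors recover from persistence at once. A rough sketch using a counting semaphore (all names here are hypothetical, not existing flow APIs):

// replayThrottle bounds how many graph actors may hit persistence at once
// during replay. A buffered channel acts as a counting semaphore: each
// recovery acquires a slot before reading from the DB and releases it when
// done. All names are illustrative only.
type replayThrottle struct {
	slots chan struct{}
}

func newReplayThrottle(maxParallel int) *replayThrottle {
	return &replayThrottle{slots: make(chan struct{}, maxParallel)}
}

// do runs recoverFn with at most maxParallel invocations in flight.
func (t *replayThrottle) do(recoverFn func() error) error {
	t.slots <- struct{}{}        // block until a slot frees up
	defer func() { <-t.slots }() // release the slot when recovery finishes
	return recoverFn()
}

Wrapping the per-actor snapshot/event reads in something like this would let replay proceed at a controlled rate while leaving normal request-path reads untouched.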