digital-asset / daml

The Daml smart contract language
https://www.digitalasset.com/developers

MAXIMUM_NUMBER_OF_STREAMS thrown even though nr of active clients is low #15551

Closed GeorgSchneider closed 1 year ago

GeorgSchneider commented 2 years ago

Affected Daml version

2.5.0-snapshot.20221107.10906.0.346bb489 (possibly earlier ones as well, but I've not seen it with 2.5.0-snapshot.20221010.10736.0.2f453a14)

Bug description

After running and using the daml-finance-app for a while I'm getting errors relating to websocket streams:

11-11-2022 14:55:32.225 [http-json-ledger-api-akka.actor.default-dispatcher-20] INFO  com.daml.http.WebSocketService - Websocket client interrupted on Failure: ABORTED: MAXIMUM_NUMBER_OF_STREAMS(2,0): The number of streams in use (1000) has reached or exceeded the limit (1000). Metrics are available at daml.lapi.streams.active.. remaining number of clients: 39, context: {instance_uuid: "36983077-5bda-45cf-85b7-81081f8a825e", request_id: "cfdd9502-2cc9-4c85-86fd-09b95a83fbac"} 

I'm also occasionally seeing this one, but not sure it's related:

11-11-2022 14:58:35.623 [http-json-ledger-api-akka.actor.default-dispatcher-13] ERROR akka.actor.ActorSystemImpl - Websocket handler failed with ABORTED: THREADPOOL_OVERLOADED(2,0): The Index Database Connection Threadpool queue size (1005) has exceeded the maximum (1000). Api services metrics are available at daml.index.db.threadpool.connection.api-server. 
io.grpc.StatusRuntimeException: ABORTED: THREADPOOL_OVERLOADED(2,0): The Index Database Connection Threadpool queue size (1005) has exceeded the maximum (1000). Api services metrics are available at daml.index.db.threadpool.connection.api-server.
        at io.grpc.Status.asRuntimeException(Status.java:535)
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:479)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

The app becomes unresponsive after the first error appears, so it seems this is not just incorrect reporting of the number of streams.

I'm not sure if this was already present in a previous snapshot; I've only seen it happen on the latest (version above).

To reproduce

As mentioned, this seems to be related to some recent changes, as I've never seen this on older snapshots or on <= 2.4.x.

S11001001 commented 1 year ago

@GeorgSchneider This error is passing through from the participant. Have you seen whether JSON API has >=1000 active TCP connections to the participant when this occurs?

Your log message implies that JSON API is effectively managing 39 connections at the time of error; either the other 961 have been forgotten but still exist, or they really don't exist and have been miscounted elsewhere.

GeorgSchneider commented 1 year ago

@S11001001 do you have a handy command I can use to see which process has TCP connections open? I will run it next time I encounter this.

S11001001 commented 1 year ago

@GeorgSchneider I think you can do netstat -p tcp | grep $PORT, where PORT is your participant server port. ESTABLISHED connections are active. CLOSE_WAIT connections are also interesting; if there are very many of them, they are a sign of a different problem worth knowing about.

I'm not sure how to filter down to what JSON API is using, but I assume you have nothing else connecting to the participant in your case.
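To summarise what that netstat output shows per state, a small pipeline like the following can help. This is a sketch: the port (6865, the conventional Ledger API port) and the inlined $sample are assumptions for illustration; against a live system you would pipe real netstat -tnp output instead of $sample.

```shell
# Summarise TCP connection states to the participant port.
# $sample stands in for live `netstat -tnp` output; PORT=6865 is assumed.
PORT=6865
sample='tcp 0 0 127.0.0.1:38040 127.0.0.1:6865 ESTABLISHED 32593/java
tcp 0 0 127.0.0.1:38080 127.0.0.1:6865 ESTABLISHED 428/java
tcp 0 0 127.0.0.1:38120 127.0.0.1:6865 CLOSE_WAIT 428/java'
# Keep only lines whose foreign address is the participant port,
# then count occurrences of each connection state (column 6).
printf '%s\n' "$sample" | grep ":$PORT " | awk '{print $6}' | sort | uniq -c
```

A large CLOSE_WAIT count here would point at the different problem mentioned above; a large ESTABLISHED count would match the stream-limit theory.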

GeorgSchneider commented 1 year ago

It does not seem to have excessive TCP connections open. Anything else I can check?

Output from netstat -p tcp:

Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 localhost:41298         localhost:33129         ESTABLISHED 1410/node           
tcp        0      0 localhost:48502         localhost:33129         ESTABLISHED 30647/node          
tcp        0      0 localhost:33129         localhost:48500         ESTABLISHED 1376/node           
tcp        0      0 localhost:41300         localhost:33129         ESTABLISHED 1439/node           
tcp        0      0 localhost:33129         localhost:41300         ESTABLISHED 1523/node           
tcp        0      0 localhost:33129         localhost:48502         ESTABLISHED 30654/node          
tcp        0      0 localhost:3000          localhost:49064         ESTABLISHED 858/node            
tcp        0      0 localhost:49064         localhost:3000          ESTABLISHED -                   
tcp        0      0 localhost:48500         localhost:33129         ESTABLISHED 30638/node          
tcp        0      0 localhost:33129         localhost:41298         ESTABLISHED 1376/node           
tcp6       0      0 localhost:60162         localhost:6867          ESTABLISHED 32593/java          
tcp6       0      0 localhost:6867          localhost:60160         ESTABLISHED 32593/java          
tcp6       0      0 localhost:38040         localhost:6865          ESTABLISHED 32593/java          
tcp6       0      0 localhost:6865          localhost:38040         ESTABLISHED 32593/java          
tcp6       0      0 localhost:60190         localhost:6867          ESTABLISHED 32593/java          
tcp6       0      0 localhost:6867          localhost:60164         ESTABLISHED 32593/java          
tcp6       0      0 localhost:60166         localhost:6867          ESTABLISHED 32593/java          
tcp6       0      0 localhost:54090         localhost:35611         ESTABLISHED 1579/ld-linux-x86-6 
tcp6       0      0 localhost:38080         localhost:6865          ESTABLISHED 428/java            
tcp6       0      0 localhost:6865          localhost:38080         ESTABLISHED 32593/java          
tcp6       0      0 localhost:60164         localhost:6867          ESTABLISHED 32593/java          
tcp6       0      0 localhost:6867          localhost:60162         ESTABLISHED 32593/java          
tcp6       0      0 localhost:60188         localhost:6867          ESTABLISHED 32593/java          
tcp6       0      0 localhost:38120         localhost:6865          ESTABLISHED 428/java            
tcp6       0      0 localhost:6867          localhost:60190         ESTABLISHED 32593/java          
tcp6       0      0 localhost:42507         localhost:36406         ESTABLISHED 31205/java          
tcp6       0      0 localhost:60160         localhost:6867          ESTABLISHED 32593/java          
tcp6       0      0 localhost:6867          localhost:60188         ESTABLISHED 32593/java          
tcp6       0      0 localhost:35611         localhost:54090         ESTABLISHED 1647/java           
tcp6       0      0 localhost:36406         localhost:42507         ESTABLISHED 30710/ld-linux-x86- 
tcp6       0      0 localhost:6865          localhost:38120         ESTABLISHED 32593/java          
tcp6       0      0 localhost:38038         localhost:6865          ESTABLISHED 32593/java          
tcp6       0      0 localhost:6867          localhost:60166         ESTABLISHED 32593/java          
tcp6       0      0 localhost:6865          localhost:38038         ESTABLISHED 32593/java          
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags       Type       State         I-Node   PID/Program name     Path
unix  3      [ ]         STREAM     CONNECTED     78344    30654/node           
unix  3      [ ]         STREAM     CONNECTED     74380    30654/node           
unix  2      [ ]         STREAM     CONNECTED     83260    32593/java           
unix  3      [ ]         STREAM     CONNECTED     74407    30654/node           
unix  2      [ ]         STREAM     CONNECTED     82245    30043/git            
unix  3      [ ]         STREAM     CONNECTED     78353    30665/node           
unix  3      [ ]         STREAM     CONNECTED     48421    19789/node           
unix  3      [ ]         STREAM     CONNECTED     78345    1376/node            
unix  3      [ ]         STREAM     CONNECTED     78348    30654/node           
unix  3      [ ]         STREAM     CONNECTED     74383    30699/ld-linux-x86-  
unix  3      [ ]         STREAM     CONNECTED     78352    1376/node            
unix  3      [ ]         STREAM     CONNECTED     74382    30654/node           
unix  2      [ ]         STREAM     CONNECTED     83093    31205/java           
unix  2      [ ]         STREAM     CONNECTED     74617    31205/java           
unix  2      [ ]         STREAM     CONNECTED     31791    5569/git             
unix  3      [ ]         STREAM     CONNECTED     2719     1568/ld-linux-x86-6  
unix  3      [ ]         STREAM     CONNECTED     74406    30725/git            
unix  3      [ ]         STREAM     CONNECTED     2717     1568/ld-linux-x86-6  
unix  2      [ ]         STREAM     CONNECTED     31793    5569/git             
unix  2      [ ]         STREAM     CONNECTED     78625    32593/java           
unix  3      [ ]         STREAM     CONNECTED     48420    1376/node            
unix  3      [ ]         STREAM     CONNECTED     78347    1376/node            
unix  3      [ ]         STREAM     CONNECTED     74408    30725/git            
unix  3      [ ]         STREAM     CONNECTED     74405    30654/node           
unix  3      [ ]         STREAM     CONNECTED     2716     1523/node            
unix  3      [ ]         STREAM     CONNECTED     2715     1568/ld-linux-x86-6  
unix  3      [ ]         STREAM     CONNECTED     78343    1376/node            
unix  3      [ ]         STREAM     CONNECTED     2718     1523/node            
unix  3      [ ]         STREAM     CONNECTED     78349    1376/node            
unix  3      [ ]         STREAM     CONNECTED     2714     1523/node            
unix  3      [ ]         STREAM     CONNECTED     78346    30654/node           
unix  3      [ ]         STREAM     CONNECTED     74381    30699/ld-linux-x86-  
unix  2      [ ]         STREAM     CONNECTED     59832    23143/git            
unix  3      [ ]         STREAM     CONNECTED     74385    30699/ld-linux-x86-  
unix  3      [ ]         STREAM     CONNECTED     78350    30654/node           
unix  3      [ ]         STREAM     CONNECTED     74384    30654/node           
unix  2      [ ]         STREAM     CONNECTED     82243    30043/git            
unix  2      [ ]         STREAM     CONNECTED     59834    23143/git            
unix  3      [ ]         STREAM     CONNECTED     20747    1376/node            
unix  3      [ ]         STREAM     CONNECTED     739      1523/node            
unix  3      [ ]         STREAM     CONNECTED     98899    -                    
unix  3      [ ]         STREAM     CONNECTED     20748    1523/node            
unix  2      [ ]         STREAM     CONNECTED     709      1647/java            
unix  3      [ ]         STREAM     CONNECTED     98902    -                    
unix  3      [ ]         STREAM     CONNECTED     20745    1376/node            
unix  3      [ ]         STREAM     CONNECTED     20746    1523/node            
unix  3      [ ]         STREAM     CONNECTED     39090    1523/node            
unix  3      [ ]         STREAM     CONNECTED     19454    1989/node            
unix  3      [ ]         STREAM     CONNECTED     20744    1523/node            
unix  3      [ ]         STREAM     CONNECTED     98900    -                    
unix  3      [ ]         STREAM     CONNECTED     740      1854/git             
unix  3      [ ]         STREAM     CONNECTED     20742    1523/node            
unix  3      [ ]         STREAM     CONNECTED     23554    1989/node            
unix  3      [ ]         STREAM     CONNECTED     20743    1376/node            
unix  3      [ ]         STREAM     CONNECTED     39092    1523/node            
unix  3      [ ]         STREAM     CONNECTED     85502    858/node             
unix  3      [ ]         STREAM     CONNECTED     19453    1523/node            
unix  3      [ ]         STREAM     CONNECTED     20741    1376/node            
unix  3      [ ]         STREAM     CONNECTED     39091    9311/node            
unix  3      [ ]         STREAM     CONNECTED     19455    1523/node            
unix  3      [ ]         STREAM     CONNECTED     742      1854/git             
unix  3      [ ]         STREAM     CONNECTED     19262    1662/node            
unix  3      [ ]         STREAM     CONNECTED     20718    1376/node            
unix  3      [ ]         STREAM     CONNECTED     741      1523/node            
unix  3      [ ]         STREAM     CONNECTED     23555    1523/node            
unix  3      [ ]         STREAM     CONNECTED     39096    1523/node            
unix  3      [ ]         STREAM     CONNECTED     19456    1989/node            
unix  3      [ ]         STREAM     CONNECTED     20719    1387/node            
unix  3      [ ]         STREAM     CONNECTED     39097    9311/node            
unix  3      [ ]         STREAM     CONNECTED     19261    1523/node            
unix  3      [ ]         STREAM     CONNECTED     39095    9311/node            
unix  3      [ ]         STREAM     CONNECTED     19258    1662/node            
unix  2      [ ]         STREAM     CONNECTED     20851    2102/git             
unix  3      [ ]         STREAM     CONNECTED     19257    1523/node            
unix  3      [ ]         STREAM     CONNECTED     39094    1523/node            
unix  3      [ ]         STREAM     CONNECTED     19260    1662/node            
unix  3      [ ]         STREAM     CONNECTED     85503    874/node             
unix  3      [ ]         STREAM     CONNECTED     98901    -                    
unix  3      [ ]         STREAM     CONNECTED     85506    875/node             
unix  2      [ ]         STREAM     CONNECTED     81798    428/java             
unix  3      [ ]         STREAM     CONNECTED     19259    1523/node            
unix  3      [ ]         STREAM     CONNECTED     19255    1523/node            
unix  2      [ ]         STREAM     CONNECTED     85356    428/java             
unix  3      [ ]         STREAM     CONNECTED     98897    -                    
unix  3      [ ]         STREAM     CONNECTED     19256    1662/node            
unix  3      [ ]         STREAM     CONNECTED     85505    858/node             
unix  3      [ ]         STREAM     CONNECTED     39093    9311/node            
unix  3      [ ]         STREAM     CONNECTED     23556    1989/node            
unix  3      [ ]         STREAM     CONNECTED     23553    1523/node            
unix  3      [ ]         STREAM     CONNECTED     98898    -                    
unix  2      [ ]         STREAM     CONNECTED     20853    2102/git             
unix  2      [ ]         STREAM     CONNECTED     695      1647/java 

simonmaxen-da commented 1 year ago

@GeorgSchneider / @S11001001 - I'll take a look into this.

mziolekda commented 1 year ago

I would have two questions:

GeorgSchneider commented 1 year ago

@mziolekda :

Note that the error seems to be thrown by the JSON API, and related to the websocket connections. But the number of active websocket connections being reported is usually <50 when this happens.

simonmaxen-da commented 1 year ago

I'm able to reproduce this problem. No creation of any contracts is required; simply start the application and then navigate around.

Lots of websocket connection churn is reported as we go:

25-11-2022 15:53:55.488 [http-json-ledger-api-akka.actor.default-dispatcher-6] INFO  com.daml.http.WebSocketService - New websocket client has connected, current number of clients:78, context: {instance_uuid: "f48fc230-1bce-464f-bb7a-044068ffcd7e", request_id: "8041db02-45d0-4e5c-b0a1-f870b4343e85"}
25-11-2022 15:53:57.937 [http-json-ledger-api-akka.actor.default-dispatcher-6] INFO  com.daml.http.WebsocketEndpoints - GOT daml.ws.auth, context: {instance_uuid: "f48fc230-1bce-464f-bb7a-044068ffcd7e", request_id: "e8efd627-2e95-470c-8253-095362709bce"}
25-11-2022 15:53:57.947 [http-json-ledger-api-akka.actor.default-dispatcher-6] INFO  com.daml.http.WebSocketService - New websocket client has connected, current number of clients:79, context: {instance_uuid: "f48fc230-1bce-464f-bb7a-044068ffcd7e", request_id: "e8efd627-2e95-470c-8253-095362709bce"}
25-11-2022 15:53:58.153 [http-json-ledger-api-akka.actor.default-dispatcher-8] INFO  com.daml.http.WebsocketEndpoints - GOT daml.ws.auth, context: {instance_uuid: "f48fc230-1bce-464f-bb7a-044068ffcd7e", request_id: "3ed21523-affb-41e7-87b8-d41e4a31b681"}
25-11-2022 15:53:58.156 [http-json-ledger-api-akka.actor.default-dispatcher-8] INFO  com.daml.http.WebSocketService - New websocket client has connected, current number of clients:80, context: {instance_uuid: "f48fc230-1bce-464f-bb7a-044068ffcd7e", request_id: "3ed21523-affb-41e7-87b8-d41e4a31b681"}
25-11-2022 15:54:07.196 [http-json-ledger-api-akka.actor.default-dispatcher-11] INFO  com.daml.http.WebSocketService - Websocket client has disconnected. Current number of clients: 79, context: {instance_uuid: "f48fc230-1bce-464f-bb7a-044068ffcd7e", request_id: "d967dd70-b968-4da4-b72d-e2d184718173"}
25-11-2022 15:54:07.216 [http-json-ledger-api-akka.actor.default-dispatcher-13] INFO  com.daml.http.WebSocketService - Websocket client has disconnected. Current number of clients: 78, context: {instance_uuid: "f48fc230-1bce-464f-bb7a-044068ffcd7e", request_id: "6542975e-01b2-48ca-aa5c-e89b667d6aa3"}

If Prometheus metrics are enabled with Canton, the sandbox_daml_lapi_streams_active metric can be seen to continually increase.

$ curl -s localhost:6200/metrics | grep -v '^#' | sort | grep daml_lapi_streams_active
sandbox_daml_lapi_streams_active 206.0

I thought this could be a leak in the interceptor that was failing to clean up closed connections. However, when the http-json process is terminated, the metric count returns to the base level:

curl -s localhost:6200/metrics | grep -v '^#' | sort | grep daml_lapi_streams_active
sandbox_daml_lapi_streams_active 4.0
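For monitoring this while reproducing, the gauge can be parsed out of the scrape and compared against the 1000-stream default limit. A sketch: the inlined $metrics line stands in for live `curl -s localhost:6200/metrics` output (port 6200 as in the example above), and the 1000 limit is the default quoted in the error message.

```shell
# Extract the active-streams gauge from a Prometheus scrape and report
# remaining headroom against the default 1000-stream limit.
metrics='sandbox_daml_lapi_streams_active 206.0'
active=$(printf '%s\n' "$metrics" | awk '/daml_lapi_streams_active/ {printf "%d", $2}')
echo "active=$active headroom=$((1000 - active))"
```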

simonmaxen-da commented 1 year ago

Lots of messages saying the tracker is shutting down due to inactivity:

INFO  c.d.p.a.s.t.TrackerMap - Shutting down tracker for Key(daml-script,Set(Exchange-Default::1220abbd7514ccc56e231d4eb9901abb888840a7e659f52d495cd6b1b8b62be685a4)) after inactivity of PT5M, context: {participant: "sandbox"}
INFO  c.d.p.a.s.t.TrackerMap - Shutting down tracker for Key(daml-script,Set(Investor2-Corporate Actions::1220abbd7514ccc56e231d4eb9901abb888840a7e659f52d495cd6b1b8b62be685a4)) after inactivity of PT5M, context: {participant: "sandbox"}
INFO  c.d.p.a.s.t.TrackerMap - Shutting down tracker for Key(daml-script,Set(Investor3-Structured Notes::1220abbd7514ccc56e231d4eb9901abb888840a7e659f52d495cd6b1b8b62be685a4)) after inactivity of PT5M, context: {participant: "sandbox"}
INFO  c.d.p.a.s.t.TrackerMap - Shutting down tracker for Key(daml-script,Set(CentralBank-Corporate Actions::1220abbd7514ccc56e231d4eb9901abb888840a7e659f52d495cd6b1b8b62be685a4)) after inactivity of PT5M, context: {participant: "sandbox"}
INFO  c.d.p.a.s.t.TrackerMap - Shutting down tracker for Key(daml-script,Set(Seller-Natural Gas::1220abbd7514ccc56e231d4eb9901abb888840a7e659f52d495cd6b1b8b62be685a4, Buyer-Natural Gas::1220abbd7514ccc56e231d4eb9901abb888840a7e659f52d495cd6b1b8b62be685a4)) after inactivity of PT5M, context: {participant: "sandbox"}
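A quick way to gauge whether tracker key combinations are mounting up is to count distinct keys in these shutdown messages. A sketch: the inlined $log excerpt and the shortened ::1220ab party hashes are illustrative; against a real deployment you would grep the canton log file instead.

```shell
# Count distinct tracker keys in shutdown messages (inlined sample excerpt).
log='Shutting down tracker for Key(daml-script,Set(Exchange-Default::1220ab))
Shutting down tracker for Key(daml-script,Set(Investor2-Corporate Actions::1220ab))
Shutting down tracker for Key(daml-script,Set(Exchange-Default::1220ab))'
# -o prints only the matched key prefix; sort -u dedupes repeated keys.
printf '%s\n' "$log" | grep -o 'Key([^)]*' | sort -u | wc -l
```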

simonmaxen-da commented 1 year ago

@mziolekda - as you suspected, the tracker key combinations do mount up. key.txt

GeorgSchneider commented 1 year ago

Great that you could reproduce it.

We do use an aggressive websocket recycling setting: https://github.com/digital-asset/daml-finance-app/blob/34dcfc2d25d3825878afda77a3e5a5d6308758a8/json-api.conf#L13

Otherwise, when I keep switching users during a demo the active websocket connections quickly approach 500, at which point things break down.

simonmaxen-da commented 1 year ago

@GeorgSchneider - Adding this to the canton config will not fix the underlying issue, but it may buy you a bit more breathing space:

participants.sandbox.ledger-api.rate-limit.max-streams = 10000
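For orientation, the one-liner above corresponds to this nesting in a Canton HOCON config file; only the max-streams value comes from this thread, and the participant name "sandbox" is specific to this setup:

```
participants {
  sandbox {
    ledger-api {
      rate-limit {
        // Raise the active-stream ceiling from the default of 1000.
        max-streams = 10000
      }
    }
  }
}
```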

simonmaxen-da commented 1 year ago

This participant ticket has been created to track the fix on the participant side: https://digitalasset.atlassian.net/browse/DPP-1352

mziolekda commented 1 year ago

It seems to me that the trackers are responsible for all the outstanding subscriptions; they always go hand in hand:

zgrep -i "Received request for .*completion subs" canton.log.2022-11-25.gz | wc
    572   13906  379778
zgrep  -i "Registered a tracker" canton.log.2022-11-25.gz | wc
    509    8868  188288

The one bit I don't understand is why it gets cleaned up when the connection to the JSON API is closed.

mziolekda commented 1 year ago

It is not the completions. We have 998 requests for transactions:

2022-11-29 14:19:34,679 [canton-env-execution-context-21] INFO  c.d.p.a.s.t.ApiTransactionService - Received request for transactions., context: {participant: "sandbox", startExclusive: "000000000000000377", endInclusive: null, filters: {Investor1-Corporate Actions::12207e0be19b39add9576d114c58b5f334a524301eedc56b2a0b3aeabccc757c5254: [], Public-Corporate Actions::12207e0be19b39add9576d114c58b5f334a524301eedc56b2a0b3aeabccc757c5254: []}, verbose: true, ledgerId: "sandbox"}

Those subscriptions are never terminated. What has changed in 2.5 is that the ledger API server now pays attention to how many streams are kept open at any one time, and if you exceed 1000, it refuses to service another.

mziolekda commented 1 year ago

It is enough to leave the finance app on the login screen. It will issue 20 stream requests every minute.

I have added stream termination logging to the ledger api server and can confirm that the streams are never closed https://github.com/digital-asset/daml/pull/12687/files

bame-da commented 1 year ago

@mziolekda @simonmaxen-da I understand that once this error is hit, the Ledger API doesn't recover from it. Is that just because the connections never get freed up, or is this generally a fatal error?

I assume the answer is "not fatal, the ledger api will recover when connections get closed", but I want to be sure.

S11001001 commented 1 year ago

I assume the answer is "not fatal, the ledger api will recover when connections get closed", but I want to be sure.

My local testing in #15733 confirms this; restarting the json-api itself without restarting the sandbox restores things to working state.

S11001001 commented 1 year ago

I'm also occasionally seeing this one, but not sure it's related:

It's not related. #15922