@GeorgSchneider This error is passing through from the participant. Have you seen whether JSON API has >=1000 active TCP connections to the participant when this occurs?
Your log message implies that JSON API is effectively managing 39 connections at the time of error; either the other 961 have been forgotten but still exist, or they really don't exist and have been miscounted elsewhere.
@S11001001 do you have a handy command I can use to see which process has TCP connections open? I will run it next time I encounter it.
@GeorgSchneider I think you can do `netstat -p tcp | grep $PORT`, where PORT is your participant server port. ESTABLISHED connections are active. CLOSE_WAIT connections are also interesting; if there are very many of them, they are a sign of a different problem worth knowing about.
I'm not sure how to filter down to what JSON API is using, but I assume you have nothing else connecting to the participant in your case.
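If it helps, here is a minimal sketch for tallying connections by TCP state; the port (6865, the default ledger API port) and the `netstat -tn` flags are assumptions, so adjust them to your setup:

```sh
# Count connections touching the participant port, grouped by TCP state.
# PORT=6865 is an assumption (default ledger API port); change as needed.
PORT=6865
netstat -tn | awk -v port=":${PORT}" '$4 ~ port || $5 ~ port {print $6}' | sort | uniq -c
```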
It does not seem to have excessive TCP connections open. Anything else I can check?
Output from `netstat -p tcp`:
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 localhost:41298 localhost:33129 ESTABLISHED 1410/node
tcp 0 0 localhost:48502 localhost:33129 ESTABLISHED 30647/node
tcp 0 0 localhost:33129 localhost:48500 ESTABLISHED 1376/node
tcp 0 0 localhost:41300 localhost:33129 ESTABLISHED 1439/node
tcp 0 0 localhost:33129 localhost:41300 ESTABLISHED 1523/node
tcp 0 0 localhost:33129 localhost:48502 ESTABLISHED 30654/node
tcp 0 0 localhost:3000 localhost:49064 ESTABLISHED 858/node
tcp 0 0 localhost:49064 localhost:3000 ESTABLISHED -
tcp 0 0 localhost:48500 localhost:33129 ESTABLISHED 30638/node
tcp 0 0 localhost:33129 localhost:41298 ESTABLISHED 1376/node
tcp6 0 0 localhost:60162 localhost:6867 ESTABLISHED 32593/java
tcp6 0 0 localhost:6867 localhost:60160 ESTABLISHED 32593/java
tcp6 0 0 localhost:38040 localhost:6865 ESTABLISHED 32593/java
tcp6 0 0 localhost:6865 localhost:38040 ESTABLISHED 32593/java
tcp6 0 0 localhost:60190 localhost:6867 ESTABLISHED 32593/java
tcp6 0 0 localhost:6867 localhost:60164 ESTABLISHED 32593/java
tcp6 0 0 localhost:60166 localhost:6867 ESTABLISHED 32593/java
tcp6 0 0 localhost:54090 localhost:35611 ESTABLISHED 1579/ld-linux-x86-6
tcp6 0 0 localhost:38080 localhost:6865 ESTABLISHED 428/java
tcp6 0 0 localhost:6865 localhost:38080 ESTABLISHED 32593/java
tcp6 0 0 localhost:60164 localhost:6867 ESTABLISHED 32593/java
tcp6 0 0 localhost:6867 localhost:60162 ESTABLISHED 32593/java
tcp6 0 0 localhost:60188 localhost:6867 ESTABLISHED 32593/java
tcp6 0 0 localhost:38120 localhost:6865 ESTABLISHED 428/java
tcp6 0 0 localhost:6867 localhost:60190 ESTABLISHED 32593/java
tcp6 0 0 localhost:42507 localhost:36406 ESTABLISHED 31205/java
tcp6 0 0 localhost:60160 localhost:6867 ESTABLISHED 32593/java
tcp6 0 0 localhost:6867 localhost:60188 ESTABLISHED 32593/java
tcp6 0 0 localhost:35611 localhost:54090 ESTABLISHED 1647/java
tcp6 0 0 localhost:36406 localhost:42507 ESTABLISHED 30710/ld-linux-x86-
tcp6 0 0 localhost:6865 localhost:38120 ESTABLISHED 32593/java
tcp6 0 0 localhost:38038 localhost:6865 ESTABLISHED 32593/java
tcp6 0 0 localhost:6867 localhost:60166 ESTABLISHED 32593/java
tcp6 0 0 localhost:6865 localhost:38038 ESTABLISHED 32593/java
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags Type State I-Node PID/Program name Path
unix 3 [ ] STREAM CONNECTED 78344 30654/node
unix 3 [ ] STREAM CONNECTED 74380 30654/node
unix 2 [ ] STREAM CONNECTED 83260 32593/java
unix 3 [ ] STREAM CONNECTED 74407 30654/node
unix 2 [ ] STREAM CONNECTED 82245 30043/git
unix 3 [ ] STREAM CONNECTED 78353 30665/node
unix 3 [ ] STREAM CONNECTED 48421 19789/node
unix 3 [ ] STREAM CONNECTED 78345 1376/node
unix 3 [ ] STREAM CONNECTED 78348 30654/node
unix 3 [ ] STREAM CONNECTED 74383 30699/ld-linux-x86-
unix 3 [ ] STREAM CONNECTED 78352 1376/node
unix 3 [ ] STREAM CONNECTED 74382 30654/node
unix 2 [ ] STREAM CONNECTED 83093 31205/java
unix 2 [ ] STREAM CONNECTED 74617 31205/java
unix 2 [ ] STREAM CONNECTED 31791 5569/git
unix 3 [ ] STREAM CONNECTED 2719 1568/ld-linux-x86-6
unix 3 [ ] STREAM CONNECTED 74406 30725/git
unix 3 [ ] STREAM CONNECTED 2717 1568/ld-linux-x86-6
unix 2 [ ] STREAM CONNECTED 31793 5569/git
unix 2 [ ] STREAM CONNECTED 78625 32593/java
unix 3 [ ] STREAM CONNECTED 48420 1376/node
unix 3 [ ] STREAM CONNECTED 78347 1376/node
unix 3 [ ] STREAM CONNECTED 74408 30725/git
unix 3 [ ] STREAM CONNECTED 74405 30654/node
unix 3 [ ] STREAM CONNECTED 2716 1523/node
unix 3 [ ] STREAM CONNECTED 2715 1568/ld-linux-x86-6
unix 3 [ ] STREAM CONNECTED 78343 1376/node
unix 3 [ ] STREAM CONNECTED 2718 1523/node
unix 3 [ ] STREAM CONNECTED 78349 1376/node
unix 3 [ ] STREAM CONNECTED 2714 1523/node
unix 3 [ ] STREAM CONNECTED 78346 30654/node
unix 3 [ ] STREAM CONNECTED 74381 30699/ld-linux-x86-
unix 2 [ ] STREAM CONNECTED 59832 23143/git
unix 3 [ ] STREAM CONNECTED 74385 30699/ld-linux-x86-
unix 3 [ ] STREAM CONNECTED 78350 30654/node
unix 3 [ ] STREAM CONNECTED 74384 30654/node
unix 2 [ ] STREAM CONNECTED 82243 30043/git
unix 2 [ ] STREAM CONNECTED 59834 23143/git
unix 3 [ ] STREAM CONNECTED 20747 1376/node
unix 3 [ ] STREAM CONNECTED 739 1523/node
unix 3 [ ] STREAM CONNECTED 98899 -
unix 3 [ ] STREAM CONNECTED 20748 1523/node
unix 2 [ ] STREAM CONNECTED 709 1647/java
unix 3 [ ] STREAM CONNECTED 98902 -
unix 3 [ ] STREAM CONNECTED 20745 1376/node
unix 3 [ ] STREAM CONNECTED 20746 1523/node
unix 3 [ ] STREAM CONNECTED 39090 1523/node
unix 3 [ ] STREAM CONNECTED 19454 1989/node
unix 3 [ ] STREAM CONNECTED 20744 1523/node
unix 3 [ ] STREAM CONNECTED 98900 -
unix 3 [ ] STREAM CONNECTED 740 1854/git
unix 3 [ ] STREAM CONNECTED 20742 1523/node
unix 3 [ ] STREAM CONNECTED 23554 1989/node
unix 3 [ ] STREAM CONNECTED 20743 1376/node
unix 3 [ ] STREAM CONNECTED 39092 1523/node
unix 3 [ ] STREAM CONNECTED 85502 858/node
unix 3 [ ] STREAM CONNECTED 19453 1523/node
unix 3 [ ] STREAM CONNECTED 20741 1376/node
unix 3 [ ] STREAM CONNECTED 39091 9311/node
unix 3 [ ] STREAM CONNECTED 19455 1523/node
unix 3 [ ] STREAM CONNECTED 742 1854/git
unix 3 [ ] STREAM CONNECTED 19262 1662/node
unix 3 [ ] STREAM CONNECTED 20718 1376/node
unix 3 [ ] STREAM CONNECTED 741 1523/node
unix 3 [ ] STREAM CONNECTED 23555 1523/node
unix 3 [ ] STREAM CONNECTED 39096 1523/node
unix 3 [ ] STREAM CONNECTED 19456 1989/node
unix 3 [ ] STREAM CONNECTED 20719 1387/node
unix 3 [ ] STREAM CONNECTED 39097 9311/node
unix 3 [ ] STREAM CONNECTED 19261 1523/node
unix 3 [ ] STREAM CONNECTED 39095 9311/node
unix 3 [ ] STREAM CONNECTED 19258 1662/node
unix 2 [ ] STREAM CONNECTED 20851 2102/git
unix 3 [ ] STREAM CONNECTED 19257 1523/node
unix 3 [ ] STREAM CONNECTED 39094 1523/node
unix 3 [ ] STREAM CONNECTED 19260 1662/node
unix 3 [ ] STREAM CONNECTED 85503 874/node
unix 3 [ ] STREAM CONNECTED 98901 -
unix 3 [ ] STREAM CONNECTED 85506 875/node
unix 2 [ ] STREAM CONNECTED 81798 428/java
unix 3 [ ] STREAM CONNECTED 19259 1523/node
unix 3 [ ] STREAM CONNECTED 19255 1523/node
unix 2 [ ] STREAM CONNECTED 85356 428/java
unix 3 [ ] STREAM CONNECTED 98897 -
unix 3 [ ] STREAM CONNECTED 19256 1662/node
unix 3 [ ] STREAM CONNECTED 85505 858/node
unix 3 [ ] STREAM CONNECTED 39093 9311/node
unix 3 [ ] STREAM CONNECTED 23556 1989/node
unix 3 [ ] STREAM CONNECTED 23553 1523/node
unix 3 [ ] STREAM CONNECTED 98898 -
unix 2 [ ] STREAM CONNECTED 20853 2102/git
unix 2 [ ] STREAM CONNECTED 695 1647/java
@GeorgSchneider / @S11001001 - I'll take a look into this. I have two questions:
@mziolekda:
Note that the error seems to be thrown by the JSON API, and related to the websocket connections. But the number of active websocket connections being reported is usually <50 when this happens.
I'm able to reproduce this problem. No creation of any contracts is required; simply start the application and then navigate around.
Lots of websocket connection churn is reported as we go:
25-11-2022 15:53:55.488 [http-json-ledger-api-akka.actor.default-dispatcher-6] INFO com.daml.http.WebSocketService - New websocket client has connected, current number of clients:78, context: {instance_uuid: "f48fc230-1bce-464f-bb7a-044068ffcd7e", request_id: "8041db02-45d0-4e5c-b0a1-f870b4343e85"}
25-11-2022 15:53:57.937 [http-json-ledger-api-akka.actor.default-dispatcher-6] INFO com.daml.http.WebsocketEndpoints - GOT daml.ws.auth, context: {instance_uuid: "f48fc230-1bce-464f-bb7a-044068ffcd7e", request_id: "e8efd627-2e95-470c-8253-095362709bce"}
25-11-2022 15:53:57.947 [http-json-ledger-api-akka.actor.default-dispatcher-6] INFO com.daml.http.WebSocketService - New websocket client has connected, current number of clients:79, context: {instance_uuid: "f48fc230-1bce-464f-bb7a-044068ffcd7e", request_id: "e8efd627-2e95-470c-8253-095362709bce"}
25-11-2022 15:53:58.153 [http-json-ledger-api-akka.actor.default-dispatcher-8] INFO com.daml.http.WebsocketEndpoints - GOT daml.ws.auth, context: {instance_uuid: "f48fc230-1bce-464f-bb7a-044068ffcd7e", request_id: "3ed21523-affb-41e7-87b8-d41e4a31b681"}
25-11-2022 15:53:58.156 [http-json-ledger-api-akka.actor.default-dispatcher-8] INFO com.daml.http.WebSocketService - New websocket client has connected, current number of clients:80, context: {instance_uuid: "f48fc230-1bce-464f-bb7a-044068ffcd7e", request_id: "3ed21523-affb-41e7-87b8-d41e4a31b681"}
25-11-2022 15:54:07.196 [http-json-ledger-api-akka.actor.default-dispatcher-11] INFO com.daml.http.WebSocketService - Websocket client has disconnected. Current number of clients: 79, context: {instance_uuid: "f48fc230-1bce-464f-bb7a-044068ffcd7e", request_id: "d967dd70-b968-4da4-b72d-e2d184718173"}
25-11-2022 15:54:07.216 [http-json-ledger-api-akka.actor.default-dispatcher-13] INFO com.daml.http.WebSocketService - Websocket client has disconnected. Current number of clients: 78, context: {instance_uuid: "f48fc230-1bce-464f-bb7a-044068ffcd7e", request_id: "6542975e-01b2-48ca-aa5c-e89b667d6aa3"}
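As a quick sanity check on that churn, assuming the JSON API output above is captured to a file (the `json-api.log` name is hypothetical), the two counts below should stay close to the reported client count if the bookkeeping is consistent:

```sh
# Hypothetical log file name; point these at wherever the JSON API output lands.
grep -c "New websocket client has connected" json-api.log
grep -c "Websocket client has disconnected" json-api.log
```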
If Prometheus metrics are enabled with Canton, then the `sandbox_daml_lapi_streams_active` metric can be seen to continually increase.
$ curl -s localhost:6200/metrics | grep -v '^#' | sort | grep daml_lapi_streams_active
sandbox_daml_lapi_streams_active 206.0
I thought this could be a leak in the interceptor that was failing to clean up closed connections; however, when the http-json process is terminated, the metric count returns to the base level:
curl -s localhost:6200/metrics | grep -v '^#' | sort | grep daml_lapi_streams_active
sandbox_daml_lapi_streams_active 4.0
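To watch the climb over time, a simple loop against the same endpoint works (a sketch; the 30-second interval is arbitrary):

```sh
# Poll the metric periodically; the value should only ratchet upwards while the
# finance app UI is open, and fall back once the http-json process is terminated.
while true; do
  printf '%s ' "$(date +%T)"
  curl -s localhost:6200/metrics | grep '^sandbox_daml_lapi_streams_active'
  sleep 30
done
```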
Lots of messages saying the tracker is shutting down due to inactivity:
INFO c.d.p.a.s.t.TrackerMap - Shutting down tracker for Key(daml-script,Set(Exchange-Default::1220abbd7514ccc56e231d4eb9901abb888840a7e659f52d495cd6b1b8b62be685a4)) after inactivity of PT5M, context: {participant: "sandbox"}
INFO c.d.p.a.s.t.TrackerMap - Shutting down tracker for Key(daml-script,Set(Investor2-Corporate Actions::1220abbd7514ccc56e231d4eb9901abb888840a7e659f52d495cd6b1b8b62be685a4)) after inactivity of PT5M, context: {participant: "sandbox"}
INFO c.d.p.a.s.t.TrackerMap - Shutting down tracker for Key(daml-script,Set(Investor3-Structured Notes::1220abbd7514ccc56e231d4eb9901abb888840a7e659f52d495cd6b1b8b62be685a4)) after inactivity of PT5M, context: {participant: "sandbox"}
INFO c.d.p.a.s.t.TrackerMap - Shutting down tracker for Key(daml-script,Set(CentralBank-Corporate Actions::1220abbd7514ccc56e231d4eb9901abb888840a7e659f52d495cd6b1b8b62be685a4)) after inactivity of PT5M, context: {participant: "sandbox"}
INFO c.d.p.a.s.t.TrackerMap - Shutting down tracker for Key(daml-script,Set(Seller-Natural Gas::1220abbd7514ccc56e231d4eb9901abb888840a7e659f52d495cd6b1b8b62be685a4, Buyer-Natural Gas::1220abbd7514ccc56e231d4eb9901abb888840a7e659f52d495cd6b1b8b62be685a4)) after inactivity of PT5M, context: {participant: "sandbox"}
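A rough way to count how many distinct tracker keys have been spun up and later shut down for inactivity (the log file name is an assumption matching the zgrep commands further down; adjust as needed):

```sh
# Dedupe the inactivity-shutdown messages by their Key(...) payload.
zgrep -o 'Shutting down tracker for Key([^)]*)' canton.log.2022-11-25.gz | sort -u | wc -l
```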
@mziolekda - as you suspected, the tracker key combinations do mount up. key.txt
Great that you could reproduce it.
We do use an aggressive websocket recycling setting: https://github.com/digital-asset/daml-finance-app/blob/34dcfc2d25d3825878afda77a3e5a5d6308758a8/json-api.conf#L13
Otherwise, when I keep switching users during a demo the active websocket connections quickly approach 500, at which point things break down.
@GeorgSchneider - Adding this to the Canton config will not fix the underlying issue, but it may buy you a bit more breathing space:
participants.sandbox.ledger-api.rate-limit.max-streams = 10000
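If it is easier to keep the override separate, here is a minimal sketch of it as a standalone HOCON fragment; the `canton.` prefix and the `sandbox` participant name are assumptions matching this thread, and how the fragment is merged into your existing config depends on your deployment:

```sh
# Sketch only: write the rate-limit override as its own HOCON fragment.
# The canton.* prefix and participant name "sandbox" are assumptions; adjust to your setup.
cat > rate-limit.conf <<'EOF'
canton.participants.sandbox.ledger-api.rate-limit.max-streams = 10000
EOF
```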
This participant ticket has been created to track the fix on the participant side: https://digitalasset.atlassian.net/browse/DPP-1352
It seems to me that the trackers are responsible for all the outstanding subscriptions; they always go hand in hand:
zgrep -i "Received request for .*completion subs" canton.log.2022-11-25.gz | wc
572 13906 379778
zgrep -i "Registered a tracker" canton.log.2022-11-25.gz | wc
509 8868 188288
The one bit I don't understand is why it gets cleaned up when the connection to the JSON API is closed.
It is not the completions. We have 998 requests for transactions:
2022-11-29 14:19:34,679 [canton-env-execution-context-21] INFO c.d.p.a.s.t.ApiTransactionService - Received request for transactions., context: {participant: "sandbox", startExclusive: "000000000000000377", endInclusive: null, filters: {Investor1-Corporate Actions::12207e0be19b39add9576d114c58b5f334a524301eedc56b2a0b3aeabccc757c5254: [], Public-Corporate Actions::12207e0be19b39add9576d114c58b5f334a524301eedc56b2a0b3aeabccc757c5254: []}, verbose: true, ledgerId: "sandbox"}
Those subscriptions are never terminated. What has changed in 2.5 is that the ledger API server now pays attention to how many streams are kept open at any one time, and if you exceed 1000, it refuses to service another.
It is enough to leave the finance app on the login screen; it will issue 20 stream requests every minute.
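In the same spirit as the completion-subscription grep above, a rough way to confirm the transaction-stream count from the log (same file name as above):

```sh
# Count how many transaction stream subscriptions the participant has received;
# with no matching termination logging (see the linked PR below) these simply accumulate.
zgrep -c "Received request for transactions" canton.log.2022-11-25.gz
```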
I have added stream termination logging to the ledger API server and can confirm that the streams are never closed: https://github.com/digital-asset/daml/pull/12687/files
@mziolekda @simonmaxen-da I understand that once this error is hit, the Ledger API doesn't recover from it. Is that just because the connections never get freed up, or is this generally a fatal error?
I assume the answer is "not fatal, the ledger api will recover when connections get closed", but I want to be sure.
> I assume the answer is "not fatal, the ledger api will recover when connections get closed", but I want to be sure.
My local testing in #15733 confirms this; restarting the json-api itself without restarting the sandbox restores things to working state.
> I'm also occasionally seeing this one, but not sure it's related:
It's not related. #15922
Affected Daml version
2.5.0-snapshot.20221107.10906.0.346bb489 (possibly earlier ones as well, but I've not seen it with 2.5.0-snapshot.20221010.10736.0.2f453a14)
Bug description
After running and using the daml-finance-app for a while I'm getting errors relating to websocket streams. I'm also occasionally seeing this one, but not sure it's related:
The app becomes unresponsive after the first error, so it seems that it's not just wrong reporting of the number of streams.
I'm not sure if this was already present in a previous snapshot, I've only seen it happen now on the latest (version above).
To reproduce
Open localhost:3000 in 3 or 4 tabs, quickly, until the error appears in the daml start log.
Expected behavior
As said, this seems to be related to some recent changes, as I've never seen this on older snapshots or <= 2.4.x.