Closed — yuzefovich closed this issue 4 days ago
[triage] Does the tpcc workload already do this for us? Do we run it long enough on the DRT cluster? Should the DRT team look into this overall?
[michae2] Concerned that spot instances going down frequently would kill long-lived sessions, preventing implementation of a workload like the one this issue describes.
cc @BabuSrithar @srosenberg
We observed on one of the CC clusters that a session which had issued 3M txns had its "memory usage" reported as around 400MiB.
For the regression test, would it suffice to assert on the memory usage of each long-lived session, using crdb_internal.node_memory_monitors for observability?
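A minimal sketch of what such an assertion could look like, assuming the `name` and `used` columns of crdb_internal.node_memory_monitors (column names may vary by version); the session-name prefix filter and the 400 MiB budget are illustrative, not actual CockroachDB identifiers:

```python
# Sketch of a per-session memory assertion for a regression test.
# Assumes rows fetched with something like:
#     SELECT name, used FROM crdb_internal.node_memory_monitors
# The "session" name prefix and the budget below are hypothetical.

SESSION_MEMORY_BUDGET = 400 << 20  # 400 MiB, matching the observed regression


def over_budget_sessions(rows, budget=SESSION_MEMORY_BUDGET):
    """Return (name, used_bytes) pairs for session-level monitors above budget.

    `rows` is an iterable of (name, used_bytes) tuples.
    """
    return [
        (name, used)
        for name, used in rows
        if name.startswith("session") and used > budget
    ]
```

The test would fail the moment any long-lived session's monitor reports usage above the budget, which is exactly the symptom observed on the CC cluster.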
For regression test for that particular bug, yes, we could do that. This issue is more general - about having very long-lived sessions in our tests somewhere since some of our customers never close connections (unless the nodes are restarted).
Yep, it makes sense! I was just confirming the general test strategy. This is a great candidate for long-running clusters. What about perturbations? A long-running cluster will inevitably experience external (and internal) failures; this would make it trickier to keep persistent sessions. (I assume a disconnect would be a deal breaker; i.e., would it resolve the memory leak in this case?)
Totally, I don't expect any kind of guarantee on the lifetime of these sessions, rather it'd be a good addition to our test suite to have sessions that are as long-lived as possible, on a best-effort basis. Any node restart or network hiccups should be ignored. My hope is that even best-effort with no guarantees would be sufficient to tickle some existing bugs / prevent new regressions.
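The best-effort approach described above could be sketched as a loop that keeps a single session alive as long as possible and silently replaces it after any failure. This is a sketch under stated assumptions: `connect` and `run_txn` are hypothetical injected callables standing in for a real SQL driver connection, not an actual workload API:

```python
# Best-effort long-lived session loop: keep one session open as long as
# possible, and reconnect after node restarts or network hiccups instead
# of treating them as fatal. `connect` and `run_txn` are injected so the
# sketch stays self-contained; in a real workload they would wrap a SQL
# driver connection and a transaction against the cluster.
import time


def long_lived_session(connect, run_txn, iterations, retry_delay=0.0):
    """Run `iterations` transactions on one session, reconnecting on error.

    Returns the number of reconnects that were needed. No lifetime
    guarantee is attempted: a dropped connection is simply replaced.
    """
    reconnects = 0
    conn = connect()
    done = 0
    while done < iterations:
        try:
            run_txn(conn)
            done += 1
        except Exception:
            # Node restart / network hiccup: replace the session and go on.
            time.sleep(retry_delay)
            conn = connect()
            reconnects += 1
    return reconnects
```

Because reconnects are counted rather than treated as failures, the same loop runs unchanged through cluster perturbations, while each uninterrupted stretch still exercises session longevity.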
Not sure why this was added back to SQL Queries. Removing it from our board since it looks like the DRT team is working on it.
cockroach workload has a --max-conn-lifetime flag (e.g. --max-conn-lifetime=1h) which can be used to increase the connection pool's active time; by default it is 5m. This flag governs the session's active duration. We can adjust this time for all workloads running on drt-large.
@rytaft What would be the ideal session lifetime? Should we increase it to something like 12h, or do you want more than that, like 24h or 72h?
Hi @csgourav -- the issue description says "it might require to have sessions that are running for weeks and only close when the nodes restart." So ideally we'd want the sessions to be running as long as possible. Thank you!
cc @cockroachdb/test-eng
In order for us to reproduce and catch some bugs (e.g. #121844), it might be necessary to have sessions that run for weeks and only close when the nodes restart. We cannot really replicate such a scenario in either CI or roachtests, and the DRT cluster seems like a perfect fit. We should consider introducing a simple workload that would run continuously / periodically on such long-lived sessions.
Jira issue: CRDB-37733