Closed — yuzefovich closed this issue 4 days ago
[triage] Does the tpcc workload already do this for us? Do we run it long enough on the DRT cluster? Should the DRT team look into this overall?
[michae2] Concerned that spot instances going down frequently would kill long-lived sessions, preventing implementation of a workload like the one this issue describes.
cc @BabuSrithar @srosenberg
We observed on one of the CC clusters that a session which had issued 3M txns had its "memory usage" reported as around 400MiB.
For the regression test, would it suffice to assert on the memory usage of each long-lived session, using crdb_internal.node_memory_monitors for observability?
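A minimal sketch of what such an assertion could look like, assuming the `name` and `used` columns of crdb_internal.node_memory_monitors (column names may vary by version); the session-name prefix filter and the 400 MiB budget are illustrative, not actual CockroachDB identifiers:

```python
# Sketch of a per-session memory assertion for a regression test.
# Assumes rows fetched with something like:
#     SELECT name, used FROM crdb_internal.node_memory_monitors
# The "session" name prefix and the budget below are hypothetical.

SESSION_MEMORY_BUDGET = 400 << 20  # 400 MiB, matching the observed regression


def over_budget_sessions(rows, budget=SESSION_MEMORY_BUDGET):
    """Return (name, used_bytes) pairs for session-level monitors above budget.

    `rows` is an iterable of (name, used_bytes) tuples.
    """
    return [
        (name, used)
        for name, used in rows
        if name.startswith("session") and used > budget
    ]
```

The test would fail the moment any long-lived session's monitor reports usage above the budget, which is exactly the symptom observed on the CC cluster.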
For regression test for that particular bug, yes, we could do that. This issue is more general - about having very long-lived sessions in our tests somewhere since some of our customers never close connections (unless the nodes are restarted).
Yep, it makes sense! I was just confirming the general test strategy. This is a great candidate for long-running clusters. What about perturbations? A long-running cluster will inevitably experience external (and internal) failures; this would make it trickier to keep persistent sessions. (I assume a disconnect would be a deal breaker; i.e., would it resolve the memory leak in this case?)
Totally, I don't expect any kind of guarantee on the lifetime of these sessions, rather it'd be a good addition to our test suite to have sessions that are as long-lived as possible, on a best-effort basis. Any node restart or network hiccups should be ignored. My hope is that even best-effort with no guarantees would be sufficient to tickle some existing bugs / prevent new regressions.
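The best-effort approach described above could be sketched as a loop that keeps a single session alive as long as possible and silently replaces it after any failure. This is a sketch under stated assumptions: `connect` and `run_txn` are hypothetical injected callables standing in for a real SQL driver connection, not an actual workload API:

```python
# Best-effort long-lived session loop: keep one session open as long as
# possible, and reconnect after node restarts or network hiccups instead
# of treating them as fatal. `connect` and `run_txn` are injected so the
# sketch stays self-contained; in a real workload they would wrap a SQL
# driver connection and a transaction against the cluster.
import time


def long_lived_session(connect, run_txn, iterations, retry_delay=0.0):
    """Run `iterations` transactions on one session, reconnecting on error.

    Returns the number of reconnects that were needed. No lifetime
    guarantee is attempted: a dropped connection is simply replaced.
    """
    reconnects = 0
    conn = connect()
    done = 0
    while done < iterations:
        try:
            run_txn(conn)
            done += 1
        except Exception:
            # Node restart / network hiccup: replace the session and go on.
            time.sleep(retry_delay)
            conn = connect()
            reconnects += 1
    return reconnects
```

Because reconnects are counted rather than treated as failures, the same loop runs unchanged through cluster perturbations, while each uninterrupted stretch still exercises session longevity.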
Not sure why this was added back to SQL Queries. Removing it from our board since it looks like the DRT team is working on it.
cockroach workload has a --max-conn-lifetime flag (e.g. --max-conn-lifetime=1h) which can be used to increase the connection pool's active time; by default it is 5m. This flag governs the session's active duration. We can adjust this time for all workloads running on drt-large.
@rytaft What would be the ideal session lifetime? Should we increase it to something like 12h, or do you want more than that, like 24h or 72h?
Hi @csgourav -- the issue description says "it might require to have sessions that are running for weeks and only close when the nodes restart." So ideally we'd want the sessions to be running as long as possible. Thank you!
cc @cockroachdb/test-eng
In order for us to reproduce and catch some bugs (e.g. #121844), it might be necessary to have sessions that run for weeks and only close when the nodes restart. We cannot really replicate such a scenario in either CI or roachtests, and the DRT cluster seems like a perfect fit. We should consider introducing a simple workload that would run continuously / periodically on such long-lived sessions.
Jira issue: CRDB-37733