Open RaduBerinde opened 1 week ago
The doc describing the reproduction steps should translate well into a roachtest, which would also do a good job of verifying that krb5 works end to end. We could start up the cluster, run a workload for ten minutes while also opening connections in a tight loop, then check the jemalloc profile output against an allowlist of allocations.
For example, here's a bad node at the time of writing (krb5 leak):
(jeprof) top10
Total: 25725874 objects
16176774 62.9% 62.9% 16176774 62.9% prof_backtrace_impl
3303852 12.8% 75.7% 9726464 37.8% krb5_authdata_context_init
1638412 6.4% 82.1% 1638412 6.4% authind_request_init
1638412 6.4% 88.5% 1638412 6.4% krb5int_open_plugin_dirs
1507374 5.9% 94.3% 1507374 5.9% s4u2proxy_request_init
1455010 5.7% 100.0% 17647805 68.6% kg_duplicate_name
6038 0.0% 100.0% 6038 0.0% _cgo_c900cce5b7d6_Cfunc_calloc
0 0.0% 100.0% 6 0.0% _GLOBAL__sub_I_eh_alloc.cc
0 0.0% 100.0% 6 0.0% _GLOBAL__sub_I_eh_alloc.cc (inline)
0 0.0% 100.0% 6 0.0% __libc_csu_init
Here's a good one:
Total: 680 objects
673 99.0% 99.0% 673 99.0% _cgo_c900cce5b7d6_Cfunc_calloc
6 1.0% 100.0% 6 1.0% prof_backtrace_impl
0 0.0% 100.0% 6 1.0% _GLOBAL__sub_I_eh_alloc.cc
0 0.0% 100.0% 6 1.0% _GLOBAL__sub_I_eh_alloc.cc (inline)
0 0.0% 100.0% 6 1.0% __libc_csu_init
0 0.0% 100.0% 6 1.0% __libc_start_main@GLIBC_2.2.5
0 0.0% 100.0% 6 1.0% __static_initialization_and_destruction_0 (inline)
0 0.0% 100.0% 6 1.0% _start
0 0.0% 100.0% 6 1.0% imalloc (inline)
0 0.0% 100.0% 6 1.0% imalloc_body (inline)
It looks like a good amount of meaningful verification can be achieved this way.
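To make the allowlist idea concrete, here is a rough Go sketch of the check. It assumes the profile is available as the text of `jeprof`'s `top` output in the column layout shown above (flat, flat%, sum%, cum, cum%, symbol). The function names, the threshold parameter, and the allowlist entries are all hypothetical; the entries are simply the two symbols with non-zero flat counts in the "good" profile. Matching the cgo calloc wrapper by prefix and suffix assumes its hash component can differ between builds.

```go
package main

import (
	"bufio"
	"sort"
	"strconv"
	"strings"
)

// Trimmed copies of the two profiles quoted in this issue.
const badSample = `Total: 25725874 objects
16176774 62.9% 62.9% 16176774 62.9% prof_backtrace_impl
3303852 12.8% 75.7% 9726464 37.8% krb5_authdata_context_init
1455010 5.7% 100.0% 17647805 68.6% kg_duplicate_name
6038 0.0% 100.0% 6038 0.0% _cgo_c900cce5b7d6_Cfunc_calloc
0 0.0% 100.0% 6 0.0% __libc_csu_init`

const goodSample = `Total: 680 objects
673 99.0% 99.0% 673 99.0% _cgo_c900cce5b7d6_Cfunc_calloc
6 1.0% 100.0% 6 1.0% prof_backtrace_impl
0 0.0% 100.0% 6 1.0% _GLOBAL__sub_I_eh_alloc.cc (inline)`

// parseJeprofTop returns the flat object count per symbol.
// Lines that do not look like data rows (prompt, "Total:" header) are skipped.
func parseJeprofTop(out string) map[string]int64 {
	flat := make(map[string]int64)
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Data rows: flat, flat%, sum%, cum, cum%, symbol...
		if len(fields) < 6 || !strings.HasSuffix(fields[1], "%") {
			continue
		}
		n, err := strconv.ParseInt(fields[0], 10, 64)
		if err != nil {
			continue
		}
		// Keep suffixes such as "(inline)" as part of the symbol name.
		flat[strings.Join(fields[5:], " ")] += n
	}
	return flat
}

// isAllowed is a hypothetical allowlist based on the "good" profile.
func isAllowed(sym string) bool {
	if sym == "prof_backtrace_impl" {
		return true
	}
	// Assumption: the hash in the cgo wrapper name varies between builds.
	return strings.HasPrefix(sym, "_cgo_") && strings.HasSuffix(sym, "_Cfunc_calloc")
}

// unexpectedAllocs returns, in sorted order, every symbol that is not
// allowlisted and whose flat object count exceeds threshold.
func unexpectedAllocs(flat map[string]int64, threshold int64) []string {
	var bad []string
	for sym, n := range flat {
		if n > threshold && !isAllowed(sym) {
			bad = append(bad, sym)
		}
	}
	sort.Strings(bad)
	return bad
}
```

A test would call `unexpectedAllocs(parseJeprofTop(output), 0)` on each node's profile and fail if the result is non-empty. Filtering on flat counts keeps startup frames such as `__libc_csu_init`, which only show cumulative counts, from tripping the check.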
We could also throw in some geospatial usage (surely we have some test suite we can run) to guard against regression of https://github.com/cockroachdb/cockroach/pull/98740; note that that PR does not add any test coverage. This would widen the scope of this issue to catching any unexpected cgo memory usage.
We recently discovered a memory leak when a connection is authenticated via Kerberos (see #130273).
We should add a roachtest that keeps establishing and destroying connections. The roachtest could perhaps verify the cgo memory stats to check for leaks. Or perhaps it can be set up with low-memory nodes, which would OOM without the fix in #130273.
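As a rough sketch of the first option (checking cgo memory stats around a connection loop), the core of such a test could look like the following. Everything here is hypothetical: `connect` stands in for opening and closing one Kerberos-authenticated connection, `cgoBytes` stands in for reading whatever cgo memory stat the node exposes, and the growth limit would need tuning against a known-good build.

```go
package main

import "fmt"

// checkCgoGrowth opens and closes `iterations` connections via connect and
// returns an error if the value reported by cgoBytes grew by more than
// maxGrowth bytes over the whole run.
func checkCgoGrowth(
	connect func() error,
	cgoBytes func() (int64, error),
	iterations int,
	maxGrowth int64,
) error {
	before, err := cgoBytes()
	if err != nil {
		return fmt.Errorf("reading baseline cgo memory: %w", err)
	}
	for i := 0; i < iterations; i++ {
		if err := connect(); err != nil {
			return fmt.Errorf("connection %d: %w", i, err)
		}
	}
	after, err := cgoBytes()
	if err != nil {
		return fmt.Errorf("reading final cgo memory: %w", err)
	}
	if growth := after - before; growth > maxGrowth {
		return fmt.Errorf("cgo memory grew by %d bytes over %d connections (limit %d)",
			growth, iterations, maxGrowth)
	}
	return nil
}
```

Passing the two hooks in as functions keeps the leak check itself independent of the SQL driver and the metrics endpoint, so it can be exercised with fakes before being wired into a real cluster.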
CC @tbg
Jira issue: CRDB-41964