cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: add a kerberos stress test #130274

Open RaduBerinde opened 1 week ago

RaduBerinde commented 1 week ago

We recently discovered a memory leak when a connection is authenticated via kerberos (see #130273).

We should add a roachtest that repeatedly establishes and tears down connections. The roachtest could check the cgo memory stats for leaks. Alternatively, it could be set up with low-memory nodes, where we get an OOM without the fix in #130273.

CC @tbg

Jira issue: CRDB-41964

tbg commented 1 week ago

The doc describing the reproduction steps should translate rather well into a roachtest that would also do a good job of verifying that krb5 works end to end. We could start up the cluster, run a workload for ten minutes while also opening connections in a tight loop, then check the jemalloc profile output against an allowlist of allocations.

For example, here's a bad node at the time of writing (krb5 leak):

(jeprof) top10
Total: 25725874 objects
16176774  62.9%  62.9% 16176774  62.9% prof_backtrace_impl
 3303852  12.8%  75.7%  9726464  37.8% krb5_authdata_context_init
 1638412   6.4%  82.1%  1638412   6.4% authind_request_init
 1638412   6.4%  88.5%  1638412   6.4% krb5int_open_plugin_dirs
 1507374   5.9%  94.3%  1507374   5.9% s4u2proxy_request_init
 1455010   5.7% 100.0% 17647805  68.6% kg_duplicate_name
    6038   0.0% 100.0%     6038   0.0% _cgo_c900cce5b7d6_Cfunc_calloc
       0   0.0% 100.0%        6   0.0% _GLOBAL__sub_I_eh_alloc.cc
       0   0.0% 100.0%        6   0.0% _GLOBAL__sub_I_eh_alloc.cc (inline)
       0   0.0% 100.0%        6   0.0% __libc_csu_init

Here's a good one:

Total: 680 objects
     673  99.0%  99.0%      673  99.0% _cgo_c900cce5b7d6_Cfunc_calloc
       6   1.0% 100.0%        6   1.0% prof_backtrace_impl
       0   0.0% 100.0%        6   1.0% _GLOBAL__sub_I_eh_alloc.cc
       0   0.0% 100.0%        6   1.0% _GLOBAL__sub_I_eh_alloc.cc (inline)
       0   0.0% 100.0%        6   1.0% __libc_csu_init
       0   0.0% 100.0%        6   1.0% __libc_start_main@GLIBC_2.2.5
       0   0.0% 100.0%        6   1.0% __static_initialization_and_destruction_0 (inline)
       0   0.0% 100.0%        6   1.0% _start
       0   0.0% 100.0%        6   1.0% imalloc (inline)
       0   0.0% 100.0%        6   1.0% imalloc_body (inline)

It looks like a good amount of meaningful verification can be achieved this way.

We could also throw in some geospatial usage (surely we have some test suite we can run) to guard against a regression of https://github.com/cockroachdb/cockroach/pull/98740. Note that that PR does not add any test coverage. This would widen the scope of this issue to catching any unexpected cgo memory usage.