NVIDIA / DeepLearningExamples


[Kaldi/Speechrecognition] Potential memory "leak"? #1240

Open · git-bruh opened 1 year ago

git-bruh commented 1 year ago

**Related to Model/Framework(s)**

Kaldi/Speechrecognition

**Describe the bug**

There seems to be a memory leak in the Kaldi Triton backend: memory usage grows progressively with each inference performed.

**To Reproduce**

Steps to reproduce the behavior:

  1. Launch the server with ./scripts/docker/launch_server.sh
  2. Note the memory usage, which idles at ~5 GiB.
  3. Launch the demo client with ./scripts/docker/launch_client.sh -p
  4. Watch the memory usage rise to ~18 GiB and never drop back, even after the client has exited.
  5. Run the client another 4-5 times and watch the RAM usage creep up to ~19.5 GiB (see the sketch after this list for one way to check where that memory sits).
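To confirm the growth is in the server process itself (rather than, say, the page cache), something like the following can be run between steps. This is a minimal sketch: it assumes the script runs on the host (container processes show up in the host's /proc) and that the server process is named `tritonserver`.

```python
#!/usr/bin/env python3
# Report the resident set size (VmRSS) of the tritonserver process.
# Assumption: the process is named "tritonserver"; /proc/<pid>/status
# is world-readable, so this works from the host for a containerized
# server too.
import re
from pathlib import Path

def tritonserver_rss_gib():
    for status in Path("/proc").glob("[0-9]*/status"):
        try:
            text = status.read_text()
        except OSError:
            continue  # process exited while we were scanning
        if text.startswith("Name:\ttritonserver"):
            kib = int(re.search(r"VmRSS:\s+(\d+) kB", text).group(1))
            return kib / 2**20  # kB -> GiB
    return None

rss = tritonserver_rss_gib()
print(f"tritonserver RSS: {rss:.1f} GiB" if rss else "tritonserver not found")
```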

With a custom client based on the demo client, repeatedly running inference on a ~200 MB wav file, memory usage grows as follows (measured after the Nth run; each inference takes around 3 minutes):

1.mem: 6.9Gi
2.mem: 7.6Gi
3.mem: 8.3Gi
4.mem: 9.1Gi
5.mem: 9.8Gi
6.mem: 10Gi
7.mem: 11Gi
8.mem: 11Gi
9.mem: 12Gi
10.mem: 13Gi
11.mem: 14Gi
12.mem: 14Gi
13.mem: 15Gi
14.mem: 16Gi
15.mem: 16Gi
16.mem: 17Gi
17.mem: 18Gi
18.mem: 18Gi
19.mem: 19Gi
20.mem: 20Gi
21.mem: 20Gi
22.mem: 20Gi
23.mem: 20Gi
24.mem: 20Gi
25.mem: 20Gi
26.mem: 21Gi
27.mem: 21Gi
28.mem: 22Gi
29.mem: 22Gi
30.mem: 22Gi
31.mem: 22Gi
32.mem: 23Gi
33.mem: 24Gi
34.mem: 25Gi
35.mem: 25Gi
36.mem: 25Gi
37.mem: 26Gi
38.mem: 26Gi
39.mem: 27Gi
40.mem: 27Gi
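For reference, numbers like the above can be collected with a loop along these lines. This is a sketch, not the actual client used (which is not shown here): it re-runs the demo client each iteration and prints system used memory (total minus available, roughly what `free` reports) after each run.

```python
#!/usr/bin/env python3
# Sketch of a measurement loop: run the client N times and log how much
# system memory is in use after each run. Run from Kaldi/SpeechRecognition
# with the server already up; launch_client.sh stands in for the custom
# client used to produce the numbers above.
import subprocess

def used_gib():
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.split()[0])  # values are in kB
    return (fields["MemTotal"] - fields["MemAvailable"]) / 2**20  # GiB

for n in range(1, 41):
    subprocess.run(["./scripts/docker/launch_client.sh", "-p"], check=True)
    print(f"{n}.mem: {used_gib():.1f}Gi")
```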

**Expected behavior**

Memory usage returns to normal after a bulk inference completes. This behavior appears to be influenced by the max_active argument: setting it to a lower value makes memory usage plateau at a lower level than the default does.
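For reference, a sketch of where max_active would be tuned, assuming it is exposed as a model parameter in the Kaldi model's config.pbtxt (Triton's text-format model config); the value shown is illustrative, not necessarily the repo's default:

```
# Sketch: Triton model config parameter (config.pbtxt text format).
# The value "10000" is illustrative only.
parameters {
  key: "max_active"
  value { string_value: "10000" }
}
```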

Reading the description of max-active ("max-active: at the end of each frame computation, we keep only its best max-active tokens (arc instantiations)"), it seems as if Kaldi keeps up to max-active tokens from every frame computation (the processing of each chunk), but this memory never seems to get freed, even after all the chunks corresponding to a given correlation ID have been processed. Is this intentional? I might be misunderstanding; this is just what I could make out from the docs.
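A back-of-envelope check of that theory, where every input is an assumption (the issue does not state the wav format, and the ~0.7 GiB-per-run growth is read off the list above):

```python
# Rough sanity check: how many bytes would be retained per frame if
# decoder state were kept for every frame and never freed? All inputs
# below are assumptions, not values reported in the issue.
wav_bytes = 200e6                        # ~200 MB wav file
bytes_per_sec = 16000 * 2                # assume 16 kHz, 16-bit, mono
duration_s = wav_bytes / bytes_per_sec   # ~6250 s of audio
frames = duration_s * 100                # assume Kaldi's 10 ms frame shift

growth = 0.7 * 2**30                     # ~0.7 GiB growth per run
print(f"~{growth / frames:.0f} bytes retained per frame")
# -> ~1200 bytes/frame, i.e. on the order of 1 KiB of decoder state per
# frame, which would be consistent with a small per-frame structure
# (not all max_active tokens) being kept for the whole correlation ID.
```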

**Environment**

Please provide at least:

git-bruh commented 1 year ago

See also: https://github.com/NVIDIA/DeepLearningExamples/issues/795