criteo/consul-templaterb

consul-template-like with erb (ruby) template expressiveness
Apache License 2.0

Potential Memory Leak #87

Closed (MillsyBot closed this issue 1 year ago)

MillsyBot commented 2 years ago

Hey Criteo friends,

We have been using consul-templaterb and we have noticed that, over time, the memory utilization of the process slowly creeps up; it recently caused some issues for our cluster.

We would like to provide more information to help with diagnosis and remediation. What would you need to help find the issue?

pierresouchay commented 2 years ago

> Hey Criteo friends,
>
> We have been using consul-templaterb and we have noticed that, over time, the memory utilization of the process slowly creeps up; it recently caused some issues for our cluster.
>
> We would like to provide more information to help with diagnosis and remediation. What would you need to help find the issue?

Did you try the --debug-memory-usage flag?

Do you use your own templates or some of the examples provided?

Are you generating several templates at once per instance?

A very common cause of leaks is the use of instance variables (names starting with the @ character)... Are you using any?
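
For illustration, a minimal, hypothetical sketch of the kind of pattern meant here (the variable names are made up): an instance variable survives between renderings, so anything appended to it accumulates, whereas a plain local variable is rebuilt on each rendering and can be garbage-collected.

  <%-
    # Hypothetical leaky pattern: @all_tags persists across renderings,
    # so it grows a little every time the template is evaluated.
    @all_tags ||= []
    services.each { |_service_name, tags| @all_tags << tags }

    # Safer alternative: a local variable is rebuilt from scratch on each run.
    all_tags = []
    services.each { |_service_name, tags| all_tags << tags }
  -%>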

MillsyBot commented 2 years ago

Did you try the --debug-memory-usage flag? Here is the output from --debug-memory-usage:

[MEMORY] 2022-06-16 18:54:02 UTC significant RAM Usage detected
[MEMORY] 2022-06-16 18:54:02 UTC Pages  : 525 (diff 0 aka 0/s) 
[MEMORY] 2022-06-16 18:54:02 UTC Objects: 154462 (diff 6998 aka 683/s)

Do you use your own templates or some of the examples provided? I have written some ERB templates based on some of the examples provided; I would probably have to share the code some way other than GitHub, though.

Are you generating several templates at once per instance? Yes, we have one for a large HAProxy config, one for an HAProxy map file, and one for the certs.

A very common cause of leaks is the use of instance variables (names starting with the @ character)... Are you using any? We have none of these variables.

Thanks in advance!

pierresouchay commented 2 years ago

@MillsyBot Ideally, I would: run each of the 3 templates in a separate instance to pin down which one leaks, and watch the memory figures over time (the prometheus endpoint can help with that).

MillsyBot commented 2 years ago

Here are the results of the 3 different template renderings at startup:

[INFO] First rendering of 1 templates completed in 0.00782881s at 2022-06-17 15:38:42 +0000.
[INFO] File written: WRITTEN[/haproxy/secrets_generator] {:success=>2, :errors=>0, :bytes_read=>4476, :changes=>2, :network_bytes=>179}
[EXEC] Starting process: bash /haproxy/drain_haproxy.sh... on_reload=HUP on_term=TERM, delay between reloads=1s
[EXEC] Starting process: tail -f /dev/null... on_reload=USR2 on_term=USR1, delay between reloads=60.0s
[MEMORY] 2022-06-17 15:38:52 UTC significant RAM Usage detected
[MEMORY] 2022-06-17 15:38:52 UTC Pages  : 268 (diff 15 aka 1/s)
[MEMORY] 2022-06-17 15:38:52 UTC Objects: 73716 (diff -29202 aka -2914/s)
[INFO] First rendering of 1 templates completed in 0.008829001s at 2022-06-17 15:37:55 +0000.
[INFO] File written: WRITTEN[/haproxy/routing.map] {:success=>1, :errors=>0, :bytes_read=>253, :changes=>1, :network_bytes=>253}
[EXEC] Starting process: bash /haproxy/drain_haproxy.sh... on_reload=HUP on_term=TERM, delay between reloads=1s
[EXEC] Starting process: tail -f /dev/null... on_reload=USR2 on_term=USR1, delay between reloads=60.0s
[MEMORY] 2022-06-17 15:38:05 UTC significant RAM Usage detected
[MEMORY] 2022-06-17 15:38:05 UTC Pages  : 268 (diff 15 aka 1/s)
[MEMORY] 2022-06-17 15:38:05 UTC Objects: 73968 (diff -28949 aka -2887/s)
[INFO] File written: WRITTEN[/haproxy/hap.conf] {:success=>2, :errors=>0, :bytes_read=>2594, :changes=>2, :network_bytes=>883}
[EXEC] Starting process: bash /haproxy/drain_haproxy.sh... on_reload=HUP on_term=TERM, delay between reloads=1s
[EXEC] Starting process: tail -f /dev/null... on_reload=USR2 on_term=USR1, delay between reloads=60.0s
[MEMORY] 2022-06-17 15:36:43 UTC significant RAM Usage detected
[MEMORY] 2022-06-17 15:36:43 UTC Pages  : 268 (diff 16 aka 2/s)
[MEMORY] 2022-06-17 15:36:43 UTC Objects: 74038 (diff -28337 aka -2832/s)

Not sure if there is a correlation between the "significant RAM Usage detected" messages and the actual memory usage over time. What we have observed thus far is that the more instances of services there are that need rendering in the template, the faster the memory grows. We are probably leaving some memory allocated between runs. Is there a way to force GC on these instances?
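
(As a side note, plain Ruby does expose GC.start and GC.stat; calling them from a template is not a documented consul-templaterb feature, only a hedged sketch of what forcing a collection between runs could look like.)

  <%-
    # Standard Ruby calls, not a consul-templaterb feature:
    GC.start                 # force a full garbage collection
    stats = GC.stat          # heap counters, e.g. stats[:heap_live_slots]
  -%>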

Thanks for the hint about the prometheus endpoint; I will take a look soon.

pierresouchay commented 2 years ago

@MillsyBot If I remember correctly, RAM usage is output regularly... So leave those 3 instances running over the weekend, and we will see the results and find the culprit.

MillsyBot commented 2 years ago

So we let it run and we pinned it down to the first template. You mentioned earlier: "A very common cause of leaks is the use of instance variables (names starting with the @ character)... Are you using any?" We are scheduling jobs and referencing ENV variables, as in service_tag_filter = ENV['SERVICES_TAG_FILTER'] || 'http'; we have maybe a handful of these. Could this be an issue?

pierresouchay commented 2 years ago

No, ENV[] is unlikely to leak memory :(

Can you maybe share part of the logic?

MillsyBot commented 2 years ago

We use this statement to override the Consul information:

<% unless ENV['DEFAULT_GLOBAL_FILE_PATH'].nil? %>
  <%-
    f = open("#{ENV['GLOBAL_CONFIG_FILE_PATH']}", "r")
    global_config_contents = f.read
    f.close
  -%>
<%= global_config_contents %>
<% else %>
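
(A side note on the snippet above: assuming the same environment variable, the open/read/close sequence could be collapsed into a single standard-Ruby File.read call, which also guarantees the handle is closed.)

  <%-
    # File.read opens, reads, and closes the file in one call (standard Ruby).
    global_config_contents = File.read(ENV['GLOBAL_CONFIG_FILE_PATH'])
  -%>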

We have a handful of statements like these:

    timeout connect <%= ENV['HAPROXY_TIMEOUT_CONNECT'] || "10s" %>
    timeout client <%= ENV['HAPROXY_TIMEOUT_CLIENT'] || "10s" %>
    timeout server <%= ENV['HAPROXY_TIMEOUT_SERVER'] || "60s" %>

Some are of this type, and then we reference these variables:

  route_map_path = ENV['ROUTING_MAP_PATH'] || "/haproxy/routing.map"
  route_map_lookup = ENV['ROUTE_MAP_LOOKUP'] || "req.hdr(host)"
  map_match_type = ENV['MAP_MATCH_TYPE'] || "map"

This is the "big one" we parse through all the tags for the given service and generate a hash. This is hash is then queried on the services for backend info, like if an ACL should be applied or if the backend is proto h2 or not.

  def gen_definition(prefix, tags)
    definition = Hash.new
    if tags.include?("#{prefix}.protocol=https") || tags.include?("#{prefix}.tags=https")
      definition["http_https_or_grpc"] = "https"
    end
    # ( much more processing, but similar style of tag inspection )
    return definition
  end

  services(tag: service_tag_filter).each do |service_name, tags|
    definition = gen_definition(prefix, tags)
    # ...
  end
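
For context, a hedged sketch of how such a definition hash might be consumed when rendering a backend (the HAProxy directives and the prefix handling here are illustrative assumptions, not the actual template):

  <%- services(tag: service_tag_filter).each do |service_name, tags|
        definition = gen_definition(prefix, tags)   # prefix assumed to be set elsewhere
  -%>
  backend <%= service_name %>
      mode http
      <%- if definition["http_https_or_grpc"] == "https" -%>
      default-server ssl verify none
      <%- end -%>
  <%- end -%>
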
pierresouchay commented 2 years ago

A few comments from my phone:

I'll have a more in-depth look tomorrow.

pierresouchay commented 2 years ago

@MillsyBot did you find a way?

MillsyBot commented 2 years ago

We have built some more logic on top of the examples that are in this repo. We do a great deal of "searching" for tags that are defined by users, making this almost a "Load Balancing as a Service" based on Consul-defined tags... e.g. if they have "stickyrouting=true" then we render bits and pieces of the config to match.

We tried to take everything down to just 1 template that only rendered Vault secrets, and we still saw memory creeping up over time. It might be the version of Ruby we are running. In the Dockerfile you have defined Ruby 2.5; is this the version you suggest using for this project?

pierresouchay commented 2 years ago

@MillsyBot We did not use the Vault endpoint much (I mean in very large setups with many endpoints)... I wonder if the issue might be there... Could you try running 2 templates, one with the Vault endpoint and the other with hardcoded data instead, and compare the memory usage?
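
To make that comparison concrete, a hedged sketch (the secret() helper, the Vault path, and the file names are assumptions to be adapted to the real templates):

  # vault_probe.erb (hypothetical name): exercises the Vault endpoint on each run
  <%= secret('secret/haproxy/certs').inspect %>

  # hardcoded_probe.erb (hypothetical name): same shape of output, no Vault call
  <%= { 'certificate' => 'dummy-contents' }.inspect %>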

MillsyBot commented 1 year ago

We decided to test running our binaries inside the container environment that is provided in this repo, and it seems like the leak we are tracking is built into the version we are using. Long and short: your container = no leak, our container = leak. Thanks for keeping the issue open this long!