Closed eagletmt closed 4 years ago
Good news: I did another experiment to see if #171 also fixes the memory leak or not. In our environment, #171 solved the memory leak successfully. I manually applied the patch to one of "w/ async-http" aggregator nodes and it showed steady memory usage.
Very nice graphs.
I've confirmed this issue was fixed in v1.8.3.
When I upgraded td-agent in our production workload to v4.0.1, which bundles fluent-plugin-prometheus v1.8.2, the memory usage started to grow. To investigate further, I divided our aggregator nodes into two groups: w/ async-http and w/o async-http. In aggregator nodes w/o async-http, I ran
sudo /opt/td-agent/bin/fluent-gem uninstall async async-http async-io async-pool
to disable the async implementation in fluent-plugin-prometheus. Aggregator nodes w/ async-http use the default td-agent v4.0.1 package. Our Prometheus instances scrape fluentd metrics from the /aggregated_metrics endpoint at a 15-second interval. The result looks like below: aggregator nodes w/ async-http show increasing memory usage, while aggregator nodes w/o async-http show steady usage.
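For reference, a scrape setup like the one described above could be expressed as a Prometheus scrape config along these lines. This is a sketch, not our actual config: the job name and target address are hypothetical placeholders, and 24231 is the default port of fluent-plugin-prometheus's input plugin.

```yaml
scrape_configs:
  - job_name: fluentd-aggregator        # hypothetical job name
    scrape_interval: 15s                # the 15-second interval mentioned above
    metrics_path: /aggregated_metrics   # endpoint exposed by the aggregator nodes
    static_configs:
      - targets:
          - aggregator.example.internal:24231  # placeholder host; 24231 is the plugin's default port
```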