apache / trafficserver

Apache Traffic Server™ is a fast, scalable and extensible HTTP/1.1 and HTTP/2 compliant caching proxy server.
https://trafficserver.apache.org/
Apache License 2.0

Segfault once maxRecords is reached #10892

Open mat1010 opened 11 months ago

mat1010 commented 11 months ago

We are running Traffic Server 9.2.3 and ran into an issue where it reached the maximum number of stats and records, which is set by maxRecords.

The reason is that we also run podman containers on the same server. Every new container and every container restart changes the virtual network interfaces: a new container gets a new interface, and a restarted container removes its current interface and gets a new one with a new name. Every interface creates a new set of records:

plugin.system_stats.net.vethfb0aaa00.speed 10000
plugin.system_stats.net.vethfb0aaa00.collisions 0
plugin.system_stats.net.vethfb0aaa00.multicast 0
plugin.system_stats.net.vethfb0aaa00.rx_bytes 71171126
plugin.system_stats.net.vethfb0aaa00.rx_compressed 0
plugin.system_stats.net.vethfb0aaa00.rx_crc_errors 0
plugin.system_stats.net.vethfb0aaa00.rx_dropped 0
plugin.system_stats.net.vethfb0aaa00.rx_errors 0
plugin.system_stats.net.vethfb0aaa00.rx_fifo_errors 0
plugin.system_stats.net.vethfb0aaa00.rx_frame_errors 0
plugin.system_stats.net.vethfb0aaa00.rx_length_errors 0
plugin.system_stats.net.vethfb0aaa00.rx_missed_errors 0
plugin.system_stats.net.vethfb0aaa00.rx_nohandler 0
plugin.system_stats.net.vethfb0aaa00.rx_over_errors 0
plugin.system_stats.net.vethfb0aaa00.rx_packets 983190
plugin.system_stats.net.vethfb0aaa00.tx_aborted_errors 0
plugin.system_stats.net.vethfb0aaa00.tx_bytes 133071338
plugin.system_stats.net.vethfb0aaa00.tx_carrier_errors 0
plugin.system_stats.net.vethfb0aaa00.tx_compressed 0
plugin.system_stats.net.vethfb0aaa00.tx_dropped 0
plugin.system_stats.net.vethfb0aaa00.tx_errors 0
plugin.system_stats.net.vethfb0aaa00.tx_fifo_errors 0
plugin.system_stats.net.vethfb0aaa00.tx_heartbeat_errors 0
plugin.system_stats.net.vethfb0aaa00.tx_packets 1912343
plugin.system_stats.net.vethfb0aaa00.tx_window_errors 0
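
To see how many records each interface accounts for, a stats dump in the name/value form shown above can be grouped by interface name. This is a minimal sketch, not part of Traffic Server: the function name `records_per_interface` and the sample lines are assumptions made for illustration, and it only assumes that every system_stats network record is named `plugin.system_stats.net.<interface>.<stat>`.

```python
from collections import Counter

# Prefix shared by the per-interface records listed above.
PREFIX = "plugin.system_stats.net."


def records_per_interface(lines):
    """Count system_stats network records per interface name.

    `lines` is any iterable of "name value" strings, such as the
    listing above. Lines with other prefixes are ignored.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields or not fields[0].startswith(PREFIX):
            continue
        rest = fields[0][len(PREFIX):]
        # The stat name is the last dot-separated component, so an
        # interface name that itself contains a dot is kept whole.
        iface, sep, _stat = rest.rpartition(".")
        if sep and iface:
            counts[iface] += 1
    return counts


# Hypothetical sample input: two interfaces plus one unrelated record.
sample = [
    "plugin.system_stats.net.vethfb0aaa00.speed 10000",
    "plugin.system_stats.net.vethfb0aaa00.rx_bytes 71171126",
    "plugin.system_stats.net.veth12345678.tx_bytes 42",
    "proxy.process.http.total_requests 7",
]
per_iface = records_per_interface(sample)
```

Applied to a full dump, interfaces that no longer exist on the host but still appear in the result are the records that keep accumulating.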

This would not be an issue if we could either purge the records themselves (not only their values) from time to time without restarting trafficserver, or if creating new stats simply failed with a corresponding log message once the limit is reached. Unfortunately, once the value of maxRecords is reached, trafficserver segfaults and does not recover by itself: the traffic_manager process is not killed, so systemd cannot handle the failure with the Restart=on-failure directive.

Is this a known issue, or is this the expected behaviour? Is it safe to increase the maxRecords limit to a huge number? What might be the drawbacks?

I attached the crash logs from systemd and trafficserver: systemd.log, crash-2023-11-28-164715.log

Thanks in advance

mat1010 commented 11 months ago

Setting --maxRecords leads to other issues and causes the system_stats and remap_stats plugins to stop reporting at all.

ezelkow1 commented 11 months ago

Have you tried -m instead of --maxRecords, i.e. ExecStart=/opt/trafficserver/bin/traffic_manager -m 4096? We hit the max a couple of months back, and yeah, you start to see weird things happen and things breaking, but I used -m in our systemd script to increase the limit and it's been happy with that. Just throwing it out there, maybe some others will have better input :)
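
For a rough idea of how long a raised limit lasts under interface churn, the headroom can be divided by the per-interface cost. This is a back-of-the-envelope sketch under two assumptions: each new interface registers the 25 records listed in the original post, and records are never freed while Traffic Server is running. The function name and the figure of 1500 records already in use are hypothetical.

```python
# Number of plugin.system_stats.net.<interface>.* records per interface,
# counted from the listing in the original post.
RECORDS_PER_INTERFACE = 25


def interfaces_until_limit(max_records, records_in_use,
                           per_interface=RECORDS_PER_INTERFACE):
    """Estimate how many new interfaces fit before max_records is hit."""
    headroom = max_records - records_in_use
    # Never report a negative count if the limit is already exceeded.
    return max(0, headroom // per_interface)


# With -m 4096 and a hypothetical 1500 records already registered:
remaining = interfaces_until_limit(4096, 1500)
print(remaining)  # 103
```

Because every container start or restart creates one new interface, this number is also an estimate of how many container starts the limit can absorb before the next Traffic Server restart.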

mat1010 commented 11 months ago

Thank you @ezelkow1. Your comment pointed me in the right direction. Both -m and --maxRecords seem to work. The issue was that my value was too high and my test was invalid. After each modification I checked for the existence of remap_stats in the stats output, but those are only populated once a request hits a remap rule. Since the server I was testing with is not active in the load balancing right now, remap_stats were never populated. Testing with system_stats worked.

bryancall commented 11 months ago

@mat1010 Did you get this issue resolved? If so, can you please close it?

mat1010 commented 11 months ago

@bryancall The main issue, mentioned in my first post, still exists. I'm not sure whether segfaulting should be the expected result when the maxRecords limit is reached.