MarcJHuber / event-driven-servers

A collection of event-driven servers (currently: tac_plus, tac_plus-ng, ftpd, tcprelay)
https://www.pro-bono-publico.de/projects

tac_plus Resets Requests When Minimum Instances Are Too Low Due to Delayed Scaling Response #121

Open codesica opened 1 month ago

codesica commented 1 month ago

Hello Marc,

Problem Description:

After running tac_plus stably for about a week with approximately 6,000+ NAS devices, we observed that tac_plus started resetting requests. At that time, the number of TCP persistent connections was around 300, and there was no significant scaling up of instances.

Steps to Reproduce:

Resolution:

Conclusion:

We suspect that when tac_plus is configured with a small number of instances, the system dynamically adjusts the number of worker processes based on request volume and the number of connections. However, when the minimum number of instances is set too low, the scaling mechanism may not respond promptly.

Screenshots and Configuration:

netstat -anp | grep tac_plus | wc -l
(screenshot: netstat connection count)
    spawn = {
        instances min = 1
        instances max = 100
    }
MarcJHuber commented 1 month ago

Hi,

thanks for reporting. Increasing the number of instances looks like the right thing to do; great that it fixed the issue for you.

A connection reset (or close) would indicate that the overload code kicked in. However, that seems unlikely as the number of connections didn't exceed "users max" * "instances max".
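
Just to put rough numbers on that: even with a fairly low "users max" of, say, 40 connections per worker (an example value, not necessarily what your configuration resolves to), the overload threshold with your "instances max = 100" would be

    40 * 100 = 4,000 connections

which is far above the ~300 connections you observed.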

Are you running a recent version? I actually modified possibly relevant code as a result of my own load tests about two weeks ago (with tac_plus-ng, but that particular code is shared between tac_plus and -ng), and scaling seemed to work fine.

Cheers,

Marc

codesica commented 1 month ago

Thank you for your response. We are using tac_plus, not tac_plus-ng, and our version is older (definitely earlier than the updates from two weeks ago). Is it possible that in this older version, when the number of instances is set to 1 and the number of persistent connections increases, the instances cannot scale up on demand?

After we increased the initial number of instances, this issue no longer occurred, although previously it happened sporadically. Could you shed some light on the logic of the related code in this area? Is it possible that when there is only one instance, the scheduling isn’t aggressive enough, leading to misjudgments that prevent the scaling of instances? Or could there be issues with instance idling and the cleanup process?

codesica commented 1 month ago

Additionally, we are currently running in a production environment, so upgrading to the latest version is not convenient. Can increasing the number of instances reliably solve this issue? We need a definitive answer. Are there any potential risks remaining?

MarcJHuber commented 1 month ago

Hi,

you can find a description of the scaling algorithm at https://projects.pro-bono-publico.de/event-driven-servers/doc/spawnd.html#AEN432 -- basically, "instances min" worker processes are started initially, and additional ones are started until "instances max" is reached and all worker processes have "users min" connections. Once that point is reached, connections are distributed equally across all worker processes.

So yes, with 6,000 devices, single connections, and 3 connections per device (authentication, authorization, accounting), you're looking at a worst case of about 18,000 concurrent connections in total, or 20,000 with some safety margin. I'd just try "instances min = 50 instances max = 50 users min = 400 users max = 800".
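
In the syntax of your snippet above, that would look roughly like this:

    spawn = {
        instances min = 50
        instances max = 50
        users min = 400
        users max = 800
    }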

In reality, and possibly with an idle timeout configured, the number of connections you're seeing is likely to be much lower, but with an outdated and possibly misbehaving tac_plus, pre-forking seems to take you to the safe side.

The processes change their title to show the number of connections used. You can easily check with ps how many connections each process handles and take that as a basis for sane min and max values.
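
For example, something along these lines will show the per-process connection counts:

    ps ax | grep "tac_plus:"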

Cheers,

Marc

codesica commented 1 month ago

Hi Marc,

Thank you for the detailed explanation. I reviewed the scaling algorithm in the documentation you provided.

I also came across the following note:

If the sticky feature is enabled, spawnd will try to assign connections to server processes based on the remote IP address of the peer. Please note that this will not work in combination with HAProxy.

Currently, we are forwarding connections to tac_plus using the nginx stream module with the PROXY protocol v2 enabled, and we have set accept haproxy = yes. Could this configuration be contributing to the issue?

Best regards,

MarcJHuber commented 1 month ago

Hi,

sticky mode is off by default, so a correlation with your issue seems unlikely. Even if sticky mode were enabled, it would just fall back to non-sticky behaviour on overload. Sticky mode isn't compatible with HAProxy, but that shouldn't matter here. I honestly don't know whether there could be a problem with earlier versions, so I'd just git pull the current code, install it in a lab environment and then check whether connection handling is fine.
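
Roughly, and assuming a standard build (please check the repository documentation for the exact steps on your platform):

    git clone https://github.com/MarcJHuber/event-driven-servers.git
    cd event-driven-servers
    ./configure
    make
    sudo make install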

Cheers,

Marc

codesica commented 1 month ago

Hi Marc,

Thank you for the additional information.

We are currently using the version downloaded via wget from this path: http://www.pro-bono-publico.de/projects/src/DEVEL.tar.bz2. How can we compare this version with the latest one? If we need to upgrade to the latest version, how can we achieve a smooth upgrade, especially since we’re now running in a production environment? How can we avoid potential issues? Could you please provide more explicit guidance regarding version updates?

Best regards,

MarcJHuber commented 1 month ago

Hi,

"tac_plus -v" will output version information. There's currently no tar.bz distribution file, this will jus

I care for backwards compatibility, so older configurations should just a) parse correctly and b) work. You can check configurations before installing by running

./build/*/fakeroot/usr/local/sbin/tac_plus -P path-to-your-config
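
To compare versions, you can run both the installed and the freshly built binary with -v (the installed path below is just an example, adjust to your install prefix):

    /usr/local/sbin/tac_plus -v
    ./build/*/fakeroot/usr/local/sbin/tac_plus -v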

Cheers,

Marc

codesica commented 1 month ago

Hi Marc,

Here is the output when checking the service status:

tac_plus.service - TACACS+ Service
     Loaded: loaded (/etc/systemd/system/tac_plus.service; enabled>
     Active: active (running) since Mon 2024-10-21 12:26:04 CST; 2>
    Process: 2930231 ExecReload=/bin/kill -HUP $MAINPID (code=exit>
   Main PID: 2887290 (tac_plus)
      Tasks: 666 (limit: 201948)
     Memory: 567.4M
     CGroup: /system.slice/tac_plus.service
             ├─ 2845493 "tac_plus: 1 connection" "" "" "" "" "" "">
             ├─ 2850559 "tac_plus: 4 connections" "" "" "" "" "" ">
             ├─ 2852230 /usr/local/bin/tac_plus_plugin_v1
             ├─ 2852231 /usr/local/bin/tac_plus_plugin_v1
             ├─ 2852232 /usr/local/bin/tac_plus_plugin_v1
             ├─ 2852233 /usr/local/bin/tac_plus_plugin_v1
             ├─ 2852338 /usr/local/bin/tac_plus_plugin_v1
             ├─ 2852346 /usr/local/bin/tac_plus_plugin_v1
             ├─ 2852353 /usr/local/bin/tac_plus_plugin_v1
             ├─ 2853767 /usr/local/bin/tac_plus_plugin_v1
             ├─ 2853768 /usr/local/bin/tac_plus_plugin_v1
             ├─ 2853769 /usr/local/bin/tac_plus_plugin_v1
             ├─ 2853770 /usr/local/bin/tac_plus_plugin_v1
             ├─ 2853792 /usr/local/bin/tac_plus_plugin_v1

As shown, when checking the service status, we have the main tac_plus process and worker processes, and we can see the number of connections for each process. There are also multiple Mavis backend tac_plus_plugin_v1 instances.

My questions are:

•   What is the scaling relationship between these processes?
•   Why does one tac_plus worker process create multiple Mavis backend processes?

I would appreciate any insights you can provide.

Best regards,

MarcJHuber commented 1 month ago

Hi,

regarding the process relationship: a single master process accepts connections and forwards them to one of its worker processes. I've already described the algorithm behind that in some depth above.

About the MAVIS backend processes: Any worker process using the "external" MAVIS module has a number of child processes (childs min ... childs max) to interface to a compliant backend program, which apparently is your closed-source "tac_plus_plugin_v1". Please don't expect me to support that.
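
Schematically, such a backend is wired into a worker's configuration along these lines (the option names are the ones mentioned above; the values and the exec path are only taken from your status output as an illustration):

    mavis module = external {
        childs min = 4
        childs max = 16
        exec = /usr/local/bin/tac_plus_plugin_v1
    }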

Cheers,

Marc

codesica commented 1 month ago

Hi Marc,

When the tac_plus connection count is extremely high—for example, we see entries like 883596 "tac_plus: 2013 connections"—and we reload the service, it takes about 5 minutes for these connections to be released, or they are released very slowly. How can we avoid this issue?

Additionally, even though the tac_plus process and TCP port 49 are still active, the service is unable to handle TACACS+ requests. Normally, NAS devices can switch between primary and secondary servers through configuration, but in this case authentication simply fails: because the port still accepts connections, the NAS doesn't detect a network timeout or other failure, so there is no proper failover.

Is there a way to detect such problems in advance? For example, after an overload, can we stop the process or close the port to avoid a “zombie” state? What are some effective methods to handle this situation?

Best regards,

MarcJHuber commented 1 month ago

Hi,

there's really no way around upgrading to the recent code. There's a fair chance that this would fix the issue you're seeing.

Yes, the only alternative I can see is to kill -9 the master process and then restart it.

Please set up the current version in a lab environment and give it a try. This is a voluntary project with limited resources, and I simply don't have the time to support arbitrary older versions.

Thanks,

Marc

codesica commented 1 month ago

Hi Marc,

We are now encountering a new issue:

1.  We are using the latest code.
2.  The initial number of instances is set to min=5 and max=100.
3.  During peak periods, over approximately one hour, the total number of authentication, authorization, and accounting requests is around 16,000.
4.  After that, the CPU usage of the 5 tac_plus instances remains at 100%, and even after 3 hours, it has not recovered.
5.  However, it can still handle new requests without any blockage.
6.  What could be the possible reasons for this?

I have found some related configuration parameters, such as instance numbers, user numbers, retire limits, and timeouts. How can we perform systematic tuning using these settings?

•   instances min and instances max
•   users min and users max
•   retire limit 
•   retire timeout 
•   connection timeout 
•   context timeout 

Our main requirement is to run stably, and we have scheduled tac_plus to restart every night. How can we make the system more stable and robust?

I would appreciate any guidance on optimizing these configurations to improve stability.
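
For reference, here is roughly how I currently understand these parameters fit together (the values are placeholders, not our production settings, and I am not certain that every directive sits in the correct section):

    id = spawnd {
        listen = { port = 49 }
        spawn = {
            instances min = 5
            instances max = 100
            users min = 400
            users max = 800
        }
    }
    id = tac_plus {
        retire limit = 10000       # worker exits after this many connections
        retire timeout = 3600      # ... or after this many seconds
        connection timeout = 600
        context timeout = 3600
        # (rest of our tac_plus configuration)
    }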

Best regards,

MarcJHuber commented 1 month ago

Hi,

I've tried to reproduce the issue you're seeing but failed to do so. Just to confirm, are you testing just with concurrent TCP connections or actual TACACS+ connections? Also, what are your "users min/max" values? Is this on Linux or something else? The default event mechanism for Linux is epoll, on the BSDs it's kqueue.

Also, you can configure a "retire limit" (or retire timeout, if that's more suitable) to cause a worker process to terminate voluntarily after the specified number of connections (or seconds).

If one of the worker processes goes to 100% CPU again, please try capturing a debug backtrace (gdb or lldb -p <pid>, then "bt") or an strace -p <pid>. This looks very much like an issue with the event loop.
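
For example (replace <pid> with the PID of the spinning worker process):

    # attach with gdb, grab a backtrace, then detach again
    gdb -p <pid>
    (gdb) bt
    (gdb) detach

    # or watch its system calls for a while
    strace -tt -p <pid>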

Thanks,

Marc

codesica commented 1 month ago
  1. Stress Testing:

    • All requests are TACACS protocol requests.
    • Performed stress testing using the following command:
      tac_perf -username test0001 -password xxxx -address 10.10.10.10:49 -network tcp -secret xxxxx -count 2000 -concurrent 100 -timeout 10
  2. System Call Analysis:

    • Used strace -p to monitor the tac_plus processes.
    • Observed a large number of epoll_wait loops, indicating intensive I/O event polling.
  3. Configuration Effectiveness:

    • Verified that retire limit and timeout settings are effective through testing.
    • Question: Are there any other side effects that might be influencing the high CPU usage and slow connection releases despite these settings being effective?
MarcJHuber commented 1 month ago

Hi,

thanks again for reporting. I was guessing that there'd be an epoll() loop; I've seen that in earlier versions, mostly while testing. I don't know the reason for it, but https://github.com/MarcJHuber/event-driven-servers/commit/f9d971b1fad230400dc412ae9a238b9a1ba85039 might eventually help.

The "retire" statements just abandon a worker process to avoid memory or CPU hogs. What you're seeing is an unidentified bug. "retire" should have no side effects.

I did a quick search for the "tac_perf" program you mentioned; is it publicly available?

Thanks

Marc

codesica commented 1 month ago

Hi Marc,

Thank you for the additional information.

Service Restart and Reload Performance:

codesica commented 1 month ago
(screenshot: service restart/reload timing)
MarcJHuber commented 1 month ago

Hi,

regarding reload vs. restart: reload (via SIGHUP) causes the daemon to close the connections to its worker processes and then restart itself from scratch. Restart likely uses a SIGTERM, which causes the daemon to exit; it then needs to be restarted externally. The worker processes will continue to serve their existing connections, but new connections will obviously require the master process to be up again.
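
With your systemd unit, that corresponds to roughly the following (the SIGHUP comes from the ExecReload line shown in your status output):

    systemctl reload tac_plus      # ExecReload sends SIGHUP to the master process
    systemctl restart tac_plus     # stops the service (SIGTERM) and starts it again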

I identified a possible deadlock situation a couple of days ago, with both master and worker processes stalling in sendmsg(). This should be ok now. I don't know whether it's related to any of the issues you're seeing.

Cheers,

Marc

MarcJHuber commented 1 month ago

Hi,

thanks for offering your tac_perf tool for testing. While I'd obviously prefer an open source variant, I've failed to reproduce the epoll loop issue with my own (Perl-based) performance testing environment, so it would be great to give your tac_perf tool a try.

Cheers,

Marc