AltraMayor / gatekeeper

The first open-source DDoS protection system
https://github.com/AltraMayor/gatekeeper/wiki
GNU General Public License v3.0

1.2dev version: Under what circumstances does the Gatekeeper-to-Grantor request strategy fail? #713

Closed ShawnLeung87 closed 5 days ago

ShawnLeung87 commented 1 month ago

We are running the 1.2dev version. Under what circumstances does the Gatekeeper-to-Grantor request strategy fail? Recently, the number of request packets sent to Grantor in our production deployment has been 0, and all packets have been dropped. No abnormal logs are seen; the Gatekeeper log only shows that the request packet count is 0 and that the total packet count equals the dropped packet count.

ShawnLeung87 commented 1 month ago

Is there anything unusual about this information?

LLS/6 2024-09-29 14:50:34 NOTICE 14 log entries were suppressed during the last ratelimit interval
LLS/6 2024-09-29 14:52:44 NOTICE 1 log entries were suppressed during the last ratelimit interval
LLS/6 2024-09-29 15:10:06 NOTICE 3 log entries were suppressed during the last ratelimit interval
DYC/9 2024-09-29 16:20:02 NOTICE 2253 log entries were suppressed during the last ratelimit interval
DYC/9 2024-09-29 16:20:02 DEBUG Client disconnected
DYC/9 2024-09-29 16:20:16 WARNING The reply message length (66228) exceeds the limit
DYC/9 2024-09-29 16:20:16 DEBUG Client disconnected
DYC/9 2024-09-29 16:20:47 WARNING The reply message length (66228) exceeds the limit
DYC/9 2024-09-29 16:20:47 DEBUG Client disconnected
DYC/9 2024-09-29 16:28:23 WARNING The reply message length (66228) exceeds the limit
DYC/9 2024-09-29 16:28:23 DEBUG Client disconnected
LLS/6 2024-09-29 16:41:09 NOTICE 13 log entries were suppressed during the last ratelimit interval

ShawnLeung87 commented 1 month ago

When a SYN flood reaches about 1.2 Gbps, it causes a request exception on a GK core that does not recover on its own; see, for example, the GK/12 core in the log below. Gatekeeper does not output any other exception logs. Could logging be added to capture this kind of exception?

GK/10 2024-09-29 19:21:44 NOTICE Basic measurements [tot_pkts_num = 261504, tot_pkts_size = 52950752, pkts_num_granted = 233078, pkts_size_granted = 51077663, pkts_num_request = 16743, pkts_size_request = 1389218, pkts_num_declined = 11683, pkts_size_declined = 841531, tot_pkts_num_dropped = 11683, tot_pkts_size_dropped = 841531, tot_pkts_num_distributed = 0, tot_pkts_size_distributed = 0, flow_table_occupancy = 91327/250000000=0.0%]
GK/12 2024-09-29 19:21:44 NOTICE Basic measurements [tot_pkts_num = 204384, tot_pkts_size = 34149584, pkts_num_granted = 157174, pkts_size_granted = 30962472, pkts_num_request = 12, pkts_size_request = 1318, pkts_num_declined = 4092, pkts_size_declined = 301435, tot_pkts_num_dropped = 47196, tot_pkts_size_dropped = 3188043, tot_pkts_num_distributed = 2, tot_pkts_size_distributed = 191, flow_table_occupancy = 2090/250000000=0.0%]
GK/13 2024-09-29 19:21:44 NOTICE Basic measurements [tot_pkts_num = 279324, tot_pkts_size = 70155332, pkts_num_granted = 251581, pkts_size_granted = 68355450, pkts_num_request = 16550, pkts_size_request = 1371755, pkts_num_declined = 11193, pkts_size_declined = 781707, tot_pkts_num_dropped = 11193, tot_pkts_size_dropped = 781707, tot_pkts_num_distributed = 0, tot_pkts_size_distributed = 0, flow_table_occupancy = 92126/250000000=0.0%]
GK/11 2024-09-29 19:21:44 NOTICE Basic measurements [tot_pkts_num = 286965, tot_pkts_size = 60032833, pkts_num_granted = 259053, pkts_size_granted = 57889233, pkts_num_request = 16590, pkts_size_request = 1380076, pkts_num_declined = 11322, pkts_size_declined = 1117064, tot_pkts_num_dropped = 11322, tot_pkts_size_dropped = 1117064, tot_pkts_num_distributed = 0, tot_pkts_size_distributed = 0, flow_table_occupancy = 88763/250000000=0.0%]
ShawnLeung87 commented 1 month ago

The following logs appear in grantor

GT/8 2024-09-29 19:29:37 NOTICE 352 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:29:38 NOTICE 362 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:29:38 NOTICE 354 log entries were suppressed during the last ratelimit interval
GT/7 2024-09-29 19:29:42 NOTICE 337 log entries were suppressed during the last ratelimit interval
GT/8 2024-09-29 19:29:43 NOTICE 360 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:29:43 NOTICE 389 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:29:43 NOTICE 369 log entries were suppressed during the last ratelimit interval
GT/7 2024-09-29 19:29:47 NOTICE 353 log entries were suppressed during the last ratelimit interval
GT/8 2024-09-29 19:29:48 NOTICE 334 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:29:48 NOTICE 323 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:29:48 NOTICE 376 log entries were suppressed during the last ratelimit interval
GT/7 2024-09-29 19:29:52 NOTICE 379 log entries were suppressed during the last ratelimit interval
GT/8 2024-09-29 19:29:53 NOTICE 400 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:29:53 NOTICE 375 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:29:53 NOTICE 364 log entries were suppressed during the last ratelimit interval
GT/7 2024-09-29 19:29:58 NOTICE 102 log entries were suppressed during the last ratelimit interval
GT/8 2024-09-29 19:29:58 NOTICE 106 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:29:58 NOTICE 83 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:30:00 NOTICE 111 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:30:05 NOTICE 1 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:30:05 NOTICE 2 log entries were suppressed during the last ratelimit interval
GT/7 2024-09-29 19:30:08 NOTICE 2 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:30:10 NOTICE 1 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:30:15 NOTICE 1 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:30:21 NOTICE 1 log entries were suppressed during the last ratelimit interval
GT/8 2024-09-29 19:30:21 NOTICE 1 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:30:23 NOTICE 3 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:30:37 NOTICE 5 log entries were suppressed during the last ratelimit interval
GT/8 2024-09-29 19:30:38 NOTICE 3 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:30:39 NOTICE 2 log entries were suppressed during the last ratelimit interval
GT/7 2024-09-29 19:30:41 NOTICE 5 log entries were suppressed during the last ratelimit interval
GT/8 2024-09-29 19:30:50 NOTICE 3 log entries were suppressed during the last ratelimit interval
GT/7 2024-09-29 19:30:53 NOTICE 2 log entries were suppressed during the last ratelimit interval
AltraMayor commented 1 month ago

There's not enough information here for me to make any hypothesis. You need to describe what you built, how you tested it, and be more specific; for example, I don't know what you mean by "it will cause a core request exception and cannot be self-repaired."

Below are just comments on the logs:

  1. Log entries like NOTICE N log entries were suppressed during the last ratelimit interval mean that more log entries are being generated than the configuration allows to be written. Keeping the amount of log low to moderate is necessary to keep disk I/O low; too much disk I/O will reduce the capacity of Gatekeeper to process packets. (A configuration sketch follows this list.)
  2. Log entries like DYC/9 2024-09-29 16:20:02 DEBUG Client disconnected are normal. This log entry should only be seen in test environments since it's DEBUG level.
  3. The log entry DYC/9 2024-09-29 16:20:16 WARNING The reply message length (66228) exceeds the limit means that the request passed to the Dynamic Configuration block returned more than 64KiB of data and it was truncated. You may need to edit your request to avoid this situation.
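
To illustrate item 1, here is a minimal sketch of where the rate limit could be relaxed in a block's Lua configuration (e.g., lua/gk.lua). The field names log_ratelimit_interval_ms and log_ratelimit_burst are assumptions based on the sample configs; confirm them against your checkout before editing.

```lua
-- Minimal sketch, assuming the field names used in the sample block
-- configs (e.g., lua/gk.lua); verify them in your checkout first.
local function relax_gk_log_ratelimit(gk_conf)
	-- At most log_ratelimit_burst entries are written per
	-- log_ratelimit_interval_ms; entries above the burst are counted and
	-- reported as "N log entries were suppressed during the last
	-- ratelimit interval".
	gk_conf.log_ratelimit_interval_ms = 5000
	gk_conf.log_ratelimit_burst = 20
	return gk_conf
end
```

Raising the burst makes otherwise-suppressed entries visible at the cost of more disk I/O, so keep it moderate in production.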
ShawnLeung87 commented 1 month ago

At present, the behavior in our production deployment is that the number of requests Gatekeeper sends to Grantor drops to 0, which can be seen from the pkts_num_request field of the "Basic measurements" log. The exception appears within 15 minutes of running time. This began in September, and we did not change the code during that period. We also suspected hardware problems, but several Gatekeeper servers show the behavior at the same time, which basically rules out hardware. There are no segmentation faults in the Gatekeeper or Grantor logs, and the servers' kernel and system logs are also normal. This situation cannot be diagnosed from external logs alone. Is it possible to capture exceptions in the communication between Gatekeeper and Grantor?

  1. A BPF problem has been ruled out: the project's default grantedv2 policy was used for testing, and the same situation still occurs.
  2. A system-update problem has been ruled out: the operating system was reinstalled, and the same situation still occurs.
  3. A hardware problem has been ruled out: hardware diagnostics are normal, and since several Gatekeeper servers show the exception at the same time, the probability of a hardware fault is very low.
  4. After the error in issue #703 occurs, this situation becomes frequent.
  5. After restarting Gatekeeper, it runs normally for 15 minutes; then the number of pkts_num_request per core starts to decrease until it drops to 0.
AltraMayor commented 1 month ago

Are you running RC2 on Ubuntu 24.04? Are you running Gatekeeper as it is present in our repository, that is, without patching it?

Have you tracked a testing flow when the problem occurs to see what the state of the flow is?

What does BGP see during the failure condition? Have you checked the FIB entries of the type GK_FWD_GRANTOR during the failure?

I wonder if it is possible to capture exceptions between gatekeeper and grantor communication?

If a Gatekeeper server is not sending packets to Grantor servers, what do you expect to capture?

4. After the error in issue #703 occurs, this situation becomes frequent.

That issue reports a kernel issue. If the kernel runs into trouble, there is nothing an application can do to guarantee progress.

5. After restarting Gatekeeper, it runs normally for 15 minutes; then the number of pkts_num_request per core starts to decrease until it drops to 0.

Does pkts_num_request go to zero even when the kernel issue above does not come up?

Post the log entries with "Basic measurements" when pkts_num_request is zero. Knowing all the other measurements may provide a clue.

ShawnLeung87 commented 1 month ago

The Ubuntu 24.04 servers use the new code (the current version), and the Ubuntu 20.04 servers use the production code (the July version). This problem occurs in all of these tests.

  1. When the problem occurred, I used gkctl to query the FIB.
  2. I want to know whether Gatekeeper is communicating normally with the Grantor GGU port.
  3. When pkts_num_request is 0 in "Basic measurements", all packets are dropped:
GK/10 2024-10-02 11:16:05 NOTICE Basic measurements [tot_pkts_num = 1960, tot_pkts_size = 151591, pkts_num_granted = 2, pkts_size_granted = 196, pkts_num_request = 0, pkts_size_request = 0, pkts_num_declined = 0, pkts_size_declined = 0, tot_pkts_num_dropped = 1957, tot_pkts_size_dropped = 151290, tot_pkts_num_distributed = 1, tot_pkts_size_distributed = 105, flow_table_occupancy = 2/250000000=0.0%]
AltraMayor commented 1 month ago
  1. When the problem occurred, I used gkctl to query the FIB.

What did you get?

  2. I want to know whether Gatekeeper is communicating normally with the Grantor GGU port.

What's normal depends on the configuration and the flows arriving. I understand what you are asking, but the answer to your question may also answer the whole problem here.

  • When pkts_num_request is 0 in "Basic measurements", all packets are dropped: GK/10 2024-10-02 11:16:05 NOTICE Basic measurements [tot_pkts_num = 1960, tot_pkts_size = 151591, pkts_num_granted = 2, pkts_size_granted = 196, pkts_num_request = 0, pkts_size_request = 0, pkts_num_declined = 0, pkts_size_declined = 0, tot_pkts_num_dropped = 1957, tot_pkts_size_dropped = 151290, tot_pkts_num_distributed = 1, tot_pkts_size_distributed = 105, flow_table_occupancy = 2/250000000=0.0%]

The fact that the flow table is almost empty, packets are arriving, no request packets are coming from new flows, and almost all packets are being dropped suggests a configuration issue in the FIB or BGP.

ShawnLeung87 commented 1 month ago

The FIB check and the BGP sessions are normal, and routes are learned normally, so FIB and BGP problems can be ruled out.

ShawnLeung87 commented 1 month ago

I added 1128 prefixes to the FIB, but only 401 entries were returned when I queried the FIB. Is this normal, or is the return value too long and being truncated?

AltraMayor commented 1 month ago

You don't mention how you added the 1128 prefixes (e.g., did you individually verify that each prefix was successfully added?), nor how you measured the 401 prefixes you found. Therefore, I have to guess that the request's output was truncated.

Have you tried the script gkctl/scripts/summarize_fibs.lua?

ShawnLeung87 commented 1 month ago

I wrote the table returned by show_fib.lua to a text file, and it contains only 401 prefix entries, probably because the output is too long and was truncated. The FIB count reported by summarize_fibs.lua is correct. I want to query an exact IP prefix to confirm whether that prefix is missing from the FIB when the exception occurs, but the output of show_fib.lua is an incomplete FIB dump and does not contain the IP prefix I am testing.

AltraMayor commented 1 month ago

I wrote the script filter_fib.txt to filter any prefix that starts with "10.7.". You can edit this script to filter whatever you want from the FIB. You can read the code of the function lua/gatekeeper/dylib.lua:print_fib_dump_entry() to find out how to access the other fields of a FIB entry. My script is a variation of the script gkctl/scripts/show_fib_grantor.lua.
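
For readers without the attachment, a rough sketch of such a filter is below. It is not a drop-in script: the FIB-dump boilerplate and the real entry fields must be taken from gkctl/scripts/show_fib_grantor.lua and lua/gatekeeper/dylib.lua:print_fib_dump_entry(); entry_prefix_string() and for_each_fib_entry() are hypothetical placeholders for that code.

```lua
-- Illustrative sketch only. Copy the real FIB iteration and field names from
-- gkctl/scripts/show_fib_grantor.lua and
-- lua/gatekeeper/dylib.lua:print_fib_dump_entry().
-- entry_prefix_string() and for_each_fib_entry() are hypothetical stand-ins.

local wanted = "10.7."  -- keep only prefixes that start with this string

local function keep_if_match(entry, acc)
	local prefix = entry_prefix_string(entry)  -- hypothetical helper
	if prefix:sub(1, #wanted) == wanted then
		acc[#acc + 1] = prefix
	end
	return acc
end

-- for_each_fib_entry() stands in for the iteration done in
-- show_fib_grantor.lua; it calls keep_if_match() once per FIB entry.
local matches = for_each_fib_entry(keep_if_match, {})

-- Dynamic configuration scripts return a string to gkctl; returning only the
-- matching prefixes keeps the reply well under the 64 KiB limit noted above.
return table.concat(matches, "\n")
```

Filtering on the server side also avoids the truncated replies seen with show_fib.lua.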

I encourage you to flesh out your testbed and work to reproduce this problem there. Investigating this kind of problem on a testbed is a lot less stressful. The effort to flesh out the testbed may seem like a lot, but it will pay dividends when dealing with other issues, testing future Gatekeeper versions, or validating configuration changes.

ShawnLeung87 commented 3 weeks ago

The abnormal behavior from September has not recurred since October 10; operation has been stable since then. On October 10, a redundant router in the backbone (it is not connected to Gatekeeper, but data packets may pass through it) was repaired. I am not sure whether the problem was caused by the faulty card in this router, because when packets do not go through Gatekeeper, all traffic is normal. After the card was repaired, we ran various test cases, and packets passing through Gatekeeper were normal. I still need to spend time observing to find the specific cause.

ShawnLeung87 commented 3 weeks ago

I have now upgraded production to the latest version, and the operating system has also been replaced with the new release. From October 10 (after the backbone router was repaired) until now, it has been running well. In September, the new version also had the problem that packets could not reach the backend destination IP. This abnormal behavior has not been reproduced for the time being. I will continue to observe and report progress.

ShawnLeung87 commented 6 days ago

On November 10, the problem reappeared. I made the following adjustments to restore service. Hyperthreading was enabled, so each physical core's performance was split across two logical cores. In addition, the CPU was set to power-saving mode, which severely reduced the core frequency; this slowed instruction execution, caused frequent interrupts, and led to the Gatekeeper abnormalities.

  1. Disable CPU hyperthreading in the BIOS.
  2. Set the CPU to performance mode in the BIOS.
  3. Configure GRUB to isolate the DPDK cores from the kernel so that applications scheduled by the kernel cannot preempt them:
GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=6-20 nohz_full=6-20 rcu_nocbs=6-20 intel_iommu=on iommu=pt default_hugepagesz=1G hugepagesz=1G hugepages=256 intel_idle.max_cstate=0 processor.max_cstate=0 intel_pstate=disable rcu_nocb_poll audit=0 nosoftlockup mce=ignore_ce"
ShawnLeung87 commented 5 days ago

The CPU interrupt problem caused by suboptimal server CPU settings has been resolved. Closing this issue.