Closed. ShawnLeung87 closed this issue 5 days ago.
Is there anything unusual about this information?
LLS/6 2024-09-29 14:50:34 NOTICE 14 log entries were suppressed during the last ratelimit interval
LLS/6 2024-09-29 14:52:44 NOTICE 1 log entries were suppressed during the last ratelimit interval
LLS/6 2024-09-29 15:10:06 NOTICE 3 log entries were suppressed during the last ratelimit interval
DYC/9 2024-09-29 16:20:02 NOTICE 2253 log entries were suppressed during the last ratelimit interval
DYC/9 2024-09-29 16:20:02 DEBUG Client disconnected
DYC/9 2024-09-29 16:20:16 WARNING The reply message length (66228) exceeds the limit
DYC/9 2024-09-29 16:20:16 DEBUG Client disconnected
DYC/9 2024-09-29 16:20:47 WARNING The reply message length (66228) exceeds the limit
DYC/9 2024-09-29 16:20:47 DEBUG Client disconnected
DYC/9 2024-09-29 16:28:23 WARNING The reply message length (66228) exceeds the limit
DYC/9 2024-09-29 16:28:23 DEBUG Client disconnected
LLS/6 2024-09-29 16:41:09 NOTICE 13 log entries were suppressed during the last ratelimit interval
When a SYN attack reaches 1.2G, it causes a core request exception that does not recover on its own; see, for example, the GK/12 core in the log below. Gatekeeper outputs no other exception logs. Can I add logging to capture the exception?
GK/10 2024-09-29 19:21:44 NOTICE Basic measurements [tot_pkts_num = 261504, tot_pkts_size = 52950752, pkts_num_granted = 233078, pkts_size_granted = 51077663, pkts_num_request = 16743, pkts_size_request = 1389218, pkts_num_declined = 11683, pkts_size_declined = 841531, tot_pkts_num_dropped = 11683, tot_pkts_size_dropped = 841531, tot_pkts_num_distributed = 0, tot_pkts_size_distributed = 0, flow_table_occupancy = 91327/250000000=0.0%]
GK/12 2024-09-29 19:21:44 NOTICE Basic measurements [tot_pkts_num = 204384, tot_pkts_size = 34149584, pkts_num_granted = 157174, pkts_size_granted = 30962472, pkts_num_request = 12, pkts_size_request = 1318, pkts_num_declined = 4092, pkts_size_declined = 301435, tot_pkts_num_dropped = 47196, tot_pkts_size_dropped = 3188043, tot_pkts_num_distributed = 2, tot_pkts_size_distributed = 191, flow_table_occupancy = 2090/250000000=0.0%]
GK/13 2024-09-29 19:21:44 NOTICE Basic measurements [tot_pkts_num = 279324, tot_pkts_size = 70155332, pkts_num_granted = 251581, pkts_size_granted = 68355450, pkts_num_request = 16550, pkts_size_request = 1371755, pkts_num_declined = 11193, pkts_size_declined = 781707, tot_pkts_num_dropped = 11193, tot_pkts_size_dropped = 781707, tot_pkts_num_distributed = 0, tot_pkts_size_distributed = 0, flow_table_occupancy = 92126/250000000=0.0%]
GK/11 2024-09-29 19:21:44 NOTICE Basic measurements [tot_pkts_num = 286965, tot_pkts_size = 60032833, pkts_num_granted = 259053, pkts_size_granted = 57889233, pkts_num_request = 16590, pkts_size_request = 1380076, pkts_num_declined = 11322, pkts_size_declined = 1117064, tot_pkts_num_dropped = 11322, tot_pkts_size_dropped = 1117064, tot_pkts_num_distributed = 0, tot_pkts_size_distributed = 0, flow_table_occupancy = 88763/250000000=0.0%]
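For anyone inspecting these counters, the name = value pairs in a "Basic measurements" line are easy to pull into a dictionary with a short script. This is a minimal sketch based only on the log format quoted above; it is not part of Gatekeeper:

```python
import re

# Matches the "name = value" pairs inside a "Basic measurements" entry.
MEASUREMENT_RE = re.compile(r"(\w+) = (\d+)")

def parse_basic_measurements(line):
    """Return the counters of a 'Basic measurements' log line as a dict,
    or None if the line is not such an entry."""
    if "Basic measurements" not in line:
        return None
    return {key: int(value) for key, value in MEASUREMENT_RE.findall(line)}
```

With the counters in a dict, it is straightforward to track pkts_num_request over time or to compare GK cores side by side.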
The following logs appear in Grantor:
GT/8 2024-09-29 19:29:37 NOTICE 352 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:29:38 NOTICE 362 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:29:38 NOTICE 354 log entries were suppressed during the last ratelimit interval
GT/7 2024-09-29 19:29:42 NOTICE 337 log entries were suppressed during the last ratelimit interval
GT/8 2024-09-29 19:29:43 NOTICE 360 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:29:43 NOTICE 389 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:29:43 NOTICE 369 log entries were suppressed during the last ratelimit interval
GT/7 2024-09-29 19:29:47 NOTICE 353 log entries were suppressed during the last ratelimit interval
GT/8 2024-09-29 19:29:48 NOTICE 334 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:29:48 NOTICE 323 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:29:48 NOTICE 376 log entries were suppressed during the last ratelimit interval
GT/7 2024-09-29 19:29:52 NOTICE 379 log entries were suppressed during the last ratelimit interval
GT/8 2024-09-29 19:29:53 NOTICE 400 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:29:53 NOTICE 375 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:29:53 NOTICE 364 log entries were suppressed during the last ratelimit interval
GT/7 2024-09-29 19:29:58 NOTICE 102 log entries were suppressed during the last ratelimit interval
GT/8 2024-09-29 19:29:58 NOTICE 106 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:29:58 NOTICE 83 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:30:00 NOTICE 111 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:30:05 NOTICE 1 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:30:05 NOTICE 2 log entries were suppressed during the last ratelimit interval
GT/7 2024-09-29 19:30:08 NOTICE 2 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:30:10 NOTICE 1 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:30:15 NOTICE 1 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:30:21 NOTICE 1 log entries were suppressed during the last ratelimit interval
GT/8 2024-09-29 19:30:21 NOTICE 1 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:30:23 NOTICE 3 log entries were suppressed during the last ratelimit interval
GT/6 2024-09-29 19:30:37 NOTICE 5 log entries were suppressed during the last ratelimit interval
GT/8 2024-09-29 19:30:38 NOTICE 3 log entries were suppressed during the last ratelimit interval
GT/5 2024-09-29 19:30:39 NOTICE 2 log entries were suppressed during the last ratelimit interval
GT/7 2024-09-29 19:30:41 NOTICE 5 log entries were suppressed during the last ratelimit interval
GT/8 2024-09-29 19:30:50 NOTICE 3 log entries were suppressed during the last ratelimit interval
GT/7 2024-09-29 19:30:53 NOTICE 2 log entries were suppressed during the last ratelimit interval
There's not enough information here for me to form a hypothesis. You need to describe what you built and how you tested it, and be more specific; for example, I don't know what you mean by "it will cause a core request exception and cannot be self-repaired."
Below are just comments on the logs:
NOTICE N log entries were suppressed during the last ratelimit interval

means that more log is being generated than the configuration allows to be saved. Keeping the log volume low to moderate is necessary to keep disk I/O low; too much disk I/O will reduce Gatekeeper's capacity to process packets.

DYC/9 2024-09-29 16:20:02 DEBUG Client disconnected

is normal. This log entry should only be seen in test environments, since it is at DEBUG level.

DYC/9 2024-09-29 16:20:16 WARNING The reply message length (66228) exceeds the limit

means that the request passed to the Dynamic Configuration block returned more than 64KiB of data, and the reply was truncated. You may need to edit your request to avoid this situation.

At present, the symptom in my production environment is that the number of requests Gatekeeper initiates toward Grantor drops to 0, which can be seen from the pkts_num_request counter of the "Basic measurements" log. The exception occurs within 15 minutes of running. This began in September, and we did not change the code during this period. We also suspected hardware, but several Gatekeeper servers hit the problem at the same time, which essentially rules out hardware. There are no segmentation faults in the Gatekeeper or Grantor logs, and the servers' kernel and system logs are also normal. The problem cannot be diagnosed from the external logs alone. Is it possible to capture exceptions in the communication between Gatekeeper and Grantor?
Are you running RC2 on Ubuntu 24.04? Are you running Gatekeeper as it is present in our repository, that is, without patching it?
Have you tracked a testing flow when the problem occurs to see what the state of the flow is?
What does BGP see during the failure condition? Have you checked the FIB entries of type GK_FWD_GRANTOR during the failure?
I wonder if it is possible to capture exceptions between gatekeeper and grantor communication?
If a Gatekeeper server is not sending packets to Grantor servers, what do you expect to capture?
4. After the error from issue #703 occurs, this situation becomes frequent.
That issue reports a kernel issue. If the kernel runs into trouble, there is nothing an application can do to guarantee progress.
5. After restarting Gatekeeper, it runs normally for about 15 minutes. Then the pkts_num_requests count per core starts to decrease until it drops to 0.
Does pkts_num_requests go to zero even when the kernel issue above does not come up?
Post the log entries with "Basic measurements" when pkts_num_requests is zero. Knowing all the other measurements may provide a clue.
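To pull exactly those entries out of a large log, a small filter along these lines can help (an illustrative sketch, assuming the counter appears literally as "pkts_num_request = 0" in the line; not Gatekeeper tooling):

```python
import re

# Literal match for a zero request counter in a "Basic measurements" entry.
ZERO_REQUEST_RE = re.compile(r"\bpkts_num_request = 0\b")

def zero_request_lines(lines):
    """Yield 'Basic measurements' log lines whose pkts_num_request is zero."""
    for line in lines:
        if "Basic measurements" in line and ZERO_REQUEST_RE.search(line):
            yield line
```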
The Ubuntu 24.04 deployment uses the new code (the current version); the Ubuntu 20.04 deployment uses the production code (the July version). The problem occurs in all tests.
GK/10 2024-10-02 11:16:05 NOTICE Basic measurements [tot_pkts_num = 1960, tot_pkts_size = 151591, pkts_num_granted = 2, pkts_size_granted = 196, pkts_num_request = 0, pkts_size_request = 0, pkts_num_declined = 0, pkts_size_declined = 0, tot_pkts_num_dropped = 1957, tot_pkts_size_dropped = 151290, tot_pkts_num_distributed = 1, tot_pkts_size_distributed = 105, flow_table_occupancy = 2/250000000=0.0%]
- When the problem occurred, I used gkctl to query the FIB
What did you get?
- I want to know whether Gatekeeper is communicating normally with Grantor's GGU port
What's normal depends on the configuration and the flows arriving. I understand what you are asking, but the answer to your question may also answer the whole problem here.
- When pkts_num_requests is 0 in "Basic measurements", all packets are dropped:
GK/10 2024-10-02 11:16:05 NOTICE Basic measurements [tot_pkts_num = 1960, tot_pkts_size = 151591, pkts_num_granted = 2, pkts_size_granted = 196, pkts_num_request = 0, pkts_size_request = 0, pkts_num_declined = 0, pkts_size_declined = 0, tot_pkts_num_dropped = 1957, tot_pkts_size_dropped = 151290, tot_pkts_num_distributed = 1, tot_pkts_size_distributed = 105, flow_table_occupancy = 2/250000000=0.0%]
The fact that the flow table is almost empty, packets are arriving, no packets from new flows, and almost all packets are being dropped suggests a configuration issue at the FIB or BGP.
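The arithmetic behind that reading, using the counters from the entry quoted above (a quick illustration, not Gatekeeper code):

```python
# Counters copied from the "Basic measurements" entry quoted above.
tot_pkts_num = 1960
tot_pkts_num_dropped = 1957
flow_table_entries = 2  # from flow_table_occupancy = 2/250000000

drop_ratio = tot_pkts_num_dropped / tot_pkts_num
print(f"drop ratio: {drop_ratio:.1%}")  # about 99.8% of packets dropped
```

Nearly every arriving packet is dropped while the flow table holds only two entries, which is why a FIB or BGP misconfiguration is the natural first suspect.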
The FIB check and the BGP sessions are normal, and routes are learned normally, so FIB and BGP problems can be ruled out.
I added 1128 prefixes to the FIB, but only 401 entries were returned when I queried it. Is this normal, or is the return value too long?
You don't mention how you added the 1128 prefixes (e.g., did you individually verify that each prefix was successfully added?), nor how you measured the 401 prefixes you found. Therefore, I have to guess that the request's output was truncated.
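One way to check whether the output was truncated is to count the unique prefixes in the dumped text independently of the Lua tooling. This sketch assumes the dump contains each prefix in plain a.b.c.d/len form, which is an assumption about the textual dump, not a documented Gatekeeper contract:

```python
import re

# Matches IPv4 CIDR prefixes such as 10.7.0.0/16 (assumed dump format).
CIDR_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}/\d{1,2}\b")

def count_unique_prefixes(dump_text):
    """Count the distinct IPv4 CIDR prefixes appearing in a text dump."""
    return len(set(CIDR_RE.findall(dump_text)))
```

If this count disagrees with the 1128 prefixes you added, diffing the two sets pinpoints which prefixes are missing from the dump.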
Have you tried the script gkctl/scripts/summarize_fibs.lua?
I dumped the table returned by show_fib.lua to text, and it shows only 401 entries in the prefixes, which may be because the output was too long and got truncated. The FIB count reported by summarize_fibs.lua is correct. I want to query an exact IP segment to confirm whether the IP prefix is missing from the FIB when the exception occurs, but the output of show_fib.lua is an incomplete FIB and the IP prefix I tested cannot be found in it.
I wrote the script filter_fib.txt to filter any prefix that starts with "10.7.". You can edit this script to filter whatever you want from the FIB. You can read the code of the function lua/gatekeeper/dylib.lua:print_fib_dump_entry() to find out how to access the other fields of a FIB entry. My script is a variation of the script gkctl/scripts/show_fib_grantor.lua.
I encourage you to flesh out your testbed and work to reproduce this problem on your testbed. Investigating this kind of problem in a testbed is a lot less stressful. The effort to flesh out your testbed may seem too much. Still, this effort will pay dividends when dealing with other issues, testing future Gatekeeper versions, or validating configuration changes.
After October 10, the abnormal problem from September no longer appeared; everything has been stable since. On October 10, a redundant router in the backbone (it is not connected to the Gatekeeper, but packets may pass through it) was repaired. I am not sure whether the problem was caused by that router's faulty card, because when packets did not go through the Gatekeeper, all packets were normal. After the card was repaired, we ran various test cases, and packets passing through the Gatekeeper were normal. I still need to spend time observing to find the specific cause.
At present, I have upgraded production to the latest version, and the operating system has also been replaced with the new version. From October 10 (after the backbone router was repaired) until now, it has been running well. In September, the new version also had the problem of packets not reaching the backend destination IP. The abnormal problem has not reproduced for the time being. I will continue to observe and post updates.
On November 10, this problem reappeared. I recovered with the following adjustment. Hyperthreading was enabled, so each physical core's performance was split across two logical cores; the CPU was also set to energy-saving mode, which severely reduced the core frequency. This slowed instruction execution, caused frequent interrupts, and led to the Gatekeeper abnormality.
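For reference, the two settings described above can be inspected from Linux sysfs. The following sketch reports each CPU's frequency governor and its hyperthread siblings; the paths are the standard Linux cpufreq/topology files, but their availability depends on the kernel and driver, and this is not Gatekeeper tooling:

```python
from pathlib import Path

def cpu_power_report(sysfs_root="/sys/devices/system/cpu"):
    """Report each CPU's cpufreq scaling governor and SMT sibling list.

    A 'powersave'/'ondemand' governor or shared hyperthread siblings can
    starve a busy-polling packet-processing core, as described above.
    """
    report = {}
    for cpu_dir in sorted(Path(sysfs_root).glob("cpu[0-9]*")):
        governor_file = cpu_dir / "cpufreq" / "scaling_governor"
        siblings_file = cpu_dir / "topology" / "thread_siblings_list"
        report[cpu_dir.name] = {
            "governor": governor_file.read_text().strip()
                        if governor_file.exists() else None,
            "siblings": siblings_file.read_text().strip()
                        if siblings_file.exists() else None,
        }
    return report
```

On a tuned box one would expect the governor to read "performance" and, with hyperthreading disabled, each siblings list to contain a single CPU id.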
The CPU interrupt problem caused by suboptimal server CPU settings has been resolved. Closing this issue.
Version 1.2dev. Under what circumstances does the request policy between Gatekeeper and Grantor fail? Recently, the number of Grantor request packets in our production has been 0, and all packets have been dropped. No abnormal logs are seen; the Gatekeeper log only shows that the request packet count is 0 and that the total packet count equals the dropped packet count.