aicers / giganto

Raw-Event Storage System for AICE
Apache License 2.0

Packet processing should be handled in a separate thread #888

Open syncpark opened 2 weeks ago

syncpark commented 2 weeks ago

While tracking down a crash of Giganto during testing, I found the following situation.

After Giganto is started via systemctl, when the UI sends many packet-query requests, I suspect the situation shown in the log attached below can occur.

How about separating the tcpdump-based packet-extraction feature into its own thread?

Moon_aice@moon:/data/giganto/export/log.pcap.1730281649$ systemctl status giganto.service
● giganto.service - Giganto Engine
     Loaded: loaded (/etc/systemd/system/giganto.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2024-11-07 13:42:36 KST; 2h 4min ago
    Process: 787260 ExecStart=/usr/local/aice/scripts/run_giganto.sh (code=exited, status=0/SUCCESS)
   Main PID: 787270 (giganto-0.22.1)
      Tasks: 53 (limit: 459274)
     Memory: 521.1M
     CGroup: /system.slice/giganto.service
             ├─787270 /usr/local/aice/bin/giganto-0.22.1 -c /usr/local/aice/conf/giganto.toml --cert /usr/local/aice/CA/giganto_cert.pem --key /usr/local/aice/CA/giganto_key>
             ├─866980 tcpdump -n -X -tttt -v -r -
             ├─867090 tcpdump -n -X -tttt -v -r -
             ├─874556 tcpdump -n -X -tttt -v -r -
             ├─874894 tcpdump -n -X -tttt -v -r -
             ├─876336 tcpdump -n -X -tttt -v -r -
             └─876827 tcpdump -n -X -tttt -v -r -

Nov 07 13:54:04 moon run_giganto.sh[804637]: reading from file -, link-type EN10MB (Ethernet)
Nov 07 15:20:18 moon run_giganto.sh[866980]: reading from file -, link-type EN10MB (Ethernet)
Nov 07 15:20:31 moon run_giganto.sh[867090]: reading from file -, link-type EN10MB (Ethernet)
Nov 07 15:23:00 moon run_giganto.sh[873947]: reading from file -, link-type EN10MB (Ethernet)
Nov 07 15:23:57 moon run_giganto.sh[874556]: reading from file -, link-type EN10MB (Ethernet)
Nov 07 15:24:31 moon run_giganto.sh[874894]: reading from file -, link-type EN10MB (Ethernet)
Nov 07 15:26:50 moon run_giganto.sh[876336]: reading from file -, link-type EN10MB (Ethernet)
Nov 07 15:27:39 moon run_giganto.sh[876827]: reading from file -, link-type EN10MB (Ethernet)
sehkone commented 2 weeks ago

As @syncpark pointed out, we anticipated this problem when we adopted tcpdump for this feature. While we haven't found a better alternative to replace tcpdump, I think we could try what @syncpark suggested for now.

@msk, could you share your insight on this?

msk commented 2 weeks ago

I'm not sure how separating the packet info extraction into a separate thread would mitigate the memory issue. If the problem is indeed caused by a lack of memory due to multiple child threads not finishing, wouldn't we still run out of memory even if the tcpdump operation is in a separate thread?

Before we start refactoring, I think we should try to understand the memory requirements of Giganto under the expected load. How much memory do we expect to need to support the number of concurrent requests we're anticipating?

It would be great to get some more data on the memory usage patterns of Giganto and see if we can identify any bottlenecks or areas where we can optimize memory usage. Once we have a better understanding of the memory requirements, we can start discussing potential solutions, including increasing resources, optimizing memory usage, or implementing other mitigations.

syncpark commented 2 weeks ago

@msk @sehkone I was thinking of having the tcpdump thread process only one request at a time, like a FIFO queue. I think the problem would be mitigated to some extent by preventing multiple tcpdump processes from running simultaneously.
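A minimal sketch of the single-worker FIFO idea, assuming an mpsc channel feeding one dedicated thread (the request type and function names here are illustrative, not Giganto's actual code):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical request type standing in for a pcap-extraction job.
struct PcapRequest {
    id: u32,
}

// Drain requests through a single worker thread so that at most one
// `tcpdump` child process exists at any moment.
fn run_fifo(ids: Vec<u32>) -> Vec<u32> {
    let (tx, rx) = mpsc::channel::<PcapRequest>();

    let worker = thread::spawn(move || {
        let mut processed = Vec::new();
        for req in rx {
            // The real code would spawn `tcpdump` here and wait for it
            // to exit before taking the next request from the queue.
            processed.push(req.id);
        }
        processed
    });

    for id in ids {
        tx.send(PcapRequest { id }).unwrap();
    }
    drop(tx); // closing the sender ends the worker's receive loop

    worker.join().unwrap()
}

fn main() {
    // Requests come back in submission order: strict FIFO.
    assert_eq!(run_fifo(vec![0, 1, 2]), vec![0, 1, 2]);
}
```

Requests that arrive while the worker is busy simply wait in the channel instead of each spawning their own tcpdump process.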

syncpark commented 2 weeks ago

@msk @sehkone In the short term, other solutions could be considered.

Currently, tcpdump is executed with the parameters shown in the service log above (`tcpdump -n -X -tttt -v -r -`).

The problem is that the resulting pcap is too large when it is extracted from HTTP traffic carrying a large attachment. A 1.2 MB HTTP session contains over a thousand packets.

Instead of processing the full packet stream, I think the first 10 packets are enough to show the contents of the pcap in the Web UI.
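If we go that route, tcpdump's standard `-c` option already supports this: it makes tcpdump exit after the given number of packets. A sketch based on the invocation visible in the log above:

```shell
# -c 10 stops after the first 10 packets, so only the head of the
# piped pcap stream is decoded and printed:
tcpdump -c 10 -n -X -tttt -v -r -
```

This would bound both the output size and how long each tcpdump child runs.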

sehkone commented 2 weeks ago

> I'm not sure how separating the packet info extraction into a separate thread would mitigate the memory issue. If the problem is indeed caused by a lack of memory due to multiple child threads not finishing, wouldn't we still run out of memory even if the tcpdump operation is in a separate thread?
>
> Before we start refactoring, I think we should try to understand the memory requirements of Giganto under the expected load. How much memory do we expect to need to support the number of concurrent requests we're anticipating?
>
> It would be great to get some more data on the memory usage patterns of Giganto and see if we can identify any bottlenecks or areas where we can optimize memory usage. Once we have a better understanding of the memory requirements, we can start discussing potential solutions, including increasing resources, optimizing memory usage, or implementing other mitigations.

I assume the team might have some usual suspects causing this problem, but is yet to reach concrete evidence or actual data. However, we should gather some facts to identify viable alternatives. I believe @syncpark and the team will make an effort to establish test results we can analyze. What @syncpark is now suggesting is likely a quick fix to work around this until more robust solutions are planned.

msk commented 1 week ago

Thanks for clarifying the situation. Considering this is a temporary solution, I'd like to propose an alternative approach that minimizes code changes.

What if we introduce a static flag that indicates whether the system is busy processing a request? In PacketQuery::pcap, we can check this flag before processing a new request. If the flag is set, we reject the request. Otherwise, we set the flag, process the request, and then unset the flag when completed.

This approach could potentially be a simple way to address the issue. Additionally, if needed, we can easily modify this approach to use a counter instead of a flag, allowing us to accept requests up to a certain threshold.
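A rough sketch of the flag idea, assuming a process-wide atomic (the names are hypothetical; the real check would live in `PacketQuery::pcap`):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical process-wide busy flag; names are illustrative only.
static PCAP_BUSY: AtomicBool = AtomicBool::new(false);

/// Rejects the request if another pcap extraction is already in flight.
fn handle_pcap_request() -> Result<(), &'static str> {
    // Atomically flip false -> true; fail fast if the flag was already set.
    if PCAP_BUSY
        .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        return Err("busy: a pcap request is already being processed");
    }
    // ... spawn tcpdump and wait for it here ...
    PCAP_BUSY.store(false, Ordering::Release); // unset when done
    Ok(())
}

fn main() {
    assert!(handle_pcap_request().is_ok()); // flag is free: accepted
    PCAP_BUSY.store(true, Ordering::Release); // simulate a request in flight
    assert!(handle_pcap_request().is_err()); // concurrent request rejected
}
```

Swapping the `AtomicBool` for an `AtomicUsize` with `fetch_add`/`fetch_sub` would give the counter variant, allowing requests up to a fixed threshold. One caveat for the real code: the flag should be cleared on every exit path (e.g. via a drop guard), or an error while spawning tcpdump would leave the system permanently "busy".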

I'm open to either this approach or @syncpark's original proposal.