Store the database separately on two HDD RAIDs

syncpark commented 9 months ago

Issue

In the TIS project, each collector machine receives 10 Gbps or 20 Gbps of input traffic. Since Giganto cannot process all the events sent by a single Piglet, its storage/retrieval performance needs to be improved.

Purpose

Let's improve storage/retrieval performance by storing Conn events on a different HDD RAID from the other protocols.

Background

Event ratio by protocols:

Total: 5 billion events. Piglet generates this many events per day, but these events cover only about 30% of the total bandwidth.
- Conn events: 4.3 billion (86%)
- DNS events: 0.11 billion (2.2%)
- HTTP events: 0.34 (0.68%)
- TLS events (HTTPS): 0.51 billion (10.2%)

REconverge mainly analyzes protocols other than Conn.

If Conn events are stored on a separate HDD RAID, they no longer compete for disk I/O with storage and search requests for other protocols. As a result, we can expect improved performance.

TODOs

- Support setting different DB storage paths for Conn and for other protocols
- Create and manage the Conn DB and the other-protocol DB separately
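
As a rough illustration of the first TODO, the sketch below routes each protocol to one of two DB directories. All names here (`Protocol`, `StoragePaths`, `db_path_for`) are hypothetical and are not taken from Giganto's code; how Giganto actually represents protocols and settings may differ.

```rust
use std::path::{Path, PathBuf};

/// Protocols named in this issue. The enum itself is hypothetical.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Protocol {
    Conn,
    Dns,
    Http,
    Tls,
}

/// Hypothetical settings: one DB directory per HDD RAID volume.
#[derive(Clone, Debug)]
pub struct StoragePaths {
    /// Directory on the RAID reserved for Conn events.
    pub conn_db: PathBuf,
    /// Directory on the RAID shared by all other protocols.
    pub other_db: PathBuf,
}

impl StoragePaths {
    /// Returns the DB directory that should hold events of `protocol`.
    pub fn db_path_for(&self, protocol: Protocol) -> &Path {
        match protocol {
            Protocol::Conn => self.conn_db.as_path(),
            Protocol::Dns | Protocol::Http | Protocol::Tls => self.other_db.as_path(),
        }
    }
}
```

With such a mapping, the second TODO would amount to opening one DB per directory and picking the handle by protocol on every store or search request.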

sehkone commented 9 months ago

@syncpark I'd like you to collect and organize the issues related to Giganto's performance. I think the first step is to work out a big-picture strategy for improving Giganto's performance.

@msk, @sophie-cluml, let's discuss this together.

msk commented 9 months ago

This approach of storing Conn events and other protocols' events in parallel could reduce latency by about 14% (100% - 86%), which is not negligible. However, I have concerns that this alone might not sufficiently address Giganto's scalability issues under heavy traffic.
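
The 14% figure can be reproduced with a small back-of-the-envelope model. It assumes that write time is proportional to event count and that the partitions are written fully in parallel, so the slowest partition determines the total time. Both are simplifying assumptions, and the function name is made up for this sketch.

```rust
/// Best-case fractional reduction in storage time when events are split into
/// partitions that are written fully in parallel.
///
/// Assumption: write time is proportional to event count, so the partition
/// with the largest share of events determines the total time.
/// `shares` holds the fraction of events per partition and should sum to 1.0.
pub fn best_case_reduction(shares: &[f64]) -> f64 {
    if shares.is_empty() {
        return 0.0;
    }
    // The slowest partition is the one holding the largest share of events.
    let slowest = shares.iter().copied().fold(0.0_f64, f64::max);
    1.0 - slowest
}
```

Under this model, `best_case_reduction(&[0.86, 0.14])` is about 0.14 for the Conn/other split, while an even two-way split such as `&[0.5, 0.5]` would give 0.5.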

To tackle the core of the problem, we first need to identify where the bottleneck lies. @syncpark's suggestion hints at the physical disk I/O being the constraint. If that's the case, a potential solution could be to increase the number of stripes in our RAID configuration. This might offer a simpler and possibly more effective way to enhance performance compared to separating Conn and other events.

On the other hand, if the bottleneck is at the level of RocksDB operations, like locking or transaction handling, splitting the events across multiple RocksDB instances on different disks could be beneficial. However, dividing them based on event type may not be the most efficient, particularly when a single type (e.g., Conn) dominates. A more balanced approach could be to distribute events evenly, perhaps using hash values.
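
A minimal sketch of hash-based distribution, assuming each event has some byte key to hash (which key to use is left open here). FNV-1a is written out by hand so that the key-to-shard mapping does not change between builds; it is an illustrative choice, and `shard_for` is a hypothetical helper.

```rust
/// 64-bit FNV-1a hash, implemented directly so the result is stable
/// across builds and toolchain versions.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Maps an event key to one of `shard_count` DB instances.
/// The same key always lands on the same shard.
pub fn shard_for(key: &[u8], shard_count: usize) -> usize {
    assert!(shard_count > 0, "shard_count must be at least 1");
    (fnv1a64(key) % shard_count as u64) as usize
}
```

One trade-off to keep in mind: once events are spread by hash, a query over a range of keys has to consult every shard and merge the results.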

Additionally, it’s crucial to consider how much CPU time is currently idle. If we have sufficient CPU resources available, we might explore more aggressive methods. These could include batching events for storage (e.g., storing 1,000 events in a single RocksDB column family entry), compressing events before storage, or implementing both strategies.
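
The batching idea could look roughly like the following, assuming events are already serialized to bytes. Each batch is packed into one buffer with a 4-byte little-endian length prefix per event; the format and the function names are invented for this sketch. Compression is not shown: it would be applied to the packed buffer using whichever compression crate is chosen.

```rust
/// Maximum number of events per stored entry (the 1,000 from the comment above).
pub const BATCH_SIZE: usize = 1_000;

/// Packs events into one buffer. Each event is prefixed by its length,
/// encoded as a little-endian u32.
pub fn pack_batch(events: &[Vec<u8>]) -> Vec<u8> {
    let total: usize = events.iter().map(|e| 4 + e.len()).sum();
    let mut buf = Vec::with_capacity(total);
    for event in events {
        assert!(event.len() <= u32::MAX as usize, "event too large");
        let len = event.len() as u32;
        buf.extend_from_slice(&len.to_le_bytes());
        buf.extend_from_slice(event);
    }
    buf
}

/// Inverse of `pack_batch`. Returns `None` if the buffer is truncated.
pub fn unpack_batch(mut buf: &[u8]) -> Option<Vec<Vec<u8>>> {
    let mut events = Vec::new();
    while !buf.is_empty() {
        if buf.len() < 4 {
            return None;
        }
        let len = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
        let rest = &buf[4..];
        if rest.len() < len {
            return None;
        }
        let (event, tail) = rest.split_at(len);
        events.push(event.to_vec());
        buf = tail;
    }
    Some(events)
}

/// Splits events into packed entries of at most `BATCH_SIZE` events each.
pub fn pack_all(events: &[Vec<u8>]) -> Vec<Vec<u8>> {
    events.chunks(BATCH_SIZE).map(pack_batch).collect()
}
```

Each packed buffer would then be written as a single DB value, which trades fewer, larger writes for the cost of unpacking a whole batch to read one event.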
