adambaumeister / goflow

Golang netflow parsing with flexible storage backend.
GNU General Public License v3.0
55 stars 9 forks source link

Description

A golang-based netflow collector with a flexible backend.

A list of upcoming features can be found under the issue tracker for this project.

Currently supported frontend/backend combinations are

Frontend Backend
Netflow Mysql
Netflow Timescaledb
Netflow Apache Kafka

Prereqs

You need a running backend and associated connection information;

For Mysql, you could use the free tier of Amazon RDS to get started.

See the SETUP.md instructions within the backend directories in this repo for help specific to each backend. (i.e ./backends/timescale/SETUP.MD)

Installation

Goflow requires two files to run;

The tar releases contain both these.

# Extract and set perms
tar -xzvf goflow.Linux.AMD64.tar.gz
chmod +x goflow
mv config_example.yml config.yml
# Edit the config.yml file to make specific to your environment
vi config.yml
# Export the required environment variables
export SQL_PASSWORD=your_sql_pw_here
# Run
./goflow

In the future, an installation script will be packaged for most systems but for now, you will need to create your own systemd or init scripts to start it.

Monitoring and utilities

The goflow binary doubles as a client interface, and a JSON API is started at the same time as the daemon.

The API listens on localhost by default, but this can be tuned (see the configuration example).

The Goflow API is not for retrieving flow data, but performing ongoing maintenence and ops on the collector itself.

Goflow help displays a list of options.

./goflow help

Integrations

Grafana

Goflow integrates natively (i.e - no plugins required) with Grafana when using the Timescale backend type.

Grafana transforms the underlying Postgres database into a set of pretty graphs!

Note: dummy data shown

For convenience, Goflow provides the Dashboards (/grafana_db/*.json) and some code to setup Grafana correctly.

You need:

With these requirements met, the dashboards/datasources can be setup as below.

# Make sure the required env-var for timescale is exported
export SQL_PASSWORD=your-sql-password
./goflow configure-grafana http://[ your-grafana-server ] [ your-api-key ] [ dashboard-directory ]

Performance

Benchmarks

Each release of Goflow is benchmarked in a test environment.

Currently the most efficient backend is timescaledb.

The environment setup is;

Both network latency and storage have a large impact on performance. The benchmarks above are running with Goflow on the same server as the backends.

Notes on tuning and hardware requirements

Netflow, unsurprisingly, generates a lot of data.

It's unreasonable to try and estimate compute and storage requirements ahead of time, as this sort of thing is hard to quantify, as it's entirely based on how many flows you're exporting which you probably don't already know!

Instead, first decide what your goals are for the data. Specifically, decide how much data you want to store, then what timeframe you want to be able to query quickly and finally, what constitutes "quickly."

After you have that decided, understand how increasing hardware attributes affects each decision:

In practice? You should give your SQL server access to an amount of shared memory equal to the amount of data that fits in the timeframe you would like to query quickly. If it is unreasonable to fit that amount of data into memory you need to increase storage READ speeds.

Don't forget to actually tune your database after installation (we've all done it...)! Timescale offers a super good utility for doing it automatically: https://github.com/timescale/timescaledb-tune

Example

To illustrate the above points, imagine example.corp wants to store 6 months of netflow data. They would like to query 24 hours worth as quickly as possible to use on their auto-refreshing wallboards in the office, which refresh once every 30 seconds.

From experimentation, they run at 2k flows per second average with each flow attributed to approximately 150 bytes/flow on disk.

(300B2000)86400 = 25GB/day

A reccomended hardware setup would be; CPU: 6-8 cores Memory: 32GB Minimum Storage: 4.5TB of disk benchmarked to at least 100MB/s read.

Environment variables

Below are a list of all the supported environment variables and the scope in which they are relevant.

Envar Scope Purpose
GOFLOW_CONFIG * Path to configuration file (config.yml)
SQL_PASSWORD Timescaledb, mysql SQL password
KAFKA_SERVER Kafka Kafka server
KAFKA_TOPIC Kafka Topic to publish to
SSL Kafka SSL Enabled/disabled
SSL_VERIFY Kafka SSL Verfification
SASL_USER Kafka Kafka username
SASL_PASSWORD Kafka Kafka password