Description

A golang-based netflow collector with a flexible backend.

Upcoming features are listed in this project's issue tracker.

Currently supported frontend/backend combinations are:

Frontend	Backend
Netflow	MySQL
Netflow	TimescaleDB
Netflow	Apache Kafka

Prereqs

You need a running backend and its connection information:

Server fqdn
Username
Password
Database name

For MySQL, you could use the free tier of Amazon RDS to get started.

See the SETUP.md instructions within the backend directories in this repo for help specific to each backend (e.g. ./backends/timescale/SETUP.MD).

Installation

Goflow requires two files to run:

goflow, the binary itself
config.yml, the configuration file

The tar releases contain both of these.

# Extract and set perms
tar -xzvf goflow.Linux.AMD64.tar.gz
chmod +x goflow
mv config_example.yml config.yml
# Edit the config.yml file to make specific to your environment
vi config.yml
# Export the required environment variables
export SQL_PASSWORD=your_sql_pw_here
# Run
./goflow

In the future, an installation script will be packaged for most systems. For now, you will need to create your own systemd or init scripts to start Goflow.
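
Until that script exists, a systemd unit along the following lines may be enough. This is only a sketch: the install directory, the environment file path, and the assumption that goflow stays in the foreground when started are all illustrative and not taken from the project. The render_unit helper is hypothetical; it just prints a unit file so you can review it before installing.

```shell
#!/bin/sh
# render_unit prints a systemd unit for goflow to stdout.
#   $1: directory holding the goflow binary and config.yml (hypothetical layout)
#   $2: file containing a SQL_PASSWORD=... line (hypothetical path)
render_unit() {
  install_dir=$1
  env_file=$2
  cat <<EOF
[Unit]
Description=Goflow netflow collector
After=network-online.target
Wants=network-online.target

[Service]
# Assumption: goflow runs in the foreground when started
Type=simple
WorkingDirectory=${install_dir}
Environment=GOFLOW_CONFIG=${install_dir}/config.yml
EnvironmentFile=${env_file}
ExecStart=${install_dir}/goflow
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# Example (run as root):
#   render_unit /opt/goflow /etc/goflow/env > /etc/systemd/system/goflow.service
#   systemctl daemon-reload && systemctl enable --now goflow
```

Keeping the password in a separate environment file (readable only by root) avoids putting it in the unit file itself.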

Monitoring and utilities

The goflow binary doubles as a client interface, and a JSON API is started at the same time as the daemon.

The API listens on localhost by default, but this can be tuned (see the configuration example).

The Goflow API is not for retrieving flow data; it is for ongoing maintenance and operations on the collector itself.

The help subcommand displays a list of options.

./goflow help

Integrations

Grafana

Goflow integrates natively (i.e. no plugins required) with Grafana when using the Timescale backend type.

Grafana transforms the underlying Postgres database into a set of pretty graphs!

Note: dummy data shown

For convenience, Goflow provides the dashboards (/grafana_db/*.json) and some code to set up Grafana correctly.

You need:

A timescale backend, already configured in config.yml
A running grafana instance
An API key

With these requirements met, the dashboards and datasources can be set up as below.

# Make sure the required env-var for timescale is exported
export SQL_PASSWORD=your-sql-password
./goflow configure-grafana http://[ your-grafana-server ] [ your-api-key ] [ dashboard-directory ]

Performance

Benchmarks

Each release of Goflow is benchmarked in a test environment.

Currently the most efficient backend is timescaledb.

The environment setup is:

Type: AWS T2.Micro
CPU: 1vCPU
Memory: 1GB
Storage: EBS SSD (Non-provisioned)

Both network latency and storage have a large impact on performance. These benchmarks are run with Goflow on the same server as the backends.

Notes on tuning and hardware requirements

Netflow, unsurprisingly, generates a lot of data.

It's unreasonable to try to estimate compute and storage requirements ahead of time: they depend entirely on how many flows you're exporting, which you probably don't know yet!

Instead, first decide what your goals are for the data. Specifically, decide how much data you want to store, then what timeframe you want to be able to query quickly and finally, what constitutes "quickly."

Once you have decided, understand how increasing each hardware attribute affects those goals:

More memory allows for more caching, which allows you to run short time range queries very efficiently. In a real environment, doubling the memory of a timescaledb instance reduced a SELECT query runtime by more than 10x.
More cores will make complicated sorts, joins, and other SQL manipulations faster when reading from memory.
Faster storage improves the speed of queries that cannot be cached or are not yet cached.

In practice? You should give your SQL server access to an amount of shared memory equal to the amount of data that fits in the timeframe you would like to query quickly. If it is unreasonable to fit that amount of data into memory, you need to increase storage read speeds.

Don't forget to actually tune your database after installation (we've all done it...)! Timescale offers a very good utility for doing it automatically: https://github.com/timescale/timescaledb-tune

Example

To illustrate the above points, imagine example.corp wants to store 6 months of netflow data. They would like to query 24 hours' worth as quickly as possible for the auto-refreshing wallboards in their office, which refresh once every 30 seconds.

From experimentation, they average 2,000 flows per second, with each flow taking approximately 150 bytes on disk.

(150B * 2000) * 86400 ≈ 25GB/day

A recommended hardware setup would be:

CPU: 6-8 cores
Memory: 32GB minimum
Storage: 4.5TB of disk benchmarked to at least 100MB/s read
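
The arithmetic behind this example can be reproduced with a few lines of shell, which makes it easy to plug in your own measurements. This is a rough sketch: the function names are made up for illustration, and treating 6 months as 180 days is an assumption.

```shell
#!/bin/sh
# daily_bytes FLOWS_PER_SEC BYTES_PER_FLOW
# Bytes written per day (86400 seconds) at a steady flow rate.
daily_bytes() {
  echo $(( $1 * $2 * 86400 ))
}

# retention_bytes FLOWS_PER_SEC BYTES_PER_FLOW DAYS
# Bytes needed to keep DAYS days of flows on disk.
retention_bytes() {
  echo $(( $(daily_bytes "$1" "$2") * $3 ))
}

# example.corp: 2000 flows/s at 150 bytes/flow, kept for 180 days (assumed)
daily=$(daily_bytes 2000 150)
total=$(retention_bytes 2000 150 180)

# Integer division truncates, so these are rounded down to whole GB
echo "per day:  $(( daily / 1000000000 )) GB"
echo "180 days: $(( total / 1000000000 )) GB"
```

The exact figures are 25.92GB per day and roughly 4.67TB for 180 days. Rounding the daily figure down to 25GB gives the 4.5TB quoted above, so treat that storage number as a floor. The daily figure is also the amount of data the 24-hour fast-query window has to hold in memory, which is why 32GB is suggested.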

Environment variables

Below is a list of all the supported environment variables and the scope in which each is relevant.

Envar	Scope	Purpose
GOFLOW_CONFIG	*	Path to configuration file (config.yml)
SQL_PASSWORD	TimescaleDB, MySQL	SQL password
KAFKA_SERVER	Kafka	Kafka server
KAFKA_TOPIC	Kafka	Topic to publish to
SSL	Kafka	SSL enabled/disabled
SSL_VERIFY	Kafka	SSL verification
SASL_USER	Kafka	Kafka username
SASL_PASSWORD	Kafka	Kafka password
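
Because a missing variable is an easy mistake to make in a start script, a small pre-flight check can help. The require_env helper below is hypothetical and not part of Goflow; it only confirms that the named variables are set and non-empty before the collector is launched. Which variables your backend actually requires is for you to decide from the table above.

```shell
#!/bin/sh
# require_env NAME...
# Returns non-zero (and prints to stderr) if any named variable is unset or empty.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example for a SQL backend:
#   require_env SQL_PASSWORD && ./goflow
# Example for Kafka (assuming these two are the ones you rely on):
#   require_env KAFKA_SERVER KAFKA_TOPIC && ./goflow
```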
