EinsamHauer / disthene


Allow for use of compact storage in Cassandra #8

Closed: aww-yiss closed this issue 7 years ago

aww-yiss commented 8 years ago

Hello, in working with Disthene over the past few months, we've found that we have a bit of a problem with storage in Cassandra. Our current load is on the order of ~20 million unique metrics per minute, and some of our teams are (for reasons) submitting metrics under path names that can be in excess of 500 bytes long.

The schema that is currently in place (per the recommended version) stores a HUGE amount of data within the row key for each data point, and once the C* overhead is factored in, cluster disk space is consumed very quickly (>40TB in <60 days).

So, investigating options a bit, it seems this schema could be updated to make use of compact storage, as outlined at https://docs.datastax.com/en/cql/3.1/cql/cql_reference/create_table_r.html?scroll=reference_ds_v3f_vfk_xj__using-compact-storage

The caveat is that this change prevents the data column from being a collection type, which means it is not compatible with existing deployments. However, the potential disk savings could be tremendous, and it could also help balance out any sstable hotspots.

I'm currently investigating this with a modified schema such as:

CREATE TABLE disthene.metric (
    tenant text,
    period int,
    rollup int,
    path text,
    time bigint,
    data double,
    PRIMARY KEY ((tenant, period, rollup, path), time)
) WITH COMPACT STORAGE
    AND CLUSTERING ORDER BY (time ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.1
    AND speculative_retry = 'NONE';

Note that the data value is stored only as a double, which AFAIK should be OK given that these are discrete data points per timestamp; subsequent data points will be stored under the same key,

ie:

sstable2json disthene-metric-ka-1-Data.db
[
{"key": "NONE:86400:60:some.kind.of.path.for.a.metric",
 "cells": [["1461334279","42.01",1461335686811528],
           ["1461334379","423.01",1461335755300081],
           ["1461334479","4.01",1461335755313645],
           ["1461334579","432.01",1461335755321461],
           ["1461334679","22.01",1461335755329626],
           ["1461334779","40.04",1461335755338031],
           ["1461334879","419.01",1461335756359523]]}
]
EinsamHauer commented 8 years ago

The data field was made a list (in cyanite) on purpose. The problem is with aggregated metrics: if you have a number of disthene instances, say, behind a balancer, each of them receives part of the paths to be aggregated, and each writes its own portion of the sum. And for a plain double column C* doesn't offer anything like UPDATE data=data+? WHERE ...

Another case would be that you receive some metrics to be aggregated much later.

That's the reasoning behind the list
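
Roughly, in CQL terms (a sketch only, assuming the existing non-compact schema uses the same key layout as the proposal above but with data as a list<double>; the values are made up):

-- Existing (non-compact) layout, where data is a collection
CREATE TABLE disthene.metric (
    tenant text,
    period int,
    rollup int,
    path text,
    time bigint,
    data list<double>,
    PRIMARY KEY ((tenant, period, rollup, path), time)
);

-- Each disthene instance can append its partial sum without a read-before-write
UPDATE disthene.metric
SET data = data + [123.0]
WHERE tenant = 'NONE' AND period = 86400 AND rollup = 60
  AND path = 'some.kind.of.path.for.a.metric' AND time = 1461334279;

-- With data as a plain double there is no equivalent of data = data + ?,
-- so concurrent writers would simply overwrite each other's values.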

EinsamHauer commented 8 years ago

OFF TOPIC: Using Deflate compression gives quite some savings in terms of disk space
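
In case it helps, switching an existing table over is a single statement; a sketch assuming the 2.x-style compression options used in the schema above (already-written SSTables are rewritten on compaction, or via nodetool upgradesstables):

ALTER TABLE disthene.metric
    WITH compression = {'sstable_compression': 'org.apache.cassandra.io.compress.DeflateCompressor'};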

aww-yiss commented 8 years ago

Great point about the Deflate compression. I'm reviewing this now, and it may improve the storage issue (it may also be worth updating the README to include this recommendation).

Regarding the load balancing of metrics to Disthene instances, I can see how this would be problematic. However, I've addressed it by not using a traditional load balancer, in favor of carbon-c-relay, which consistently hashes metrics. This allows you to scale out the number of collectors easily while ensuring that the metrics which need to be aggregated together are delivered to the same collector endpoint, so the accuracy of the aggregation isn't affected.
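
To illustrate, the carbon-c-relay configuration I have in mind looks roughly like this (hostnames, ports, and the replication factor are made up, and the exact syntax may vary between versions):

cluster disthene
    carbon_ch replication 1
        disthene-a:2003
        disthene-b:2003
    ;

match *
    send to disthene
    stop
    ;

With carbon_ch, a given metric path always hashes to the same collector, so all samples that need to be aggregated together arrive at a single disthene instance.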

I'm curious, though, if you can elaborate on how aggregations take place in the load-balanced model. With Disthene instances (say A and B) behind a load balancer, and assuming traffic is round-robined equally to them, how is an instance able to produce accurate aggregations from receiving only half the data?

EinsamHauer commented 8 years ago

@aww-yiss

  1. FTR: Deflate compression gives ~25% decrease in storage.
  2. I'm not sure WITH COMPACT STORAGE gives a dramatic win, but I would be curious, as we use C* for other projects as well. I was reluctant to try it as the use of this option seems to be somewhat discouraged by the C* folks.
  3. I would also be curious how it works with carbon-c-relay - does it work well?
  4. As for the round-robin scheme, it's awkward enough. Basically, as I wrote earlier, you can't do UPDATE data=data+? WHERE ... if data is a double, but you can do UPDATE data=data+[123] WHERE ... if data is a list. So, if there are 2 instances of disthene, we'll most likely end up with a list containing two values. The awkward part is that this is supported in disthene-reader: whenever it reads a metric with a path like "^sum.*" it reads the whole list and sums the values. For every other path it still reads the whole list but then averages it.

At some point I was thinking about implementing something more sophisticated like clustering, a hash ring, whatever. But that's quite some effort and makes the whole thing much more fragile. And disks are sort of cheap these days.

EinsamHauer commented 8 years ago

@aww-yiss, I've added a note on DeflateCompressor to the README.

Otherwise, I'm pretty impressed by 20M/minute. I always thought that we write really a lot at @iponweb, but we write slightly less than that.