agoragames / kairos

Python module for time series data in Redis and Mongo
BSD 3-Clause "New" or "Revised" License

Cassandra optimizations #51

Open awestendorf opened 10 years ago

awestendorf commented 10 years ago

Should the order of the Cassandra primary key be reversed? See this presentation, starting at the linked timestamp (around 23:52).

C* Summit 2013: The World's Next Top Data Model https://www.youtube.com/watch?v=HdJlsOZVGwM&list=PLqcm6qE9lgKJoSWKYWHWhrVupRbS8mmDA#t=1432

awestendorf commented 10 years ago

Looking at the presentation more, it seems that perhaps this should be configurable, so that users have the option based on how they're using the data.
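As a rough sketch of what "configurable" could look like, assuming "reverse order" means a descending clustering order on the time columns: the helper below renders a CQL `CLUSTERING ORDER BY` clause from a flag. The function name and the `reverse_clustering` flag are hypothetical and not part of kairos; the column names `i_time` and `r_time` are the ones mentioned later in this thread.

```python
def clustering_order_clause(columns, descending=False):
    """Render a CQL CLUSTERING ORDER BY clause for the given clustering columns."""
    direction = "DESC" if descending else "ASC"
    ordered = ", ".join("{} {}".format(col, direction) for col in columns)
    return "WITH CLUSTERING ORDER BY ({})".format(ordered)


# Hypothetical configuration flag, not an existing kairos option.
config = {"reverse_clustering": True}

clause = clustering_order_clause(["i_time", "r_time"], config["reverse_clustering"])
print(clause)
```

With the flag off, the same call yields ascending order, so users who mostly read oldest-first data would keep the current layout.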

awestendorf commented 10 years ago

Also look into changing the primary key to use a composite partition key on (name, interval). This still leaves open the possibility of hitting the row width limit (2 billion columns), so it might be necessary to use (name, interval, i_time, r_time). The downside of adding time to the composite key is that range queries are impacted. This may need to be a configuration option, which I'm not too hot on, but I'm coming to understand that Cassandra requires up-front data modeling based on the use case.
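To make the two layouts concrete, here is a minimal sketch that renders the `PRIMARY KEY` clause for each option. The helper name is hypothetical and this is not how kairos currently builds its schema; only the column names come from the comment above.

```python
def primary_key_clause(partition_cols, clustering_cols):
    """Render a CQL PRIMARY KEY clause.

    A composite partition key is wrapped in its own parentheses;
    clustering columns follow it in order.
    """
    partition = ", ".join(partition_cols)
    if len(partition_cols) > 1:
        partition = "({})".format(partition)
    parts = [partition] + list(clustering_cols)
    return "PRIMARY KEY ({})".format(", ".join(parts))


# Option A: one wide partition per (name, interval); time stays a clustering
# column, so a time range can be read from a single partition.
wide = primary_key_clause(["name", "interval"], ["i_time", "r_time"])

# Option B: time is part of the partition key; partitions stay small, but a
# time range now spans many partitions.
narrow = primary_key_clause(["name", "interval", "i_time", "r_time"], [])

print(wide)
print(narrow)
```

Option A keeps range queries cheap at the cost of partition width; option B trades the other way, which is why the choice probably has to follow the use case.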

awestendorf commented 10 years ago

One concern about putting all the data in a single row is how that will affect performance over time. See the following video at around the 10:00 mark, where rows spread across SSTables are discussed.

C* Summit 2013: How Not to Use Cassandra https://www.youtube.com/watch?v=0u-EKJBPrj8#t=600

awestendorf commented 10 years ago

In the same video as above, around the 12:00 mark, there is a discussion of how size-tiered compaction can be a good strategy for time series data. Consider using that as the default.

http://www.datastax.com/docs/1.1/operations/tuning#tuning-compaction

However, around 25:00-30:00, the presenter discusses how tombstones will not be deleted in various cases, especially with size-tiered compaction. This is relevant to using TTLs.
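For reference, a small sketch of how the compaction strategy could be set explicitly from Python, using the CQL 3 `compaction` table option. The table name `kairos_series` is a made-up placeholder, and the helper is an illustration, not existing kairos code. Note that Cassandra already uses `SizeTieredCompactionStrategy` when no strategy is specified, so this mainly matters if the strategy becomes configurable.

```python
def compaction_statement(table, strategy="SizeTieredCompactionStrategy"):
    """Build an ALTER TABLE statement that sets the compaction strategy."""
    return "ALTER TABLE {} WITH compaction = {{'class': '{}'}}".format(table, strategy)


# 'kairos_series' is a hypothetical table name used only for illustration.
statement = compaction_statement("kairos_series")
print(statement)
```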

awestendorf commented 10 years ago

In the same video, at 46:00, there is a note that writing to the same row over and over again leads to bad performance, implying that using (name, interval) as the partition key is not a good idea.
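One common way to avoid hammering a single row is to add a time bucket to the partition key, so writes move to a new partition as time advances. The sketch below is only an assumption about how that could work here: the bucket size and both helper names are hypothetical. It also shows the cost noted earlier in the thread, since a range query has to visit every bucket it overlaps.

```python
def bucket_start(timestamp, bucket_seconds):
    """Floor a Unix timestamp to the start of the bucket that contains it."""
    return timestamp - (timestamp % bucket_seconds)


def buckets_in_range(start, end, bucket_seconds):
    """List the start of every bucket touched by the inclusive range [start, end]."""
    first = bucket_start(start, bucket_seconds)
    last = bucket_start(end, bucket_seconds)
    return list(range(first, last + 1, bucket_seconds))


DAY = 86400  # hypothetical bucket size: one partition per series per day

# A write at this timestamp lands in the partition for its day.
print(bucket_start(90000, DAY))

# A read over three days has to query one partition per bucket.
print(buckets_in_range(0, 2 * DAY, DAY))
```

A larger bucket means fewer partitions per range query but wider rows, so the bucket size would likely need to scale with the interval being stored.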