JeremyGrosser / tablesnap

Uses inotify to monitor Cassandra SSTables and upload them to S3
BSD 2-Clause "Simplified" License

Freshen keys & upload a tokens.yaml if requested #87

Closed juiceblender closed 7 years ago

juiceblender commented 7 years ago

Hi Jeremy,

This PR adds 2 features:

The first uploads the node's tokens to a tokens.yaml in the bucket. Example file in S3 (from a node with 32 tokens; it will be placed at :/tokens.yaml):

-9077317996469546256, -8863349240261109614, -8421444024767824091, -7567840383426059561, -6427240226670512666, -6143170743294629874, -6077662968604323298, -5799186478977776950, -5732030913645011502, -5630823829365981928, -5511837005190718500, -4975999313611487520, -3095220848989696878, -2883380728344320335, -2838164825870388770, -2528093677396508490, -113798695540279191, 911430545312801057, 1328924741345247047, 1490885714324738803, 1622295851747942555, 1952426730596669822, 2109975875433605992, 2180150360351360519, 2795170937096319844, 2816376218517493285, 3032774582364902502, 3340478910114050221, 3823298476926254629, 4567837931791487608, 5741266889733696120, 7355327284946513835,

The other provides the option to freshen keys - this could be useful for people who set lifecycle policies on their buckets. Every time a listdir.json is uploaded, all files in that list are copied in place to update their timestamps. So, e.g., if you have a lifecycle of 7 days, then as long as the SSTables (which are immutable) are still in listdir.json, they will be 'freshened' as if they had just been uploaded again. Those that are no longer part of the set will eventually die out after 7 days.

It looks like this:

2017-10-18 04:31:59,163 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-5-big-Digest.crc32
2017-10-18 04:31:59,179 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-6-big-Filter.db
2017-10-18 04:31:59,222 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-6-big-Digest.crc32
2017-10-18 04:31:59,224 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-7-big-CompressionInfo.db
2017-10-18 04:31:59,275 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-7-big-Index.db
2017-10-18 04:31:59,283 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-6-big-CompressionInfo.db
2017-10-18 04:31:59,297 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-7-big-Digest.crc32
2017-10-18 04:31:59,340 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-6-big-Summary.db
2017-10-18 04:31:59,349 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-6-big-Statistics.db
2017-10-18 04:31:59,382 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-5-big-Data.db
2017-10-18 04:31:59,386 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-7-big-Digest.crc32
2017-10-18 04:31:59,400 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-7-big-Summary.db
2017-10-18 04:31:59,444 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-5-big-Index.db
2017-10-18 04:31:59,464 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-5-big-Digest.crc32
2017-10-18 04:31:59,486 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-6-big-Summary.db
2017-10-18 04:31:59,495 INFO Freshened key tester:/tokens.yaml
2017-10-18 04:31:59,519 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-7-big-CompressionInfo.db
2017-10-18 04:31:59,552 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-5-big-Data.db
2017-10-18 04:31:59,577 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-6-big-CompressionInfo.db
2017-10-18 04:31:59,606 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-5-big-Index.db
2017-10-18 04:31:59,619 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-6-big-Statistics.db
2017-10-18 04:31:59,648 INFO Freshened key tester:/tokens.yaml
2017-10-18 04:31:59,660 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-7-big-Digest.crc32
2017-10-18 04:31:59,728 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-6-big-Summary.db
2017-10-18 04:31:59,766 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-5-big-Data.db
2017-10-18 04:31:59,928 INFO Freshened key tester:/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-5-big-Index.db
2017-10-18 04:31:59,988 INFO Freshened key tester:/tokens.yaml
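
Roughly, the freshening boils down to something like this (sketched with boto3 here just for clarity - tablesnap itself uses the older boto library, and the bucket name is a placeholder):

```python
import boto3

s3 = boto3.client('s3')

def freshen_keys(bucket, keys):
    """Copy each key onto itself so S3 resets its last-modified time,
    restarting the lifecycle-expiration clock for that object."""
    for key in keys:
        head = s3.head_object(Bucket=bucket, Key=key)
        s3.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={'Bucket': bucket, 'Key': key},
            Metadata=head.get('Metadata', {}),
            MetadataDirective='REPLACE',  # S3 rejects a plain self-copy
        )

# e.g. freshen everything referenced by a listdir.json manifest:
# freshen_keys('my-backup-bucket', keys_from_listdir_json)
```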

I realise that tablesnap does have its drawbacks with regard to the way I've implemented freshening: because it uses pyinotify and a queue, there's no way to guarantee whether Data.db or Index.db or TOC.txt or whatever gets put in the queue first. This means there's a chance that Data.db comes first, my patch tries to freshen keys based on listdir.json, and it will try to freshen everything, including the new companion files to Data.db, namely

mc-1-big-Data.db       mc-1-big-Filter.db  mc-1-big-Statistics.db  mc-1-big-TOC.txt
mc-1-big-CompressionInfo.db  mc-1-big-Digest.crc32  mc-1-big-Index.db   mc-1-big-Summary.db

that have not been uploaded yet. At the moment I just ignore it. Also, every time a new SSTable is created (i.e. a new Data.db appears), it will freshen all SSTables that are still recognized by Cassandra. S3 has a rate limit of around 300 PUT requests per second, so this could add up. I could always use a timer for my own purposes to periodically (e.g., daily) fetch the latest listdir.json for each and every KS/table combo and freshen the files in that set. However, I was thinking that tablesnap could provide this option to make it a more general, meets-people's-needs package for Cassandra backups. The downsides (and the extra guff in the code) may not be worth it. Let me know what you think - it's ok if you don't think it's a good idea; I'll just create a timer for my own purposes.

JeremyGrosser commented 7 years ago

I think the idea of storing the tokens is good, but I'm not so sure about the implementation here... For starters, calling out to three subprocesses to do simple text processing is really bad. We should be reading the output of nodetool ring or connecting to Cassandra's API directly, then formatting the output in Python. I also think we should use JSON instead of YAML, just for consistency with the listdir.json file.
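
Just to sketch what I mean (untested, and the DataStax driver isn't currently a tablesnap dependency, so this is only an illustration), something along these lines would read the tokens directly and write them out as JSON:

```python
import json

from cassandra.cluster import Cluster  # DataStax Python driver


def dump_local_tokens(path='tokens.json', host='127.0.0.1'):
    """Read this node's tokens over the native protocol and write them
    out as JSON, mirroring the listdir.json convention."""
    cluster = Cluster([host])
    try:
        session = cluster.connect()
        row = session.execute('SELECT tokens FROM system.local').one()
        tokens = sorted(row.tokens, key=int)
    finally:
        cluster.shutdown()
    with open(path, 'w') as f:
        json.dump({'tokens': tokens}, f, indent=2)
    return tokens
```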

I'm also a bit hesitant to add a dependency on having Cassandra running here... If Cassandra is restarted while tablesnap is running, we need to make sure tablesnap stays up. At a minimum, a lot more error handling is needed.

I'm definitely against setting up a timer... That should be a cron job that calls a separate script.

rcolidba commented 7 years ago

Right off the top, thanks for contributing to upstream tablesnap! It's great that people continue to find this codebase useful, and especially awesome when they want to share their improvements.

Below is conceptual commentary only, not implementation-specific.

1) Storing tokens is a good idea, though I wonder if a more generic approach to storing "cluster data you need alongside your backup set to actually use it" would be more future-proof. +1 from me.

2) I'm on the fence regarding a feature which resets all of the timestamps on files to be the time of backup instead of their on-disk time, but perhaps I don't understand the new paradigm of retention you're proposing. Before-and-after examples of the freshening feature for a given small set of files would be useful to understand the operation of the "freshening" in practice.

As currently conceived, tablesnap snapshots backup sets and sets their timestamps to their time on disk, then tablechop parses listdir.json files and assembles a set of keeper files using those timestamps.

It sounds like you are proposing a different paradigm where the operator is able to use bucket retention times to accomplish the same goal. I agree that this is a desirable goal, because tablechop is expensive, especially with bulk operations or compaction strategies which result in large amounts of short-lived SSTable files. The downside seems to be that the timestamps of files in S3 will now differ from their timestamp on disk?

A brief think about this new paradigm does suggest that it might be better. Have I correctly understood what you are proposing, and if so could you share your experience using this code on actual backup sets?

3) If we went in the direction of 2), I believe I would want this paradigm to replace the old, which probably creates a migration challenge. I am -1 on trying to maintain 2 different retention paradigms or trying to explain both to users.

4) As a minor quibble, I believe I am opposed in principle to tablesnap parsing remote listdir.json files; in the current paradigm (IIRC) that is only done by tablechop, and I think that is an appropriate separation of concerns. I also believe that tablesnap, as the most crucial component and the one which runs continuously, should have a design goal of relative lightweight simplicity.

juiceblender commented 7 years ago

Thanks guys!

@JeremyGrosser Yep - I never meant to include a timer in this project. I'll create my own systemd timer to suit my purposes.

Also, you're right on: with Java I would just query JMX directly. With Python I hadn't done enough research to choose between Jython, JPype and Jolokia. And as you said, it introduces a dependency, so maybe it's better to let people manage that separately themselves.

@rcolidba You're bang on. From what I can see there are 2 options for managing the lifecycle of backups:

  1. (The current way): A cron/systemd timer periodically calls tablechop and gets rid of everything older than a certain timestamp. I don't know whether it's expensive or not, as I haven't personally used tablechop; it sounds like it could be.

  2. (The way I was proposing): The lifecycle is managed by AWS. In this model we create a lifecycle policy that, say, expires SSTables after 30 days (so our backup retention period is 30 days); a sketch of such a policy follows this list.
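
For concreteness, such a policy could be set up along these lines (sketch only; the bucket name, rule ID and retention window are placeholders):

```python
import boto3

s3 = boto3.client('s3')

# Expire objects 30 days after they were last written; freshened keys keep
# moving that date forward, so SSTables still in a listdir.json never expire.
s3.put_bucket_lifecycle_configuration(
    Bucket='my-backup-bucket',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-stale-sstables',
            'Filter': {'Prefix': ''},
            'Status': 'Enabled',
            'Expiration': {'Days': 30},
        }]
    },
)
```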

With Cassandra backups, we typically upload a manifest file detailing which SSTables are involved in the backup. In tablesnap this has conveniently already been implemented: it's listdir.json.

So, every time a listdir.json is uploaded, we take every entry in it and 'freshen' the keys. This is a neat trick which copies an S3 key onto itself; it is inexpensive and quick and only serves to update the key's timestamp, 'resetting' it for the lifecycle policy. Over time, the SSTables that still make up the data set continuously get their lifecycle refreshed, while those that have been compacted out of Cassandra's data set eventually expire under the policy.

We have used it in production on over 1000 nodes, so we know it works. However, the model we follow is a periodic backup (which also freshens all keys based on the manifest/listdir.json) plus CLog upload (we wrote that ourselves, in Java).

And that's where the conflict with tablesnap comes in. Tablesnap uploads files the moment they appear on disk, so under the same model the number of API requests could be large, depending on workload! (AWS also charges per 1k requests.)

An example to illustrate:

Periodic backup: every day, say. So every day each node freshens all keys based on the manifest it backs up. Tablesnap: every time any node flushes an SSTable for any KS/Table combo, it freshens all keys. 😮

I originally thought this was a good idea, but now I think it's just plain weird. It doesn't fit this model that well, because tablesnap is unique in flushing everything to S3 the moment it's ready. I had to accept quirks such as there being no guarantee on which of the SSTable files gets put in the queue first; if it's Data.db (which is also the point at which we generate listdir.json), it necessarily can't freshen the new keys that are yet to be uploaded. In our own solution it's periodic, so we could.

I think I'll close this PR until I have more time to work on getting a tokens.yaml up in a better way, using one of the aforementioned libraries, if we do plan to introduce such a dependency... it does pass my local tests, but it just doesn't feel like it fits tablesnap's model. Also, as you said at the very end, it's an appropriate separation of concerns.

Thanks again. I will be working a bit more on tableslurp for now.