JeremyGrosser / tablesnap

Uses inotify to monitor Cassandra SSTables and upload them to S3
BSD 2-Clause "Simplified" License

Tableslurp to be able to restore CLogs #88

Closed juiceblender closed 6 years ago

juiceblender commented 6 years ago

Hi,

Just wondering what people thought about making tableslurp able to restore CLogs. An example output I have is here:

tableslurp -n 172.31.10.6 --aws-region ap-northeast-2 --commitlogs /cassandra/commitlog_archive lerhconsultest /cassandra/data backup-test/
tableslurp [2017-10-19 00:22:21,218] INFO Building fileset
tableslurp [2017-10-19 00:22:21,724] INFO Will now try to test writing to the target dir backup-test/
tableslurp [2017-10-19 00:22:21,724] INFO Will write to backup-test/
tableslurp [2017-10-19 00:22:21,724] INFO Running
tableslurp [2017-10-19 00:22:21,724] INFO Pushing file CommitLog-6-1508276606523.log onto queue
tableslurp [2017-10-19 00:22:21,724] INFO Pushing file CommitLog-6-1508276606522.log onto queue
tableslurp [2017-10-19 00:22:21,725] INFO Pushing file CommitLog-6-1508276606521.log onto queue
tableslurp [2017-10-19 00:22:21,725] INFO Thread #0 processing items
tableslurp [2017-10-19 00:22:21,726] INFO Thread #1 processing items
tableslurp [2017-10-19 00:22:21,728] INFO Thread #2 processing items
tableslurp [2017-10-19 00:22:21,729] INFO Thread #3 processing items
tableslurp [2017-10-19 00:22:21,766] INFO Thread #3 finished processing
tableslurp [2017-10-19 00:22:21,772] INFO Downloading 172.31.10.6:/cassandra/commitlog_archive/CommitLog-6-1508276606523.log from lerhconsultest to backup-test/CommitLog-6-1508276606523.log
tableslurp [2017-10-19 00:22:21,791] INFO Downloading 172.31.10.6:/cassandra/commitlog_archive/CommitLog-6-1508276606522.log from lerhconsultest to backup-test/CommitLog-6-1508276606522.log
tableslurp [2017-10-19 00:22:21,796] INFO Downloading 172.31.10.6:/cassandra/commitlog_archive/CommitLog-6-1508276606521.log from lerhconsultest to backup-test/CommitLog-6-1508276606521.log
tableslurp [2017-10-19 00:22:22,334] INFO Thread #2 finished processing
tableslurp [2017-10-19 00:22:22,431] INFO Thread #0 finished processing
tableslurp [2017-10-19 00:22:22,531] INFO Thread #1 finished processing
tableslurp [2017-10-19 00:22:22,531] INFO My job is done.

In this case, the argument passed to --commitlogs is the key in the bucket where the commit logs live. The positional arguments are the same as before: the bucket name, the key in the bucket where the SSTables live, and the target location to download to.

The logic relies on CommitLog filenames being timestamped:

In CommitLog-6-1508276606523.log, 1508276606523 is the time in UTC in milliseconds. We order the CommitLog files by timestamp in descending order and download them one by one; as soon as we reach a CLog whose timestamp is not larger than oldest_timestamp, we stop.
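To make the stopping rule concrete, here is a minimal sketch of the selection step. It is not the actual tableslurp code: the function names are made up for illustration, the filename pattern is inferred from the log output above, and oldest_timestamp is assumed to be a millisecond value supplied by the caller.

```python
import re

# Pattern inferred from the filenames in the example output above
# (e.g. CommitLog-6-1508276606523.log). Assumption: the last numeric
# field is the value to compare against oldest_timestamp.
COMMITLOG_RE = re.compile(r"^CommitLog-(\d+)-(\d+)\.log$")


def commitlog_timestamp(filename):
    """Return the trailing numeric field of a commit log name, or None."""
    match = COMMITLOG_RE.match(filename)
    return int(match.group(2)) if match else None


def select_commitlogs(filenames, oldest_timestamp):
    """Hypothetical helper: pick the commit logs to download, newest first.

    Files are ordered by timestamp in descending order; selection stops
    at the first file whose timestamp is not larger than oldest_timestamp.
    """
    stamped = []
    for name in filenames:
        ts = commitlog_timestamp(name)
        if ts is not None:  # ignore anything that is not a commit log
            stamped.append((ts, name))
    stamped.sort(reverse=True)

    selected = []
    for ts, name in stamped:
        if ts <= oldest_timestamp:
            break  # everything from here on is older, so stop
        selected.append(name)
    return selected
```

Called with the three filenames from the example and a cutoff of 1508276606521, this would select only the two newer files. A very old cutoff selects every archived commit log.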

What do people think? Would it be something useful to have in tablesnap? You should rarely need CLogs in a restore, because tablesnap already uploads files to S3 the moment they show up in the watched directories, but in cases where absolute consistency is needed (as in my case), it could be useful.

My reservation is that if a table is rarely written to (and hence rarely flushed), its listdir.json will be very old. That means we would have to download a great many CommitLogs! (But if the requirements are there, we have no choice, because a CLog may hold the single transaction that wasn't flushed into an SSTable...)