JeremyGrosser / tablesnap

Uses inotify to monitor Cassandra SSTables and upload them to S3
BSD 2-Clause "Simplified" License

tablechop deletes files which are currently used by Cassandra #79

Open Mortinke opened 7 years ago

Mortinke commented 7 years ago

tablechop checks only the last-modified date of the index_key file to decide whether the backed-up file will be deleted:

    for index_key in index_keys:
        if days_ago(index_key.last_modified) > args.age:
            break
        index_files_to_keep.add(index_key.name)

As I understand it, tablechop should delete only those files that are no longer necessary for restoring a backup within the specified retention time. Depending on the table size and the compaction strategy, some SSTables can live for weeks or months without being compacted. These files are mandatory for restoring the table, regardless of whether they have been in S3 for weeks or months. Currently active SSTables should not be deleted (unless a force parameter is used). Meanwhile, there can be small backed-up SSTables that have already been compacted and are no longer required for restoring.

I'd be grateful if we could add an additional verification of whether the file is currently being used by Cassandra (i.e. still exists on the filesystem).
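A minimal sketch of what such a verification might look like, assuming the deletion candidates can be mapped back to local paths. The helper names `live_paths` and `safe_to_delete`, the `data_dir` argument, and the `force` flag are all hypothetical and are not part of tablechop:

```python
import os

def live_paths(data_dir):
    """Collect the absolute path of every regular file under data_dir."""
    found = set()
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            found.add(os.path.abspath(os.path.join(root, name)))
    return found

def safe_to_delete(candidates, data_dir, force=False):
    """Return the candidates that no longer exist on the local filesystem.

    candidates: local paths that the age check has marked for deletion.
    force: hypothetical flag that skips the filesystem check entirely.
    """
    if force:
        return set(candidates)
    live = live_paths(data_dir)
    # Anything Cassandra still has on disk is treated as in use and kept.
    return {p for p in candidates if os.path.abspath(p) not in live}
```

With this kind of filter, a long-lived SSTable that is still on disk would be kept no matter how old its S3 copy is, while files that have been compacted away locally remain eligible for deletion.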

raags commented 7 years ago

@Mortinke The index file (-listdir.json) contains the list of all sstable files that were present when that particular sstable was being uploaded. So the index file has a snapshot with the complete list of files to restore to a state when that sstable was created. Tablechop loops through the files in the index file(s), and ensures these are not deleted.

Check files_to_keep here: https://github.com/JeremyGrosser/tablesnap/blob/master/tablechop#L87
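The keep-set behaviour described above can be sketched roughly as follows. This is not the tablechop source: the function name `files_to_keep`, the `(age_in_days, json_text)` input shape, and the assumed index layout (a JSON object mapping a directory to its list of file names) are illustrative assumptions.

```python
import json

def files_to_keep(indexes, max_age_days):
    """Build the set of files referenced by indexes inside the retention window.

    indexes: iterable of (age_in_days, json_text) tuples, newest first.
    json_text is assumed to look like {"/some/dir": ["file-a", "file-b"]}.
    """
    keep = set()
    for age_days, text in indexes:
        if age_days > max_age_days:
            break  # every later index is older still
        listing = json.loads(text)
        for dirname, filenames in listing.items():
            for filename in filenames:
                keep.add(dirname.rstrip("/") + "/" + filename)
    return keep
```

Under these assumptions, an SSTable that was uploaded months ago is still kept as long as some index inside the retention window lists it; only files that appear in no recent index fall out of the keep set.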

That said, there is a race here: multiple sstables can be uploaded in parallel, so if the node fails, some of the latest -listdir.json files might not be valid, because other files they reference may not have been uploaded. This is a problem as of now. @JeremyGrosser, is this a known issue?
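One way to detect such an incomplete index, sketched under the assumption that the list of uploaded key names and the file names listed in each index are already available in memory. The helpers `missing_from_bucket` and `latest_complete_index` are hypothetical and do not exist in tablesnap:

```python
def missing_from_bucket(listed_files, uploaded_keys):
    """Return the files an index references that were never uploaded."""
    return sorted(set(listed_files) - set(uploaded_keys))

def latest_complete_index(indexes, uploaded_keys):
    """Return the name of the newest index whose files are all uploaded.

    indexes: iterable of (index_name, listed_files) tuples, newest first.
    Returns None when no index is fully backed by uploaded files.
    """
    for name, listed_files in indexes:
        if not missing_from_bucket(listed_files, uploaded_keys):
            return name
    return None
```

A restore or cleanup step could then skip any index that still references missing files and fall back to the newest complete one.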