1206e40 just adds an option to log every file tablechop removes from S3, useful for auditing its work.
06bca99 adds one more try/except for a missing file. I logged all pyinotify events, compared them against the tablesnap logs, and saw this sequence:
File A fires IN_MOVED_TO
File A fires IN_DELETE (just milliseconds later)
Tablesnap logs "Failed uploading A" (which is followed by the os.kill())
File B fires its IN_MOVED_TO
In our case file B never made it to S3. Puppet would eventually restart tablesnap, but during those gaps I was missing a lot of backups. I believe that while tablesnap was consulting S3 via key_exists(), compaction removed the file, causing the subsequent open() to fail.
Overall, in a 12-node cluster I was seeing the "Failed uploading X" error 5-10 times per day. With this try/except in place I haven't seen any so far, and I'm not missing any backups to date.