NICTA / scoobi

A Scala productivity framework for Hadoop.
http://nicta.github.com/scoobi/

Add an expiry date on snapshots and a version number #257

Closed etorreborre closed 11 years ago

etorreborre commented 11 years ago

From the mailing-list:

BTW perhaps a useful addition would be an expiration policy on the snapshots. Right now, I wrote a little script that does a delete on everything older than a week -- but it'd be kinda cool if you could set a max snapshot age (e.g. 1 week) and have a version number that you can easily bump. (So when you make code/logic changes, you just increment a number in the .snapshot call and it'll get recomputed.)

espringe commented 11 years ago

And just to expand a little, the reason I'm not doing "snapshot-foo-v3" and bumping that string, is because of the amount of garbage it would leave around. So I think the clean-up aspect should be considered an essential part.

A real shame is that HDFS doesn't appear to support extended file attributes, which would've seemed a perfect place to store this information. In any case, it's probably bad to have two different code paths for HDFS vs non-HDFS. I guess the two options for storing the information are either in the checkpoint's filename itself, or in some separate file (along with the created file's inode/creation time, so you know you're referring to the correct one). But personally, I'd prefer abusing the filename for adding metadata -- as it's more transparent.

My first thought was that the expiration date would use the file creation date to determine whether the snapshot is valid or not -- but it's probably a pain dealing with timezone stuff (my cluster uses a different timezone than my client, lol). And it might actually be a bit too limiting.

Like for instance, say the input to my jobs is based on a weekly database dump -- one that is promised to be available by 0700 on Monday. So actually my expiration policy shouldn't be a relative 1-week expiration; rather, I'd set it to an absolute time: the following Monday at 0700. As such, this information would need to be stored somewhere.
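To make the "following Monday at 0700" idea concrete, here is a minimal sketch using java.time. The names `WeeklyDeadline` and `nextMondaySeven` are made up for illustration and are not part of Scoobi. The zone is passed in explicitly via the `ZonedDateTime`, so client and cluster can agree on the same instant.

```scala
import java.time.{DayOfWeek, LocalTime, ZonedDateTime}
import java.time.temporal.TemporalAdjusters

// Hypothetical helper, not a Scoobi API.
object WeeklyDeadline {
  // Returns the first Monday 07:00 (in the zone of `now`) strictly after `now`.
  def nextMondaySeven(now: ZonedDateTime): ZonedDateTime = {
    val sevenAm = LocalTime.of(7, 0)
    // Monday of this week if today is Monday, otherwise the next Monday, at 07:00.
    val candidate =
      now.`with`(TemporalAdjusters.nextOrSame(DayOfWeek.MONDAY)).`with`(sevenAm)
    // If it is already Monday 07:00 or later, move on to the following week.
    if (candidate.isAfter(now)) candidate else candidate.plusWeeks(1)
  }
}
```

Storing the result as `toInstant.toEpochMilli` rather than as a local date string would probably sidestep the client/cluster timezone mismatch mentioned above.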

Now, if we want to completely over-engineer it -- it could be generic.

trait ExpirationPolicy {
     def policyName: String  // no funny characters, or =
     def createFileNameApendix: String  // metadata to store in the file name
     def isSnapshotValid(fileNameApendix: String): Boolean  // true if the stored snapshot can be reused
}
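As a rough sketch of how two such policies might look, the block below repeats the trait (with `Boolean` as the return type) so that it compiles on its own. `VersionPolicy` and `AbsolutePolicy` are hypothetical names, and the injectable `now` clock is an assumption added only to make the expiry check testable.

```scala
trait ExpirationPolicy {
  def policyName: String // no funny characters, or =
  def createFileNameApendix: String
  def isSnapshotValid(fileNameApendix: String): Boolean
}

// Hypothetical: the snapshot is valid only while the stored version
// equals the version currently requested in the code.
final case class VersionPolicy(version: Int) extends ExpirationPolicy {
  val policyName = "version"
  def createFileNameApendix: String = version.toString
  def isSnapshotValid(fileNameApendix: String): Boolean =
    fileNameApendix == version.toString
}

// Hypothetical: the snapshot is valid until an absolute instant, stored as
// epoch milliseconds so that client and cluster timezones do not matter.
final class AbsolutePolicy(
    expiresAtMillis: Long,
    now: () => Long = () => System.currentTimeMillis()
) extends ExpirationPolicy {
  val policyName = "absolute"
  def createFileNameApendix: String = expiresAtMillis.toString
  // Validity is judged from the value stored in the file name;
  // an unparsable appendix counts as expired.
  def isSnapshotValid(fileNameApendix: String): Boolean =
    scala.util.Try(fileNameApendix.toLong).toOption.exists(expiry => now() < expiry)
}
```

A relative policy ("1 week") could presumably be expressed as an `AbsolutePolicy` whose expiry is the creation time plus the maximum age.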

I don't have an immediate use for it, but I can definitely see one. Like in my custom ExpirationPolicy, I could actually query an external service to see if anything has changed or a new data dump is available (and decide what I want to do if it's available late or early).

And implementations for absolute/relative/version-bumping policies could be provided.

And I suppose the filename could look like: scoobiCheckPointName + "~$ScoobiCheckPoint$~" + policyName + "=" + expirationPolicyString

(The policy name is there, so if you change your expiration policy, you don't feed a garbage string to the new one)

(It's probably a good idea to leave the prefix as the snapshot name, to make it easy to do rm checkpointName*)
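The naming scheme above could be encoded and decoded roughly as follows. This is only a sketch: `CheckpointFileName`, `encode`, `decode` and `appendixFor` are invented names, and only the marker string and the `policyName=...` layout come from the proposal.

```scala
// Hypothetical helper, not a Scoobi API.
object CheckpointFileName {
  // Separator between the checkpoint name and the policy metadata.
  val Marker = "~$ScoobiCheckPoint$~"

  def encode(checkpointName: String, policyName: String, appendix: String): String =
    checkpointName + Marker + policyName + "=" + appendix

  // Splits a file name back into (checkpointName, policyName, appendix).
  // Returns None when the marker or the "=" is missing.
  def decode(fileName: String): Option[(String, String, String)] = {
    val markerAt = fileName.lastIndexOf(Marker)
    if (markerAt < 0) None
    else {
      val name = fileName.substring(0, markerAt)
      val rest = fileName.substring(markerAt + Marker.length)
      val eq   = rest.indexOf('=') // policy names must not contain "="
      if (eq <= 0) None
      else Some((name, rest.substring(0, eq), rest.substring(eq + 1)))
    }
  }

  // Yields the appendix only if it was written by the expected policy,
  // so a changed policy is never fed another policy's string.
  def appendixFor(fileName: String, expectedPolicy: String): Option[String] =
    decode(fileName).collect { case (_, policy, appendix) if policy == expectedPolicy => appendix }
}
```

Because the checkpoint name stays as the prefix, a glob on the name still matches every variant. One caveat: the `$` characters in the marker would need quoting in a shell.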

:D

etorreborre commented 11 years ago

The existence of a _SUCCESS file should also be included in the "expiry" policy of a checkpoint. See the discussion here
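A minimal sketch of that combined check, using java.nio.file on a local directory as a stand-in for the Hadoop FileSystem API (an assumption made so the example runs without Hadoop). `SuccessMarker`, `hasSuccessMarker` and `isUsable` are hypothetical names.

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper; a real version would go through Hadoop's FileSystem.
object SuccessMarker {
  // True if the checkpoint directory contains a _SUCCESS file.
  def hasSuccessMarker(checkpointDir: Path): Boolean =
    Files.isRegularFile(checkpointDir.resolve("_SUCCESS"))

  // A checkpoint is reusable only if the job that wrote it finished
  // and the expiration policy still accepts it.
  def isUsable(checkpointDir: Path, policyAccepts: Boolean): Boolean =
    hasSuccessMarker(checkpointDir) && policyAccepts
}
```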

etorreborre commented 11 years ago

fixed with f974c50