Closed etorreborre closed 11 years ago
And just to expand a little: the reason I'm not using a name like "snapshot-foo-v3" and bumping that string is the amount of garbage it would leave around. So I think the clean-up aspect should be considered an essential part.
A real shame is that HDFS doesn't appear to support extended file attributes, which would have seemed a perfect place to store this information. In any case, it's probably bad to have two different code paths for HDFS vs non-HDFS. I guess the two options for storing the information are either in the checkpoint's filename itself, or in a separate file (along with the created file's inode/creation time, so you know you're referring to the correct one). Personally, I'd prefer abusing the filename for the metadata, as it's more transparent.
My first thought was that the expiration would be derived from the file creation date, to determine whether the checkpoint is still valid, but it's probably a pain dealing with timezone stuff (my cluster uses a different timezone than my client, lol). And it might actually be a bit too limiting.
For instance, say the input to my jobs is based on a weekly database dump, one that is promised to be available by 0700 on Monday. Then my expiration policy shouldn't be "expire one week after creation" but rather an absolute time: the following Monday at 0700. As such, this information would need to be stored somewhere.
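To make the "following Monday at 0700" idea concrete, here is a minimal sketch using java.time. The names `WeeklyDumpExpiry` and `nextMondaySeven` are made up for illustration and are not part of Scoobi. Storing the result as epoch seconds is one possible way to sidestep the client/cluster timezone mismatch.

```scala
import java.time.{DayOfWeek, LocalTime, ZonedDateTime}
import java.time.temporal.TemporalAdjusters

object WeeklyDumpExpiry {
  // Hypothetical helper: the first Monday 07:00 strictly after `now`,
  // evaluated in the timezone carried by `now`.
  def nextMondaySeven(now: ZonedDateTime): ZonedDateTime = {
    val candidate = now
      .`with`(TemporalAdjusters.nextOrSame(DayOfWeek.MONDAY)) // this Monday if today is Monday
      .`with`(LocalTime.of(7, 0))                             // set the time of day to 07:00
    // If it is already Monday 07:00 or later, roll over to next week.
    if (candidate.isAfter(now)) candidate else candidate.plusWeeks(1)
  }

  // Timezone-independent form that could be stored next to the checkpoint.
  def asEpochSeconds(expiry: ZonedDateTime): Long =
    expiry.toInstant.getEpochSecond
}
```

The expiry is computed once, when the checkpoint is written, in whatever timezone the dump schedule is defined in. After that only the epoch value needs to be compared.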
Now, if we want to completely over-engineer it -- it could be generic.
trait ExpirationPolicy {
  def policyName: String // no funny characters, or =
  def createFileNameApendix: String // metadata to embed in the checkpoint filename
  def isSnapshotValid(fileNameApendix: String): Boolean
}
I don't have an immediate use for it, but I can definitely see one. Like in my custom ExpirationPolicy, I could actually query an external service to see if anything has changed or a new data dump is available. (And deciding what I want to do if it's available late or early)
And an implementation for absolute/relative/versionbumping could be provided.
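As a rough sketch of what two of those implementations might look like: the trait is restated here so the block compiles on its own, and `AbsoluteExpiry`, `VersionBump` and the injectable `now` clock are hypothetical names invented for this example, not existing Scoobi classes.

```scala
import java.time.Instant
import scala.util.Try

trait ExpirationPolicy {
  def policyName: String // no funny characters, or =
  def createFileNameApendix: String
  def isSnapshotValid(fileNameApendix: String): Boolean
}

// Valid until a fixed instant. The appendix is that instant in epoch seconds,
// so it does not depend on the client's or the cluster's timezone.
// `now` is injectable only to make the policy testable.
final class AbsoluteExpiry(expiresAt: Instant, now: () => Instant = () => Instant.now())
    extends ExpirationPolicy {
  val policyName = "absolute"
  def createFileNameApendix: String = expiresAt.getEpochSecond.toString
  def isSnapshotValid(fileNameApendix: String): Boolean =
    // An unparsable appendix is treated as an expired snapshot.
    Try(Instant.ofEpochSecond(fileNameApendix.toLong)).toOption
      .exists(expiry => now().isBefore(expiry))
}

// Version bumping: the snapshot is valid only while the stored version
// matches the version the job currently asks for.
final class VersionBump(version: Int) extends ExpirationPolicy {
  val policyName = "version"
  def createFileNameApendix: String = version.toString
  def isSnapshotValid(fileNameApendix: String): Boolean =
    fileNameApendix == version.toString
}
```

A relative policy could reuse `AbsoluteExpiry` by passing in the current time plus a duration when the checkpoint is created.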
And I suppose the filename could look like: scoobiCheckPointName + "~$ScoobiCheckPoint$~" + policyName + "=" + expirationPolicyString
(The policy name is there, so if you change your expiration policy, you don't feed a garbage string to the new one)
(It's probably a good idea to leave the prefix as the snapshot name, to make it easy to do rm checkpointName*)
:D
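A minimal sketch of building and parsing such a filename, assuming the checkpoint name never contains the marker string and the policy name never contains "=". `CheckpointFileName`, `encode` and `decode` are hypothetical names used only for this example.

```scala
object CheckpointFileName {
  // Marker separating the checkpoint name from the policy metadata.
  val Separator = "~$ScoobiCheckPoint$~"

  // checkpointName + marker + policyName + "=" + policy-specific string
  def encode(checkpointName: String, policyName: String, policyString: String): String =
    checkpointName + Separator + policyName + "=" + policyString

  // Returns (checkpointName, policyName, policyString),
  // or None if the filename does not follow the format above.
  def decode(fileName: String): Option[(String, String, String)] = {
    val sepAt = fileName.indexOf(Separator)
    if (sepAt < 0) None
    else {
      val rest = fileName.substring(sepAt + Separator.length)
      val eqAt = rest.indexOf('=')
      if (eqAt <= 0) None // missing "=" or empty policy name
      else Some((fileName.substring(0, sepAt), rest.substring(0, eqAt), rest.substring(eqAt + 1)))
    }
  }
}
```

The caller would compare the decoded policy name with the current policy's `policyName` before handing the policy string over, and treat a mismatch as an expired checkpoint.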
The existence of a _SUCCESS file should also be included in the "expiry" policy of a checkpoint. See the discussion here
fixed with f974c50
From the mailing-list: