airbreather / StepperUpper

Some tools for making STEP happen.
MIT License

Fast-Path File Checking #9

Open airbreather opened 7 years ago

airbreather commented 7 years ago

Use Case

Due to the nature of setting up a pack, I've run variants of this process hundreds of times. At some point, I had fast-path support that would create .md5 files next to the files being checked; as long as the .md5 file's modification time was not older than that of the underlying file, the checksum stored in the *.md5 file would be used as-is instead of rehashing the file. A rough sketch of that check is below.
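
Roughly, the idea looked like this (Python purely for illustration, not the actual StepperUpper code; the function name is made up):

```python
import hashlib
import os

def checksum_with_sidecar(path: str) -> str:
    """Return the file's MD5, reusing a `<path>.md5` sidecar when it is
    at least as new as the file itself (illustrative sketch only)."""
    sidecar = path + ".md5"
    if (os.path.exists(sidecar)
            and os.path.getmtime(sidecar) >= os.path.getmtime(path)):
        with open(sidecar, "r", encoding="utf-8") as f:
            return f.read().strip()

    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    digest = md5.hexdigest()
    with open(sidecar, "w", encoding="utf-8") as f:
        f.write(digest)
    return digest
```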

That fast path was great for development, but it created a colossal mess in the download and Skyrim Data directories, so I removed it. There's something to be said for this idea, though: if you know for a fact that you have all the required files (perhaps because you're a pack creator, perhaps because a task failed before and you're re-running the process, perhaps because you're really smart), then you shouldn't be required to subject yourself to the full file checking.

One wrinkle in this: a user could rename all their files to a bunch of garbage, and as long as they're the right files (by length and MD5 checksum), StepperUpper won't know the difference.

Proposed Solution 1

An optional command-line argument (or the equivalent) that lets a user say "All my files have their canonical names, so you can assume that if the length and name match the values specified in the XML file, then the MD5 checksum matches as well".

If a full check would have indicated a mismatch, behavior is undefined.
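
In other words, the fast check would reduce to something like this (a minimal sketch, assuming the expected name and length come from the pack XML; the function name is made up):

```python
import os

def passes_fast_check(path: str, expected_name: str, expected_length: int) -> bool:
    """Under the hypothetical "assume canonical names" flag, a file is
    accepted when its name and length match the pack XML; the MD5 is
    simply not computed at all."""
    return (os.path.basename(path) == expected_name
            and os.path.getsize(path) == expected_length)
```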

Proposed Solution 2

An optional command-line argument (or the equivalent) that lets a user instruct us to cache MD5 checksums persistently, so that a file does not need to be re-checked after it has been checked once. Ideally this would work in a way that doesn't require us to create *.md5 files galore. Perhaps make it two command-line arguments: one that says "use cached checksums if they exist", and one that says "create cached checksums if we did not use them originally".

If a full check would have indicated a mismatch, behavior is undefined.
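
Something along these lines, with a single cache store instead of scattered sidecar files (Python sketch only; the cache shape and flag names are assumptions, not a design decision):

```python
import hashlib
import os

def get_md5(path: str, cache: dict, use_cache: bool, update_cache: bool) -> str:
    """Hypothetical persistent-cache flavor: `cache` maps absolute paths to
    (size, mtime, md5) tuples and is loaded from / saved to one file
    elsewhere, so no *.md5 files get scattered around."""
    key = os.path.abspath(path)
    stat = os.stat(path)
    if use_cache and key in cache:
        size, mtime, digest = cache[key]
        if size == stat.st_size and mtime == stat.st_mtime:
            return digest

    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    digest = md5.hexdigest()
    if update_cache:
        cache[key] = (stat.st_size, stat.st_mtime, digest)
    return digest
```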

Concerns

I'm particularly concerned about files that get preallocated to the right size (with garbage data filling in the parts that aren't downloaded yet), with the correct name, and then have the "true" data written into the right spots of the file, as aria2 does. This breaks Proposed Solution 1 if the user cancels the download (or if it simply hasn't completed yet for some other reason).

I'm also a little concerned about data storage irregularities with Proposed Solution 2. If a bit somehow gets flipped on the storage medium, the corruption would not be detected and something downstream will probably fail at random. That's very unlikely to ever happen, but I think the likelihood goes up a bit with the "speed is top priority" way that many people likely set up their hardware rigs.

Overall, I'm not really sure this is worthwhile enough to spend the time on right now... the MD5 checking process just seems too fast to be worth optimizing.

airbreather commented 7 years ago

If we assume (only behind an optional --pleaseAssumeThis flag, not by default) that any md5sum-formatted lines in any text files matching *.md5 in a source directory contain checksums that we can trust (discarding lines that don't match the format), then, combined with file lengths as an additional quick check, this is actually pretty safe.
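
Gathering those trusted checksums could look roughly like this (illustrative Python only; the md5sum line format itself is real, everything else here is made up):

```python
import glob
import os
import re

# md5sum output lines look like "d41d8cd98f00b204e9800998ecf8427e  filename"
# (an optional '*' before the name marks binary mode).
MD5SUM_LINE = re.compile(r"^([0-9a-fA-F]{32}) [ *](.+)$")

def gather_trusted_checksums(source_dir: str) -> dict:
    """Collect checksums from every *.md5 file in the directory,
    discarding lines that don't match the md5sum format."""
    trusted = {}
    for md5_path in glob.glob(os.path.join(source_dir, "*.md5")):
        with open(md5_path, "r", encoding="utf-8", errors="replace") as f:
            for line in f:
                match = MD5SUM_LINE.match(line.rstrip("\n"))
                if match:
                    digest, name = match.groups()
                    trusted[name] = digest.lower()
    return trusted
```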

krageon commented 7 years ago

Having dealt with a couple of re-hashing iterations and felt this pain, I don't see any particular harm in having a little database of hashes with filenames, modification dates, and sizes; a rough sketch is below. It would save a huge amount of time when iterating over a "did I download everything correctly?" set a few times, and it should get around the most glaring issues. Having a force-recheck option that re-does all the hashes and re-caches them could take care of the final pass. You'd have the best of both worlds: fast intermittent iteration and an optional final thorough step.
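
Something like this is what I have in mind (purely an illustrative SQLite sketch in Python, not anything that exists in StepperUpper today):

```python
import sqlite3

# One possible shape for the "little database": a single SQLite file keyed
# by path, storing size + mtime so stale entries can be detected; a
# force-recheck mode would simply ignore and rewrite every row.
SCHEMA = """
CREATE TABLE IF NOT EXISTS checked_files (
    path  TEXT PRIMARY KEY,
    size  INTEGER NOT NULL,
    mtime REAL    NOT NULL,
    md5   TEXT    NOT NULL
);
"""

def open_hash_db(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn
```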

As for the messiness in directories, I don't see it being a massive issue. If it is, another flag for disabling caching might be an option.