Hammerspoon / hammerspoon

Staggeringly powerful macOS desktop automation with Lua
http://www.hammerspoon.org
MIT License

Adding file hashing functions to hs.hash #2544

Open asmagill opened 3 years ago

asmagill commented 3 years ago

Rather than fill up #2514, it has come time to move this aspect of the discussion to its own thread.

The proposal is to add functions to provide hashes for individual files and for directories and their contents.

I think I'm leaning in @dge9's direction that when traversing a directory, a report of the files with individual hashes should be returned. Outside of the uses being considered in the above-mentioned pull, the most common use I can think of would be to compare the hashes of files against those reported elsewhere (e.g. on a maintainer's web site), and they will generally be listing a hash per file rather than relying on the file contents being merged in just the correct way before generating the hash.

As to the question of .DS_Store, what makes the most sense to me is to add an argument which allows you to pass in a table of strings specifying files or subdirectories to ignore. I can imagine using this against, say, a thumb drive, to verify its contents haven't changed, and there are about 4 (5? it's been a while) separate files I generally create on these to minimize the crap that Finder and Spotlight put on them.

The question is symbolic links... in order to resolve them and make sure that pathological loops can't lock things up, I had to add code to track the directories already visited which took just under 1.5 times the time that my original function did... both under a second, though, and that was traversing 515 directories and including 850 files.
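The visited-directory tracking described above can be sketched roughly as follows. This is Python, not the module's own code, and `walk_following_links` is a hypothetical name used only for illustration. It assumes that resolving each directory to its real path and remembering it is enough to stop loops.

```python
import os

def walk_following_links(root):
    """Yield file paths under root, following directory symlinks.

    Each real directory is traversed at most once, so a link that points
    back at an ancestor cannot cause an endless loop.
    """
    visited = set()
    stack = [root]
    while stack:
        directory = stack.pop()
        real = os.path.realpath(directory)
        if real in visited:
            continue  # already traversed via another path, or a loop
        visited.add(real)
        with os.scandir(directory) as entries:
            for entry in sorted(entries, key=lambda e: e.name):
                if entry.is_dir(follow_symlinks=True):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=True):
                    yield entry.path
                # Broken links are neither files nor directories: skipped.
```

The extra `realpath` call and set lookup per directory is the kind of bookkeeping that would account for a slower traversal than a plain walk.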

Not sure what the changes to the report format may cause, because it could theoretically be a pretty large string (or array of strings, not sure which yet) that has to be piped back into Lua, so this difference might be inconsequential... I'll know more tomorrow when I get a chance to work on the refactoring.

In the meantime, the main questions I have are:

  1. objections to moving to the report of individual files approach?
  2. do we include a flag to optionally attempt to resolve symbolic links and include them (and possibly their contents, if they point to a directory)? Or do we simply skip them?
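For illustration, the "report of individual files" approach with an ignore list, and with symbolic links simply skipped, might look something like this. It is a Python sketch under stated assumptions, not the proposed Hammerspoon API: `hash_file`, `directory_report`, and the `ignore` parameter are hypothetical names.

```python
import hashlib
import os

def hash_file(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of one file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def directory_report(root, algorithm="sha256", ignore=()):
    """Map each file under root (by relative path) to its hex digest.

    Names listed in `ignore` are skipped whether they are files or
    subdirectories; symbolic links are skipped rather than resolved.
    """
    ignored = set(ignore)
    report = {}
    for current, dirnames, filenames in os.walk(root, followlinks=False):
        # Prune ignored and symlinked subdirectories in place so os.walk
        # never descends into them; sort for a repeatable traversal order.
        dirnames[:] = sorted(
            d for d in dirnames
            if d not in ignored and not os.path.islink(os.path.join(current, d))
        )
        for name in sorted(filenames):
            full = os.path.join(current, name)
            if name in ignored or os.path.islink(full):
                continue
            report[os.path.relpath(full, root)] = hash_file(full, algorithm)
    return report
```

A call such as `directory_report("/Volumes/Thumb", ignore=[".DS_Store"])` (the path is only an example) would return one entry per file, which can then be compared against a published list of per-file hashes.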

latenitefilms commented 3 years ago

Having the ability to define what file extensions should be ignored (or allowed) would be great. It would be awesome if you could tell it to only find .lua files, but on the flip side, find everything except .DS_Store files, etc.

Even if the main function just reports individual files in a table, it would still be handy to have a "helper" function which returns a single hash.

I trust whatever you decide in regards to sym-links, however I'd be tempted to just ignore them. If the user wants hashes of sym-links, they can always do this work themselves before hashing?
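One way to picture the allow/ignore idea is a small predicate applied to each file name before hashing. The sketch below is Python with glob-style patterns; the function name `wanted` and its `allow` and `ignore` parameters are assumptions made for illustration, not part of any existing API.

```python
from fnmatch import fnmatchcase

def wanted(name, allow=None, ignore=()):
    """Decide whether a file name should be hashed.

    `ignore` lists glob patterns that always exclude a name.
    `allow`, when given, lists glob patterns of which at least one
    must match for the name to be included.
    """
    if any(fnmatchcase(name, pattern) for pattern in ignore):
        return False
    if allow is None:
        return True
    return any(fnmatchcase(name, pattern) for pattern in allow)
```

With this shape, `wanted(name, allow=["*.lua"])` covers the "only .lua files" case and `wanted(name, ignore=[".DS_Store"])` covers the "everything except" case.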

latenitefilms commented 3 years ago

Eventually it would be great to also support MD5, SHA1 and CRC32b checksums.

This might be of interest?

https://github.com/jerolimov/NSHash

asmagill commented 3 years ago

Note to self: DOCUMENT LUASKIN!!!

Was planning to implement object wrapper for this to minimize transfer of potentially huge report table for hashed directory unless absolutely necessary and then come to find that I already did this for hs.doc in a generic way for re-use in situations like this and completely forgot about it... 🤦‍♂️

latenitefilms commented 3 years ago

Isn't it documented here?

https://www.hammerspoon.org/docs/LuaSkin/

asmagill commented 3 years ago

Incomplete and probably out of date. And we have some lua land helpers (well one, but it's on my list to add a couple of more) that aren't included.

Is that updated when @cmsj does release builds? Because if it's the original one I created as a POC with AppleDoc then it's definitely out of date.

And even if it's kept current, if made with AppleDoc, it's probably got at least some formatting errors if not outright omissions as well because last I checked, the app was no longer being actively maintained and every time I run it locally, I get about 50 warnings of things ignored or missing.

It's been a low priority, but off and on I look for a good replacement for AppleDoc and haven't really found one.

latenitefilms commented 3 years ago

It says it was updated Tuesday, September 29, 2020, so I'm ASSUMING it gets automatically updated when @cmsj pushes out a release?

asmagill commented 3 years ago

@latenitefilms, for SHA1/224/256/384/512 and MD5, I can use the CommonCrypto library... is there something similar for CRC32b checksums?


For these additions, I'm thinking rather than replicate the individual functions like we currently do in hs.hash for generating a hash of a simple string, there should be one function that takes as an argument the algorithm to use. Something like:

Unless there are objections, I think I'm going to ignore symlinks within a directory traversal... it opens up too many questions... what to do about broken links? what if multiple links point to the same directory -- do we include those same files multiple times? should files in symlinked dir be reported with actual path or one from initial starting point? Plus, preventing loops requires change in approach that slows down traversal.

Thoughts?
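As a rough picture of "one function that takes the algorithm as an argument", here is a Python sketch. It is not the proposed hs.hash signature: `compute_hash` and its parameters are hypothetical, and the optional `key` is assumed to switch the function to the HMAC variant of the same algorithm.

```python
import hashlib
import hmac

def compute_hash(algorithm, data, key=None):
    """Return the hex digest of `data` using the named algorithm.

    When `key` is supplied, the keyed HMAC variant is computed instead,
    so a single entry point covers both the plain and the HMAC cases.
    """
    if key is None:
        return hashlib.new(algorithm, data).hexdigest()
    return hmac.new(key, data, algorithm).hexdigest()
```

A caller would then write `compute_hash("sha256", data)` or `compute_hash("sha256", data, key=secret)` instead of picking between separately named functions.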

asmagill commented 3 years ago

(For CRC32, see zlib.h at /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/zlib.h)
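For reference, zlib's CRC32 can be fed a file in chunks by passing the running value back in on each call. The sketch below uses Python's `zlib` module, which wraps the same library; `crc32_of_file` is a hypothetical helper name.

```python
import zlib

def crc32_of_file(path, chunk_size=65536):
    """Return the zlib CRC32 of a file as 8 lowercase hex digits."""
    checksum = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            # Passing the previous value continues the running checksum.
            checksum = zlib.crc32(chunk, checksum)
    return format(checksum & 0xFFFFFFFF, "08x")
```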

latenitefilms commented 3 years ago

All sounds great to me! Can't wait to take hs.hash.forDirectory() for a spin!

latenitefilms commented 3 years ago

@asmagill - Did you ever get hs.hash.forDirectory() up and running?

asmagill commented 3 years ago

Actually no, this kind of got forgotten in the last couple of months.

What was the final behavior we decided on for directories? Hash individual files and return a table which could then be concatenated and hashed if desired, or iterate through in a repeatable manner and return one hash for the "whole contents" of the directory?

latenitefilms commented 3 years ago

I'd prefer a single hash for the directory, so that I can use it as a check to see if the directory contents have changed.

asmagill commented 3 years ago

See comment above (https://github.com/Hammerspoon/hammerspoon/issues/2544#issuecomment-702584248) -- I think this might be the most flexible... hs.hash.hashDirectoryReport(hash, [key], hs.hash.forDirectory(hash, [key], path, [modifiers])) would give you what you want while leaving it open to easily create the kind of reports you sometimes see on web sites where they list individual files and their specific hash value.

Objections?
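To make the two-step idea concrete: a per-file report can be reduced to a single directory hash by feeding its entries to one digest in a fixed order. The sketch below is Python and only assumes the report is a mapping of relative path to hex digest; `hash_directory_report` and the "digest, two spaces, path" line format are illustrative choices, not the behavior of the functions named above.

```python
import hashlib

def hash_directory_report(report, algorithm="sha256"):
    """Reduce a {relative_path: hex_digest} report to one hex digest.

    Entries are processed in sorted path order so the result does not
    depend on the order in which the files were discovered.
    """
    digest = hashlib.new(algorithm)
    for relative_path in sorted(report):
        line = f"{report[relative_path]}  {relative_path}\n"
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()
```

Because the path is part of each line, renaming a file changes the combined hash even when no file contents change, which suits the "has this directory changed?" use case.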

latenitefilms commented 3 years ago

Sounds great!

latenitefilms commented 2 years ago

@asmagill - Reminder, I'd still love to see this happen if possible? Sorry - I know you're busy and Hammerspoon is lower on the priority list, but just pinging you in case you do have time to look at it again.

asmagill commented 2 years ago

I'm going to have some free time in late May/early June, and I hope to revisit a number of half finished Hammerspoon enhancements I've worked on. I'll make sure to add this to the list.

asmagill commented 1 year ago

FYI, revisiting this (and portions of #2514). New thoughts or additional ideas should be put here, and I'm going to try to have something ready for testing in the next week. I've got an idea that modularizes the actual hashing function so that a new hash type becomes available to all of the helper functions with just a few minor tweaks, rather than the current approach, which requires separate functions and code duplication. I have started a rewrite of hs.hash (though maintaining backwards compatibility through lua __metamethod magic) that I hope to have available soon.
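The general shape of that idea, a single registry of algorithms plus a fallback lookup that keeps the old per-algorithm function names working, can be sketched in Python, where `__getattr__` plays a role loosely similar to a Lua `__index` metamethod. Everything here (`ALGORITHMS`, `HashModule`, `compute`) is hypothetical and is not the rewritten module's code.

```python
import hashlib
import zlib

def _crc32(data):
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")

# One registry entry per algorithm; adding a hash means adding one line.
ALGORITHMS = {
    "MD5": lambda data: hashlib.md5(data).hexdigest(),
    "SHA1": lambda data: hashlib.sha1(data).hexdigest(),
    "SHA256": lambda data: hashlib.sha256(data).hexdigest(),
    "SHA512": lambda data: hashlib.sha512(data).hexdigest(),
    "SHA3_256": lambda data: hashlib.sha3_256(data).hexdigest(),
    "CRC32": _crc32,
}

class HashModule:
    """Single generic entry point plus legacy-style per-algorithm names."""

    def compute(self, algorithm, data):
        return ALGORITHMS[algorithm](data)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so registry
        # names resolve to wrappers without one method per algorithm.
        if name in ALGORITHMS:
            return lambda data: self.compute(name, data)
        raise AttributeError(name)
```

With this arrangement, `HashModule().SHA256(data)` and `HashModule().compute("SHA256", data)` give the same result, and any helper written against `compute` picks up new registry entries automatically.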

asmagill commented 1 year ago

@cmsj, @latenitefilms

The function which grabs the file paths for generating a hash report seems more like it belongs in hs.fs since all it does is return a table of paths... I forget, are we trying to keep our changes in submodules of fs so it maintains some resemblance to the luafs package, or did we decide to just forgo that?

latenitefilms commented 1 year ago

Maybe you could keep the Objective-C hash code in hs.hash as a private function, that you then call in hs.fs in Lua?

asmagill commented 1 year ago

The hashing code is going to be in hs.hash -- but the support function that creates the list of files to be hashed seems more like it belongs in hs.fs .. generating the report isn't a single monolithic function, rather a set of functions you can script in the way you wish -- I will be providing samples when it's ready.

If we decide that for security/hardening purposes that we need something more opaque/hard to modify, that can be added later -- first we need the right functionality.

asmagill commented 1 year ago

Ok, for those interested in taking a look at it, I've just pushed my code for the new approach to hs.hash to https://github.com/asmagill/hammerspoon_asm/tree/master/hash as a backup.

It's currently undocumented, but implements all of the existing functions of hs.hash as well as adding CRC32 and SHA3_224, SHA3_256, SHA3_384, and SHA3_512. edit - fixed size

SHA3 was added as a proof of concept that a new hash can be easily added and it worked perfectly -- I copied 3 files from https://github.com/rhash/RHash, tweaked maybe 10 lines to clear warnings and errors, and then made 2 changes to libhash.m... in theory, adding any hash from https://github.com/rhash/RHash or https://github.com/krzyzanowskim/CryptoSwift should be similarly easy.

To do:

I suspect I can have this completed in the next couple of days, but figured I should show my progress since this has languished for so long.

asmagill commented 1 year ago

Should we include MD2, MD4, SHA224, SHA384, hmacSHA224, and hmacSHA384? Including them is trivial as the macOS API provides them.

Like MD5, they are marked deprecated in the macOS API because they have either been broken or techniques have been found that can significantly reduce their search space... but they do still crop up from time to time when working with older data or software. And MD5 is just too ubiquitous to leave out.

Technically all of the SHA-2 algorithms have the same weakness (even the SHA256 and SHA512 that we leave in) but they are still in wide use because of inertia and the fact that while the search space has been shown to be weakened, taking advantage of it isn't trivial... yet.

The hmac algorithms are considered a little better because they also require a shared secret, but NIST is recommending everyone move to SHA3 as soon as possible (in fact I'm a little surprised Apple doesn't support it natively yet) so...

I guess the question is: are we seeing this as a Swiss army knife that should provide as many tools as reasonably possible, or as a targeted tool that we want to narrow to known useful and future-facing tools?

latenitefilms commented 1 year ago

I reckon just keep the most commonly used ones for now. We can always add additional ones later if users request them.

asmagill commented 1 year ago

@latenitefilms Since you requested CRC32b, I wanted to make sure we're on the same page...

My research has led me to understand that CRC32b is often just referred to as CRC32 and is what the zlib library package found on practically all POSIX systems provides.

CRC32c is something different (well, basically the same algorithm but with a different generator polynomial, the Castagnoli one), and it is usually the variant with hardware acceleration on Intel processors. For your use case, just to be clear, we don't need to implement CRC32c, correct?
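The relationship between the two can be shown with a bit-at-a-time implementation in which only the polynomial differs. This Python sketch is for illustration only (real implementations use lookup tables or hardware instructions); the constants are the standard bit-reflected forms of the two polynomials, and `crc32_reflected` is a hypothetical name.

```python
def crc32_reflected(data, polynomial):
    """Bitwise reflected CRC-32 with initial value and final XOR of 0xFFFFFFFF."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; apply the polynomial if the dropped bit was 1.
            crc = (crc >> 1) ^ (polynomial if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

CRC32_POLY = 0xEDB88320   # the CRC32 computed by zlib
CRC32C_POLY = 0x82F63B78  # CRC32c (Castagnoli)

print(format(crc32_reflected(b"123456789", CRC32_POLY), "08x"))   # cbf43926
print(format(crc32_reflected(b"123456789", CRC32C_POLY), "08x"))  # e3069283
```

The two printed values are the standard check values for each variant, so the same input gives unrelated checksums and the two are not interchangeable.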

latenitefilms commented 1 year ago

To be honest, I can't really remember why I wrote CRC32b. I ASSUME it's because MD5, SHA1 and CRC32b checksums are commonly used for data wrangling apps like Hedge, ShotPut Pro, etc. So yes, I assume CRC32 is what I meant.