Open asmagill opened 3 years ago
Having the ability to define which file extensions should be ignored (or allowed) would be great. It would be awesome if you could tell it to only find `.lua` files, but on the flip side, find everything except `.DS_Store` files, etc.
Even if the main function just reports individual files in a table, it would still be handy to have a "helper" function which returns a single hash.
I trust whatever you decide regarding sym-links; however, I'd be tempted to just ignore them. If the user wants hashes of sym-links, they can always do this work themselves before hashing?
Eventually it would be great to also support MD5, SHA1 and CRC32b checksums.
This might be of interest?
Note to self: DOCUMENT LUASKIN!!!
Was planning to implement an object wrapper for this to minimize the transfer of the potentially huge report table for a hashed directory unless absolutely necessary, only to find that I had already done this for `hs.doc` in a generic way for re-use in situations like this and completely forgot about it... 🤦♂️
Isn't it documented here?
Incomplete and probably out of date. And we have some Lua-land helpers (well, one, but it's on my list to add a couple more) that aren't included.
Is that updated when @cmsj does release builds? Because if it's the original one I created as a POC with AppleDoc then it's definitely out of date.
And even if it's kept current, if it was made with AppleDoc it's probably got at least some formatting errors, if not outright omissions, because last I checked the app was no longer being actively maintained, and every time I run it locally I get about 50 warnings of things ignored or missing. It's been a low priority, but off and on I look for a good replacement for AppleDoc and haven't really found one.
It says it was updated Tuesday, September 29, 2020, so I'm ASSUMING it gets automatically updated when @cmsj pushes out a release?
@latenitefilms, for SHA1/224/256/384/512 and MD5, I can use the CommonCrypto library... is there something similar for CRC32b checksums?
For these additions, I'm thinking that rather than replicate the individual functions like we currently do in `hs.hash` for generating a hash of a simple string, there should be one function that takes the algorithm to use as an argument. Something like:
`hs.hash.forFile(hash, [key], path) -> string`

- `hash` is a string containing "MD5", "SHA1", etc.
- `key` is a string containing the secret key and is only required if `hash` is one of the HMAC versions (e.g. `hmacSHA1`, etc.)
- `path` is a string containing the path to the file

`hs.hash.forDirectory(hash, [key], path, [modifiers]) -> { { path = "...", hash = "..." }, ... }, fileCount, dirCount`

- `hash` and `key` are as described above
- `path` is a string containing the path to the directory
- `modifiers` is an optional table containing zero or more of the following key-value pairs (arbitrary unordered optional arguments is one area where I have to say Python is nicer than Lua...):
  - `subdirs` - boolean, default false, specifying whether subdirectories should be traversed as well
  - `ignore` - a table as an array of strings specifying patterns (probably regex) for files to skip (e.g. `{ ".DS_Store", "\.s?o$", etc. }`). Defaults to an empty table.
  - `allow` - a table as an array of strings specifying patterns (probably regex) for files to allow. Defaults to `{ ".*" }` (i.e. all files).
  - `truncatePathHead` - boolean, default false, specifying that the paths in the returned table should suppress the initial `path` specified in the argument (e.g. `init.lua` vs. `/Users/XXXX/.hammerspoon/init.lua`).

`hs.hash.hashDirectoryReport(hash, [key], table, [format]) -> string`

- `hash` and `key` are as described above, with one change: "clear" is also accepted for `hash`
- `table` is a table returned by `hs.hash.forDirectory`
- `format` is an optional string, default `"%h %p\n"`, specifying the format for each line in `table`

To keep things as speedy as possible, this allows hashing the report as described by @dge9 without first bringing the entire dataset into Lua. The "clear" value for `hash` will just return the report as a string; other hash values will return the hash of the report. The default `format` is chosen to match the output of `find <path> -type f -print | sort | xargs shasum -a512`
May also add `__call` to `hs.hash` so this same style can be used for existing functions, e.g. `hs.hash("MD5", string)` is equivalent to `hs.hash.MD5(string)`, `hs.hash("hmacSHA256", key, string)` is equivalent to `hs.hash.hmacSHA256(key, string)`, etc.
Unless there are objections, I think I'm going to ignore symlinks within a directory traversal... it opens up too many questions: what to do about broken links? What if multiple links point to the same directory -- do we include those same files multiple times? Should files in a symlinked dir be reported with the actual path or the one from the initial starting point? Plus, preventing loops requires a change in approach that slows down traversal.
Thoughts?
(For CRC32, see `zlib.h` at `/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/zlib.h`)
All sounds great to me! Can't wait to take `hs.hash.forDirectory()` for a spin!
@asmagill - Did you ever get `hs.hash.forDirectory()` up and running?
Actually no, this kind of got forgotten in the last couple of months.
What was the final behavior we decided on for directories? Hash individual files and return a table which could then be concatenated and hashed if desired, or iterate through in a repeatable manner and return one hash for the "whole contents" of the directory?
I'd prefer a single hash for the directory, so that I can use it as a check to see if the directory contents have changed.
See comment above (https://github.com/Hammerspoon/hammerspoon/issues/2544#issuecomment-702584248) -- I think this might be the most flexible... `hs.hash.hashDirectoryReport(hash, [key], hs.hash.forDirectory(hash, [key], path, [modifiers]))` would give you what you want while leaving it open to easily create the kind of reports you sometimes see on web sites where they list individual files and their specific hash values.
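Under that proposal, a single change-detection hash for a directory would be a short wrapper. This is a sketch using the proposed (not yet implemented) function names from above:

```lua
-- Sketch only: one hash for the whole directory, suitable for
-- "has anything changed?" checks. Names are from the proposal above.
local function directoryHash(path)
    local report = hs.hash.forDirectory("SHA256", path, { subdirs = true })
    return hs.hash.hashDirectoryReport("SHA256", report)
end

local before = directoryHash(os.getenv("HOME") .. "/.hammerspoon")
-- ... later ...
if directoryHash(os.getenv("HOME") .. "/.hammerspoon") ~= before then
    print("directory contents have changed")
end
```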
Objections?
Sounds great!
@asmagill - Reminder: I'd still love to see this happen if possible. Sorry - I know you're busy and Hammerspoon is lower on the priority list, but just pinging you in case you do have time to look at it again.
I'm going to have some free time in late May/early June, and I hope to revisit a number of half finished Hammerspoon enhancements I've worked on. I'll make sure to add this to the list.
FYI, revisiting this (and portions of #2514). New thoughts or additional ideas should be put here, and I'm going to try to have something ready for testing in the next week. I've got an idea that modularizes the actual hashing function, so adding a new hash type becomes available to all of the helper functions with just a few minor tweaks, rather than the current approach, which requires separate functions and code duplication. I've also started a rewrite of `hs.hash` (though maintaining backwards compatibility through Lua metamethod magic) that I hope to have available soon.
@cmsj, @latenitefilms
The function which grabs the file paths for generating a hash report seems more like it belongs in `hs.fs`, since all it does is return a table of paths... I forget, are we trying to keep our changes in submodules of `fs` so it maintains some resemblance to the luafilesystem package, or did we decide to just forgo that?
Maybe you could keep the Objective-C hash code in `hs.hash` as a private function, which you then call from `hs.fs` in Lua?
The hashing code is going to be in `hs.hash` -- but the support function that creates the list of files to be hashed seems more like it belongs in `hs.fs`... generating the report isn't a single monolithic function, rather a set of functions you can script in the way you wish -- I will be providing samples when it's ready.
If we decide that for security/hardening purposes that we need something more opaque/hard to modify, that can be added later -- first we need the right functionality.
Ok, for those interested in taking a look at it, I've just pushed my code for the new approach to `hs.hash` to https://github.com/asmagill/hammerspoon_asm/tree/master/hash as a backup.
It's currently undocumented, but implements all of the existing functions of `hs.hash`, as well as adding CRC32 and SHA3_224, SHA3_256, SHA3_384, and SHA3_512. edit - fixed size
SHA3 was added as a proof of concept that a new hash can be easily added, and it worked perfectly -- I copied 3 files from https://github.com/rhash/RHash, tweaked maybe 10 lines to clear warnings and errors, and then made 2 changes to `libhash.m`... in theory, adding any hash from https://github.com/rhash/RHash or https://github.com/krzyzanowskim/CryptoSwift should be similarly easy.
To do:

- `fileListForPath` to `hs.fs` (I really feel it belongs there, unless there is disagreement)

I suspect I can have this completed in the next couple of days, but figured I should show my progress since this has languished for so long.
Should we include MD2, MD4, SHA224, SHA384, hmacSHA224, and hmacSHA384? Including them is trivial as the macOS API provides them.
Like MD5, they are marked deprecated in the macOS API because they have either been broken or techniques have been found that can significantly reduce their search space... but they do still crop up from time to time when working with older data or software. And MD5 is just too ubiquitous to leave out.
Technically all of the SHA-2 algorithms have the same weakness (even the SHA256 and SHA512 that we leave in) but they are still in wide use because of inertia and the fact that while the search space has been shown to be weakened, taking advantage of it isn't trivial... yet.
The hmac algorithms are considered a little better because they also require a shared secret, but NIST is recommending everyone move to SHA3 as soon as possible (in fact I'm a little surprised Apple doesn't support it natively yet) so...
I guess the question is: are we seeing this as a Swiss Army knife that should provide as many tools as reasonably possible, or as a targeted tool that we want to narrow to known useful and future-facing tools?
I reckon just keep the most commonly used ones for now. We can always add additional ones later if users request them.
@latenitefilms Since you requested CRC32b, I wanted to make sure we're on the same page...
My research has led me to understand that CRC32b is often just referred to as CRC32 and is what the zlib library package found on practically all posix systems provides.
CRC32c is something different (well basically the same algorithm but with a different starting polynomial), usually optimized for Intel processors. For your use case, just to be clear, we don't need to implement CRC32c, correct?
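For reference, the difference is visible in the standard check values: the zlib-style CRC32 (reflected polynomial `0xEDB88320`, i.e. what's commonly labeled "CRC32b") produces `CBF43926` for the input "123456789", while CRC32C produces `E3069283`. A minimal bit-by-bit sketch of the zlib variant in plain Lua 5.4 (the real module would use zlib's optimized table-driven implementation, not this):

```lua
-- Bit-by-bit CRC32 (zlib / "CRC32b" variant, reflected polynomial 0xEDB88320).
-- Illustration only; zlib's table-driven crc32() is what would actually be used.
local function crc32(s)
    local crc = 0xFFFFFFFF
    for i = 1, #s do
        crc = crc ~ s:byte(i)
        for _ = 1, 8 do
            -- if the low bit is set, shift right and xor in the polynomial
            local mask = -(crc & 1)
            crc = (crc >> 1) ~ (0xEDB88320 & mask)
        end
    end
    return (~crc) & 0xFFFFFFFF
end

print(string.format("%08X", crc32("123456789")))  -- CBF43926, the CRC32 check value
```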
To be honest, I can't really remember why I wrote CRC32b. I ASSUME it's because MD5, SHA1 and CRC32b checksums are commonly used for data wrangling apps like Hedge, ShotPut Pro, etc. So yes, I assume CRC32 is what I meant.
Rather than fill up #2514, it has come time to move this aspect of the discussion to its own thread.
The proposal is to add functions to provide hashes for individual files and for directories and their contents.
I think I'm leaning in @dge9's direction that when traversing a directory, a report of the files with individual hashes should be returned. Outside of the uses being considered in the above-mentioned pull, the most common use I can think of would be comparing the hashes of files against those reported elsewhere (e.g. on a maintainer's web site), and those will generally list a hash per file rather than rely on the file contents being merged in just the correct way before generating the hash.
As to the question of `.DS_Store`, what makes the most sense to me is to add an argument which allows you to pass in a table of strings specifying files or subdirectories to ignore. I can imagine using this against, say, a thumb drive, to verify its contents haven't changed, and there are about 4 (5? it's been a while) separate files I generally create on these to minimize the crap that Finder and Spotlight put on them.

The question is symbolic links... in order to resolve them and make sure that pathological loops can't lock things up, I had to add code to track the directories already visited, which took just under 1.5 times the time that my original function did... both under a second, though, and that was traversing 515 directories and including 850 files.
Not sure what the changes to the report format may cause, because the report could theoretically be a pretty large string (or array of strings, not sure which yet) that has to be piped back into Lua, so this difference might be inconsequential... I'll know more tomorrow when I get a chance to work on the refactoring.
In the meantime, the main questions I have are: