dslittle22 / bitwrought

A file integrity checker.

Portability and file hashes #1

Open oelna opened 1 year ago

oelna commented 1 year ago

First of all, thank you so much for this! It's almost exactly what I have been looking for for years! (sadly, few people seem to care as much). I'm sure this is not a very high priority project for you, but in case you'd like some feedback, I'll leave it here.

For now, SHA-256 seems like a solid choice, but I guess supporting different types of hashes could be reasonable, e.g. if stored like $ALGO$HASH, so it's still just one attribute for storage, but easy enough to handle.

I'd also find it interesting to view the hashed data and timestamps from inside bitwrought. I guess this can be done using xattr, but I thought it made sense to also support viewing of the data written.

I don't know much about how extended attributes are handled in general, so forgive me if this is obvious, but: are these attributes preserved if I send the file somewhere, e.g. via email or just as a zip file? I'm a bit hesitant to use bitwrought when I don't know if the attributes will persist during backup and restore of files. I have used chkbit and liked it fine, but it stores invisible dot-files with every directory, and the xattr solution seemed way better! The dot-files are easy to back up and check manually, though. Sorry for rambling. I'm just excited.

Thanks again!

dslittle22 commented 1 year ago

Hello! I’m glad that you like the project, and that it’s something you’ve been looking for. I mostly expected it to just be something I worked on for fun without many (if any) users, but I’m glad to know that at least one person is interested!

Using different hashing algorithms is definitely something I thought about, but I just didn’t get around to putting it in there. You’re absolutely right that it would be easy to add in. For clarity, are you saying the user’s preference could be set as an environment variable and then the xattr path could be sort of namespaced to the algorithm? Or did you mean something else? That would definitely be a reasonable way to do this -- you'd specify a different algorithm in a flag once when running the program, then it will use that algorithm going forward -- but I’ll need to think about how to persist that data beyond just the terminal session… Let me know if you have other ideas in mind! :)

Viewing hashes and timestamps is a great idea. I’ll add that to the output when the verbose flag is used!

I'm not an extended attribute expert, but I've done some research and testing, and here’s how this seems to work: the xattr is file metadata, similar to a file's creation date, and so it behaves in the same way. It stays with the file if you move it around or compress it, and generally won’t be deleted unless you go in and do so manually. It is preserved if sending via email or other means as well, and the recipient of the file wouldn’t have to know or care that the xattr exists (unless they are also using bitwrought!). If you back up these files through Time Machine or something similar, they should also be preserved just fine. As an added bonus, macOS already does checksumming on file metadata itself, but not on user files. So, the extended attributes are inherently better protected against bit rot than your files.

Thanks for writing!

oelna commented 1 year ago

> For clarity, are you saying the user’s preference could be set as an environment variable and then the xattr path could be sort of namespaced to the algorithm? Or did you mean something else?

Not exactly. I don't know whether that's an issue at all, but it feels wrong to me that there would be multiple attributes with different hashes for one file. I mean, maybe it makes sense, but I thought something like a custom format could be useful for storing used algorithm and hash in the same string.

Like $sha256$cdc43c7e9089a41897b101de70f878bcc575c839f4ad057605a3335f6a601133 or $sha1$de75cd0b025509384608d186cbcceb2ae5061a79
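To make the combined-string idea concrete, here is a minimal sketch of writing and reading such a value. The helper names `encode_tagged` and `decode_tagged` are hypothetical and are not part of bitwrought; the only assumption is the `$algo$digest` layout shown above.

```python
import hashlib


def encode_tagged(algo: str, hex_digest: str) -> str:
    """Join an algorithm name and a hex digest as "$algo$digest" (hypothetical format)."""
    return f"${algo}${hex_digest}"


def decode_tagged(value: str) -> tuple[str, str]:
    """Split "$algo$digest" into (algo, digest); raise ValueError if malformed."""
    parts = value.split("$")
    # A well-formed value splits into ["", algo, digest].
    if len(parts) != 3 or parts[0] != "" or not parts[1] or not parts[2]:
        raise ValueError(f"not a tagged hash: {value!r}")
    return parts[1], parts[2]


# Round trip: tag a freshly computed digest, then read it back.
tagged = encode_tagged("sha256", hashlib.sha256(b"hello").hexdigest())
algo, digest = decode_tagged(tagged)
```

Because the algorithm travels with the digest, a reader never has to guess which hash function produced the stored value.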

Maybe it's just that I recently read about the WordPress approach to password storage.

It is a combined string that can include identifiers (of what's to come), salt, and the password hash. It is designed to allow for multiple hash types and backwards/forwards compatibility.

I found that intriguing, but I also see the value in a clear separation of concerns here, i.e. one value per field (or attribute). Another approach would be JSON, which I think is reasonably established. That could store algorithm and hash in a way that can be parsed by applications other than bitwrought itself, as a way of future-proofing.
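For comparison, a JSON-encoded attribute value might look like the sketch below. The field names `algorithm` and `hash` are assumptions chosen for illustration; nothing in this thread fixes a schema.

```python
import json


def encode_record(algo: str, hex_digest: str) -> str:
    """Serialize algorithm and digest as a small JSON object (hypothetical field names)."""
    return json.dumps({"algorithm": algo, "hash": hex_digest}, sort_keys=True)


def decode_record(text: str) -> tuple[str, str]:
    """Parse the JSON object back into (algorithm, digest)."""
    record = json.loads(text)
    return record["algorithm"], record["hash"]


value = encode_record("sha1", "de75cd0b025509384608d186cbcceb2ae5061a79")
```

Any tool with a JSON parser could read this value, at the cost of a few extra bytes per file compared with the `$algo$hash` string.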

I'm sorry, I'm just thinking out loud here :)

dslittle22 commented 1 year ago

> I'd also find it interesting to view the hashed data and timestamps from inside bitwrought. I guess this can be done using xattr, but I thought it made sense to also support viewing of the data written.

Sorry for not mentioning this before, but I had actually forgotten about it completely: running bitwrought -v will show you the file hashes and, in the newly published release, the modified timestamps as well!

> I thought something like a custom format could be useful for storing used algorithm and hash in the same string.

That makes sense! One question I'm working through is how all of this presents itself to the user.

Presumably there would be a default hashing algorithm, but you could specify a different one with an argument for any invocation of bitwrought. What should happen if you run bitwrought once with algorithm A, and again with algorithm B? Maybe you get a message that says something like "You previously ran this with A, would you like to 1) continue to use A, or 2) overwrite hash from algorithm A with a hash from algorithm B"?

Having a custom format like $sha1$de75c... would work just fine for this: any time a file's saved hash is compared to a calculated one, the algorithms could be compared and a message like this could be displayed. This could also mean that if you use bitwrought with algorithm B on a file, it could default to this algorithm in the future!
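The comparison described here could be sketched roughly as follows. This is an illustration under assumptions, not bitwrought's actual logic: the function name `check` and the three status strings are made up, and the stored value is assumed to use the `$algo$digest` layout.

```python
import hashlib


def check(stored: str, data: bytes, requested_algo: str) -> str:
    """Compare a stored "$algo$digest" value against data hashed on request.

    Returns "algorithm_changed" when the stored algorithm differs from the
    requested one (the case where the user would be prompted), otherwise
    "ok" or "mismatch" depending on whether the digests agree.
    """
    _, stored_algo, stored_digest = stored.split("$")
    if stored_algo != requested_algo:
        return "algorithm_changed"
    actual = hashlib.new(stored_algo, data).hexdigest()
    return "ok" if actual == stored_digest else "mismatch"


stored = "$sha256$" + hashlib.sha256(b"abc").hexdigest()
```

A caller would map "algorithm_changed" to the "continue with A or overwrite with B" prompt, and could fall back to the stored algorithm when none is requested.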

An alternative option that would work just as well would be saving a third xattr: the last used hashing algorithm for this file. This might honestly be simpler to implement, with no custom formats or string parsing.

Additionally, what are your thoughts on which algorithms to support? Maybe SHA-1 and BLAKE2, in addition to SHA-256?
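As a rough sketch of what a selectable algorithm could look like, the example below hashes a file in chunks with an algorithm chosen by name. It uses Python's standard `hashlib` purely for illustration; the `SUPPORTED` set and the `hash_file` helper are hypothetical, and `blake2b` is picked here as one BLAKE2 variant.

```python
import hashlib

# Hypothetical allow-list matching the algorithms floated in this thread.
SUPPORTED = {"sha1", "sha256", "blake2b"}


def hash_file(path: str, algo: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    if algo not in SUPPORTED:
        raise ValueError(f"unsupported algorithm: {algo}")
    hasher = hashlib.new(algo)
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Reading in chunks keeps memory use flat regardless of file size, which matters for the large media files an integrity checker is likely to see.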

oelna commented 1 year ago

Sorry for disappearing! What you're saying makes sense, and frankly, you probably have a better grasp on what the better way to store the data is. I don't have much experience using xattr, so I tend to lean in the direction of simple strings, but separation of concerns is a good argument in favor of multiple attributes.

I like the idea of SHA-1 support, for compatibility. And I recently read on Mastodon about the merits of BLAKE3 for file hashing, so it may be worth looking into. You write that bitwrought is currently single-threaded, but what does that mean when it comes to using algorithms that can be parallelized? Will that just work or do you have to radically change the application for it?
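One way to picture the question: a parallelizable algorithm such as BLAKE3 can split the work within a single file, whereas a single-threaded program can also gain from simply hashing several files at once. The sketch below shows only the second, simpler technique (file-level parallelism with SHA-256); it says nothing about how bitwrought is structured, and `sha256_file` and `hash_many` are hypothetical names.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash one file with SHA-256, reading it in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def hash_many(paths, workers: int = 4) -> dict:
    """Hash several files concurrently and return {path: hex digest}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its digest.
        return dict(zip(paths, pool.map(sha256_file, paths)))
```

Threads help here because `hashlib` releases Python's global interpreter lock while hashing buffers larger than about 2 KB, so several files can be digested at the same time.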

I'm kind of overwhelmed by the different levels of SHA-3, and I never know which to pick for a certain task. It probably makes sense to support SHA-3, but I would not dare recommend something here.
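For orientation, the four fixed-length SHA-3 variants differ mainly in output size. The short sketch below prints the digest length of each using Python's standard `hashlib`; it is only meant to show what the "levels" are, not to recommend one.

```python
import hashlib

VARIANTS = ("sha3_224", "sha3_256", "sha3_384", "sha3_512")

# Map each variant name to its digest size in bytes.
sizes = {name: hashlib.new(name).digest_size for name in VARIANTS}

for name, size in sizes.items():
    print(f"{name}: {size * 8}-bit digest ({size * 2} hex characters)")
```

SHA3-256 produces a digest of the same length as SHA-256, so it would fit wherever a SHA-256 hex string is stored today.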