hash binary files - Githubissues

slxcc commented 10 years ago

the message digest class can't process raw binary read via FileReader. An UTF-8 conversion is forced on append via this function (util.js:620) util.ByteStringBuffer.prototype.toString = function() { return util.decodeUtf8(this.bytes()); }; we noticed a lot of movement in this area in the last release (and a bunch of TODOs in the source), is there work in progress on this part ? (Our test used a jpeg image read via FileReader.readAsBinaryString(...))

If the Buffer part are in heavy refactoring we'd like to know because I also plan to subclass the output buffer to use with cipher to consume the output, and not keep all the results in memory at the end of an encrypt operation.

dlongley commented 10 years ago

An UTF-8 conversion is forced on append via this function (util.js:620) util.ByteStringBuffer.prototype.toString = function() { return util.decodeUtf8(this.bytes()); };

You can call util.ByteStringBuffer.prototype.bytes or util.ByteStringBuffer.prototype.getBytes to avoid the UTF-8 conversion.

we noticed a lot of movement in this area in the last release (and a bunch of TODOs in the source), is there work in progress on this part ?

Yes, there's an ongoing effort to define new APIs that are more clear about how/when binary or text encoding happens. There's also a newer buffer implementation that uses TypedArrays internally and makes it easier to work with them as inputs/outputs via the API. Work on this is sporadic at the moment, but we wanted to get it out there (in an experimental state) so that people could see that we will be supporting more HTML5 features, etc.

If the Buffer part are in heavy refactoring we'd like to know because I also plan to subclass the output buffer to use with cipher to consume the output, and not keep all the results in memory at the end of an encrypt operation.

Any ByteBuffer API changes will be backwards-compatible in the first iteration, but will add better support for TypedArrays and various encodings. Some APIs will be deprecated in the near future, but they won't be removed for some time. Any removal of old APIs via subsequent iterations will also result in a minor version bump (we're still not at 1.0, so it wouldn't be a major one).

I wouldn't expect a subclass that simply auto-consumes ciphered output to have to change all that much to accommodate new APIs. Most of the changes revolve around better naming and support for HTML5 features, not in how the classes function.

As a quick note: you already don't have to wait to read the ciphered data during encryption, if you don't want to. You can always call cipher.output.getBytes() -- at any point. You don't have to wait for cipher.finish() (or cipher.update for that matter, but you'd get back an empty result in that case). So it may be that you don't even need the subclass you're describing as the library already works the way it sounds like you want it to. The cipher API is intended to work in streaming scenarios w/low memory usage.

slxcc commented 10 years ago

ok but still, the md.update() does a _input.putBytes(msg); (for exemple in sha256.js:95)

and util.ByteStringBuffer.prototype.putBytes = function(bytes) { // TODO: update to match DataBuffer API this.data += bytes; return this; };

here is the implicit call to toString() from the "+=", so its breaks when i try to hash a binary file with the standard api. I don't see how this could work, even if I was to try to wrap the binary string issued from the FileReader into a ByteStringBuffer myself the concatenation would still trigger the autoboxing to String ?

As for your note, thanks, its very helpful. I'm not decided yet if I will keep the flow control in the loop that feed the cipher or delegate the writing to an output proxy that will trigger it automatically when its fills, but it is good to have options. Just to be clear : if a get the output during the cipher operation with getBytes(), I still need to call explicitly truncate() or equivalents in order to prevent the buffer to grow tot much ?

dlongley commented 10 years ago

The msg variable will not be converted to a string because it is already a string. The documentation should be more clear about this, certainly, but the expected input to the update call is a binary-encoded string. Forge uses binary-encoded strings (or ByteBuffers which are containers for binary-encoded strings that can grow in size/be manipulated with other API calls) for all of its binary-related work. A lack of better documentation here is an artifact stems from the fact that forge originally just used binary-encoded strings everywhere.

I don't see how this could work, even if I was to try to wrap the binary string issued from the FileReader into a ByteStringBuffer myself the concatenation would still trigger the autoboxing to String?

So, to be clear, there is no implicit call to toString() because update expects a string to begin with; the default encoding is 'raw' (which is the same as 'binary'), but you may pass a different encoding (such as 'utf8') as an optional second parameter.

You probably shouldn't be using FileReader.readAsBinaryString at all -- it's been removed from the FileReader API. Instead, use FileReader.readAsArrayBuffer. Then, convert the array buffer to a binary string manually:

FileReader.readAsArrayBuffer(blob);
...
var binaryStr = String.fromCharCode.apply(null, new Uint8Array(FileReader.result));
...
hash.update(binaryStr);

As I mentioned above, there is some ongoing work with encoding APIs, ByteBuffers, etc. to improve API consistency and clarity as well as to bring APIs more in line with what people expect due to their experience with node and newer HTML5 features. Once this work is complete, you'll be able to skip the conversion step in the first example above and just pass an ArrayBuffer or Uint8Array to the update() API.

Just as a note: if you are using a ByteBuffer (ByteStringBuffer), you do this to use update properly:

var buffer = new ByteBuffer();
...
hash.update(buffer.getBytes());

getBytes returns a binary-encoded string.

As for your note, thanks, its very helpful. I'm not decided yet if I will keep the flow control in the loop that feed the cipher or delegate the writing to an output proxy that will trigger it automatically when its fills, but it is good to have options. Just to be clear : if a get the output during the cipher operation with getBytes(), I still need to call explicitly truncate() or equivalents in order to prevent the buffer to grow tot much ?

Sure! If you call getBytes() it will empty the buffer. ByteBuffers are currently implemented as binary-encoded strings, as indicated above, so the actual result will be to completely release the implementation's underlying string for garbage collection. In other words, calling getBytes() will prevent the buffer from growing. There is no reason to call truncate or compact after a call to getBytes.

Just in case you actually do want to prevent the buffer from being emptied, you can call bytes(). This will just return a copy of the binary-encoded string, without modifying the buffer.

slxcc commented 10 years ago

Thanks for your responses, it works. We switched to readAsArrayBuffer() and used the conversion to feed the update. We has previouly an exception thrown from ByteStringBuffer.toString() that was traced back to the += but it may be a misinterpretation. I'll take moment later to try reproduce it with a basic, trimed test case.

gingnam2013 commented 6 years ago

How do we decrypt an encrypted message as encrypted message is not readable.

davidlehn commented 6 years ago

@ingnam2016: Can you rephrase the question? This issue has been closed for ~4 years, if this is related to another topic, please open a new issue. Thanks.

gingnam2013 commented 6 years ago

Thanks @davidlehn, for the help, my issue has been resolved.

digitalbazaar / forge

hash binary files #128