brandondahler / Data.HashFunction

C# library to create a common interface to non-cryptographic hash functions.
MIT License

Support for stream block processing #37

Open DoCode opened 6 years ago

DoCode commented 6 years ago

Provide support for stream block processing, like a default .NET HashAlgorithm:

var sourceStream = ... // From anywhere
var hashAlgorithm = ... // HashFunction

var bufferSize = 8192;
long blobLength = 0;
using (Stream stream = new MemoryStream())
{
    var buffer = new byte[bufferSize];
    int bytesRead;
    while ((bytesRead = sourceStream.Read(buffer, 0, buffer.Length)) > 0)
    {
        hashAlgorithm.TransformBlock(buffer, 0, bytesRead, null, 0);

        stream.Write(buffer, 0, bytesRead);

        blobLength += bytesRead;
    }

    hashAlgorithm.TransformFinalBlock(new byte[0], 0, 0);
}
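For reference, the same copy-while-hashing loop can be written against the BCL's `System.Security.Cryptography.IncrementalHash`, which avoids the `TransformBlock` / `TransformFinalBlock` ceremony entirely. A minimal sketch; SHA-256 here is just a stand-in for whatever hash function is in play:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class StreamHashing
{
    // Copies inputStream to outputStream in fixed-size chunks while
    // accumulating a hash over the same bytes, without re-buffering.
    public static byte[] HashWhileCopying(Stream inputStream, Stream outputStream)
    {
        using (var hash = IncrementalHash.CreateHash(HashAlgorithmName.SHA256))
        {
            var buffer = new byte[8192];
            int bytesRead;
            while ((bytesRead = inputStream.Read(buffer, 0, buffer.Length)) > 0)
            {
                hash.AppendData(buffer, 0, bytesRead);    // hash the chunk
                outputStream.Write(buffer, 0, bytesRead); // forward the chunk
            }

            return hash.GetHashAndReset();
        }
    }
}
```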
netclectic commented 6 years ago

+1 for something like this.

It would be nice to use an action, similar to what's already happening with the foreach methods in IUnifiedData, something like this...

            using (var outputStream = new MemoryStream())
            {
                hash = _hash.ComputeHash(inputStream, outputStream.Write);
            }
brandondahler commented 6 years ago

I'm considering that it might make sense to do something like:

IHashValue ComputeHash(Stream inputStream, Stream outputStream, CancellationToken cancellationToken);
Task&lt;IHashValue&gt; ComputeHashAsync(Stream inputStream, Stream outputStream, CancellationToken cancellationToken);
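A hypothetical sketch of the async variant of that shape (returning a `Task<>`, as async APIs do), with SHA-256 and `byte[]` standing in for the configured hash function and `IHashValue`:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Threading;
using System.Threading.Tasks;

public static class StreamToStreamHashing
{
    // Hypothetical implementation of the proposed API shape: read
    // inputStream once, hash each chunk, and forward it to outputStream,
    // honoring cancellation between chunks.
    public static async Task<byte[]> ComputeHashAsync(
        Stream inputStream, Stream outputStream, CancellationToken cancellationToken)
    {
        using (var hash = IncrementalHash.CreateHash(HashAlgorithmName.SHA256))
        {
            var buffer = new byte[8192];
            int bytesRead;
            while ((bytesRead = await inputStream.ReadAsync(buffer, 0, buffer.Length, cancellationToken)) > 0)
            {
                hash.AppendData(buffer, 0, bytesRead);
                await outputStream.WriteAsync(buffer, 0, bytesRead, cancellationToken);
            }

            return hash.GetHashAndReset();
        }
    }
}
```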
netclectic commented 6 years ago

Yep, perfect. I tried out the action method with the xxHash function I've been using and managed to make it work, but having input/output streams would make more sense.

netclectic commented 6 years ago

I had a look through your WIP work; any reason why you didn't add an output stream to the byte array methods?

I made a fork and implemented it on those methods to do some testing with. I can make a PR if you're interested. https://github.com/netclectic/Data.HashFunction/commit/c16e7794d719a55c804a1f3369299043f59c2253

brandondahler commented 5 years ago

I recognize it's been over a year, but I'm now taking another look at this.

Use cases to be solved for

Read + calculate hash value

Have a stream of some unknown (possibly large) size, for instance from the network or file system. With that stream you want to a) calculate the hash value of the data and b) do some other processing on the same chunks of data, all without reading more than necessary into memory or re-buffering the data.

Write + calculate hash value

Have a stream of some unknown (possibly large) size, for instance from the network or file system. With that stream you want to a) calculate the hash value of the data and b) stream that data to some other endpoint, all without reading more than necessary into memory or re-buffering the data.
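One pattern that covers both use cases without a second stream parameter is a pass-through wrapper in the style of `CryptoStream`: the consumer reads from the wrapper exactly as it would the underlying stream, and the hash accumulates as a side effect. A minimal sketch (not the library's API), with SHA-256 standing in for an xxHash-style function and only the read path implemented:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public sealed class HashingReadStream : Stream
{
    private readonly Stream _inner;
    private readonly IncrementalHash _hash;

    public HashingReadStream(Stream inner)
    {
        _inner = inner;
        _hash = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int bytesRead = _inner.Read(buffer, offset, count);
        if (bytesRead > 0)
            _hash.AppendData(buffer, offset, bytesRead); // hash as data flows past
        return bytesRead;
    }

    public byte[] GetHashAndReset() => _hash.GetHashAndReset();

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) _hash.Dispose();
        base.Dispose(disposing);
    }
}
```

A write-side twin (hashing in `Write` before forwarding to the inner stream) would cover the second use case symmetrically.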

Current WIP solution

Being a year later, I'm not sure I actually like my idea of having input/output streams. From a usability standpoint it is awkward and error-prone: streams do not behave strictly like pipes or buffers; they have only a single read/write head, so having something simultaneously reading and writing the same stream doesn't make sense.

In the input/output streams case, we solve for the "Write + calculate hash value" use case, but we do not effectively solve for the "Read + calculate hash value" use case.

Thoughts on better solution

I think a better path would be to have underlying support for the type of TransformBlock / FinalizeBlock API which can be used by end consumers, while maintaining our current ComputeHash functionality as well.
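A hypothetical sketch of what such a block-transform surface could look like; the names below are illustrative, not the library's actual API, and SHA-256 stands in for a Data.HashFunction implementation:

```csharp
using System;
using System.Security.Cryptography;

// Illustrative shape only: feed data incrementally, then finalize.
public interface IBlockTransformer
{
    void TransformBlock(byte[] data, int offset, int count);
    byte[] FinalizeBlock();
}

public sealed class Sha256BlockTransformer : IBlockTransformer, IDisposable
{
    private readonly IncrementalHash _hash =
        IncrementalHash.CreateHash(HashAlgorithmName.SHA256);

    public void TransformBlock(byte[] data, int offset, int count) =>
        _hash.AppendData(data, offset, count);

    public byte[] FinalizeBlock() => _hash.GetHashAndReset();

    public void Dispose() => _hash.Dispose();
}
```

A one-shot `ComputeHash` can then remain as a convenience wrapper that calls `TransformBlock` once and finalizes.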

Since I will be doing #46 as well as a v3.0, I plan on punting this change to that milestone and making this change dependent on that issue.