dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

HTTP2: Huffman decoding implementation is inefficient #1506

Open geoffkizer opened 6 years ago

geoffkizer commented 6 years ago

We are currently implementing Huffman decoding by looping over a table with a list of encoded values for each length. (See Huffman.cs)

This doesn't seem particularly efficient.

Instead, we should implement Huffman decoding by doing table lookups, using keys of 8 bits or something along those lines.

Note that Huffman encoding from servers seems to be pretty common, so this may have real-world perf impact. (Or maybe not... as always, measuring is the best way to know.)

jbhensley commented 6 years ago

What were your thoughts regarding how this might work? I took a look at the code, but it wasn't immediately apparent that the current implementation could be improved much. Here are my observations:

The only thing I saw in the code that I thought could be improved was the repeated calculation of codeMax, even though that value is essentially a constant for each entry in the decoding table. Expanding the table to a 3-tuple that includes codeMax would avoid that, but the cost there is likely insignificant.

Mind you, I'm new to the code so I could be way off base here. Looking to see if you could clarify this a bit.

danmoseley commented 6 years ago

@geoffkizer thoughts?

4creators commented 6 years ago

IMHO one of the best implementations of a Huffman en/decoder is available here: https://github.com/Cyan4973/FiniteStateEntropy

It's written in C, but it's one of the best examples of a Huffman implementation. It ships in a library together with a Finite State Entropy en/decoder built on Asymmetric Numeral Systems (Huffman coding is a special case of ANS). ANS is a separate story; I plan to submit an API proposal to include zstd compression in .NET Core and will write more then.

geoffkizer commented 6 years ago

> The other hash table option would be to generate a massive "rainbow table"-like structure where all combinations of bits beyond the significant ones, up to your key length, are also hashed.

It doesn't need to be a single, massive hash table. It can be a multi-level table. Take the first 8 bits (or 10, or whatever) and use it as the key to the first-level table. This results in either a decoded value and a bit length, or a reference to a second-level table -- if so, take the next 8 bits and index into the second table, etc.

Choosing the best key size for each table involves a tradeoff -- smaller key sizes mean smaller tables but more lookups, and vice-versa. You can tune this as appropriate. With 8 bits, your main table is only 512 bytes (256 entries, 1 byte for decoded value, 1 byte for either bit length or next table id), and the most common values (including all alphanumerics) will be decoded in a single table lookup.
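
To make the shape concrete, here is a rough sketch of that multi-level lookup; the entry layout, names, and table builder are illustrative assumptions, not the actual corefx code:

struct TableEntry
{
    public byte DecodedValue;   // valid when BitLength > 0
    public byte BitLength;      // 0 means "descend into the table indexed by NextTable"
    public byte NextTable;
}

static readonly TableEntry[][] s_tables = BuildTables();    // built once from the HPACK code table

static int Decode(uint data, out int decodedBits)
{
    int table = 0;              // first-level table
    int consumed = 0;
    while (true)
    {
        // key on the top 8 bits of the remaining input
        TableEntry entry = s_tables[table][(int)(data >> 24)];
        if (entry.BitLength > 0)
        {
            decodedBits = consumed + entry.BitLength;
            return entry.DecodedValue;      // common symbols resolve in a single lookup
        }
        table = entry.NextTable;            // longer code: consume 8 bits and go one level down
        data <<= 8;
        consumed += 8;
    }
}

static TableEntry[][] BuildTables() => throw new NotImplementedException();     // construction omitted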

jbhensley commented 6 years ago

I explored that approach briefly as well, but it just seemed like we were back to looping again. A mock up to see if there is any real perf differential would be the way to go here. I might be able to take a crack at that next week.

geoffkizer commented 6 years ago

There's no looping if the value is encoded in 8 bits or less. (Or whatever key size you choose.) And this covers the most common values, including all alphanumerics and common punctuation.

jbhensley commented 6 years ago

Agreed, encoded values up to the chosen key size would be found immediately. Couple more questions:

geoffkizer commented 6 years ago

> Is there a compiled list of common request headers you guys have that can be used for perf testing? If not, I can put together my own.

Not really.

I think a collection initializer is fine. If necessary, it's easy to switch this to a Lazy<T> later.

grant-d commented 6 years ago

The codeMax optimization that @jbhensley suggested has an immediate ~10% benefit in micro-benchmark testing:

 Method |     Mean |    Error |   StdDev | Scaled | ScaledSD |
------- |---------:|---------:|---------:|-------:|---------:|
    New | 86.32 ns | 1.864 ns | 1.915 ns |   0.89 |     0.05 |
   Core | 97.49 ns | 1.950 ns | 5.069 ns |   1.00 |     0.00 |

Here's the core change; note the additional, manually computed deltaLength member:

private static readonly (int codeLength, int deltaLength, int[] codes)[] s_decodingTable = new[]
{
    (05, 0, new[] { 48, 49, 50, 97, 99, 101, 105, 111, 115, 116 }), // deltaLength must be 0 for related optimizations to work
    (06, 1, new[] { 32, 37, 45, 46, 47, 51, 52, 53, 54, 55, 56, 57, 61, 65, 95, 98, 100, 102, 103, 104, 108, 109, 110, 112, 114, 117 }),
    (07, 1, new[] { 58, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 89, 106, 107, 113, 118, 119, 120, 121, 122 }),
    (08, 1, new[] { 38, 42, 44, 59, 88, 90 }),
    (10, 2, new[] { 33, 34, 40, 41, 63 }),
    (11, 1, new[] { 39, 43, 124 }),
    (12, 1, new[] { 35, 62 }),
    (13, 1, new[] { 0, 36, 64, 91, 93, 126 }),
    (14, 1, new[] { 94, 125 }),
    (15, 1, new[] { 60, 96, 123 }),
    (19, 4, new[] { 92, 195, 208 }),
    (20, 1, new[] { 128, 130, 131, 162, 184, 194, 224, 226 }),
    (21, 1, new[] { 153, 161, 167, 172, 176, 177, 179, 209, 216, 217, 227, 229, 230 }),
    (22, 1, new[] { 129, 132, 133, 134, 136, 146, 154, 156, 160, 163, 164, 169, 170, 173, 178, 181, 185, 186, 187, 189, 190, 196, 198, 228, 232, 233 }),
    (23, 1, new[] { 1, 135, 137, 138, 139, 140, 141, 143, 147, 149, 150, 151, 152, 155, 157, 158, 165, 166, 168, 174, 175, 180, 182, 183, 188, 191, 197, 231, 239 }),
    (24, 1, new[] { 9, 142, 144, 145, 148, 159, 171, 206, 215, 225, 236, 237 }),
    (25, 1, new[] { 199, 207, 234, 235 }),
    (26, 1, new[] { 192, 193, 200, 201, 202, 205, 210, 213, 218, 219, 238, 240, 242, 243, 255 }),
    (27, 1, new[] { 203, 204, 211, 212, 214, 221, 222, 223, 241, 244, 245, 246, 247, 248, 250, 251, 252, 253, 254 }),
    (28, 1, new[] { 2, 3, 4, 5, 6, 7, 8, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 127, 220, 249 }),
    (30, 2, new[] { 10, 13, 22, 256 })
};
geoffkizer commented 6 years ago

@grant-d What code are you using to test this?

grant-d commented 6 years ago

Code optimization, without changing the algorithm, gets us to ~23% improvement in the micro-bench. (This needs real-world testing to get a measure of any true improvement)

    Method |     Mean |     Error |    StdDev | Scaled | ScaledSD | Allocated |
---------- |---------:|----------:|----------:|-------:|---------:|----------:|
 Optimized | 71.61 ns | 0.8184 ns | 0.7656 ns |   0.77 |     0.02 |       0 B |
    Legacy | 93.17 ns | 1.7961 ns | 2.3977 ns |   1.00 |     0.00 |       0 B |

For example, changed this:

            for (var i = 0; i < s_decodingTable.Length && s_decodingTable[i].codeLength <= validBits; i++)
            {
                (var codeLength, var codes) = s_decodingTable[i];

                if (i > 0)
                {
                    codeMax <<= codeLength - s_decodingTable[i - 1].codeLength;
                }

                codeMax += codes.Length;

To this:

            for (var i = 0; i < s_decodingTable.Length; i++)
            {
                // deltaLength is precomputed codeMax delta
                var (codeLength, deltaLength, codes) = s_decodingTable[i];

                // Move check out of for-loop to leverage cached value for codeLength
                if (codeLength > validBits)
                    break;

                // Mitigate the if (i > 0) branch by ensuring s_decodingTable[0].deltaLength==0
                codeMax = (codeMax << deltaLength) + codes.Length;
grant-d commented 6 years ago

@geoffkizer see HuffmanBench.cs and related units in this PR

grant-d commented 6 years ago

Another example. This:

                var next = (uint)(src[i] << 24 + lastDecodedBits);
                next |= (i + 1 < src.Length ? (uint)(src[i + 1] << 16 + lastDecodedBits) : 0);
                next |= (i + 2 < src.Length ? (uint)(src[i + 2] << 8 + lastDecodedBits) : 0);
                next |= (i + 3 < src.Length ? (uint)(src[i + 3] << lastDecodedBits) : 0);

Can be optimized as follows, since if the first src.Length branch is false, then there's no way the 2nd & 3rd branches can be true:

                var next = (uint)(src[i] << 24 + lastDecodedBits);
                if (i + 1 < src.Length)
                {
                    next |= (uint)(src[i + 1] << 16 + lastDecodedBits);

                    if (i + 2 < src.Length)
                    {
                        next |= (uint)(src[i + 2] << 8 + lastDecodedBits);

                        if (i + 3 < src.Length)
                            next |= (uint)(src[i + 3] << lastDecodedBits);
                    }
                }
jbhensley commented 6 years ago

I have code for using the multi-level table and am currently working on cleaning it up a bit and testing. Decoding becomes simple:

public static int Decode(uint data, int validBits, out int decodedBits)
{
    int byteNumber = 3;                      // grab the most significant byte
    uint virtualDictionaryMask = 0;          // used to key into different "virtual" dictionaries
    DecodingTableEntry entry;
    do
    {
        // extract the working byte
        uint workingByte = data >> (byteNumber * 8) & 0xFF;

        // apply virtual dictionary bitmask
        workingByte |= virtualDictionaryMask;

        // key into the dictionary
        if (!s_decodingDictionary.TryGetValue(workingByte, out entry))
            break;  // we should either get an entry or a bitmask for the next virtual dictionary;
                    // if we get neither, the bit pattern is not consistent with any entry we have

        // if we get a length back then we have found the decoded value
        if (entry.BitLength > 0)
        {
            if (entry.BitLength > validBits)
                break;  // we only found a value by incorporating bits beyond the valid remaining length of the data stream

            decodedBits = entry.BitLength;
            return (int)entry.DecodedValue;
        }
        // otherwise, we have found a mask that lets us key into the next virtual dictionary
        else
        {
            virtualDictionaryMask = entry.DecodedValue;
            byteNumber--;
        }

    } while (entry.BitLength == 0);

    // no luck. signal to caller that we could not decode
    decodedBits = 0;
    return -1;
}

Building the table is a bit ugly and will likely need to be done with a Lazy<T> that calls a method to construct it since the total size is 3840 entries. We could still use a collection initializer and surround it with a #region I suppose. Either way, I'm a little concerned about the memory footprint.
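
For reference, a minimal sketch of that Lazy<T> shape (BuildDecodingDictionary here is a hypothetical builder, not existing code):

private static readonly Lazy<Dictionary<uint, DecodingTableEntry>> s_lazyDecodingDictionary =
    new Lazy<Dictionary<uint, DecodingTableEntry>>(BuildDecodingDictionary);

private static Dictionary<uint, DecodingTableEntry> BuildDecodingDictionary()
{
    var dictionary = new Dictionary<uint, DecodingTableEntry>(capacity: 3840);
    // populate all real and "virtual" keys from the static HPACK Huffman code table (omitted)
    return dictionary;
}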

Since there would be 15 tables in total, I took the approach of using bitmasks on the key value of a single table to create several "virtual" ones within it. The "root" table has no mask, so you initially key in with the value you want to decode. If you get a bit length back, then you have the decoded value. Otherwise, you have a bitmask to apply to your next byte when keying in again. The bitmask represents the next "virtual" table.

geoffkizer commented 6 years ago

You have more total entries (3840) than necessary.

The fourth-level tables only need 6 bits to index them (max encoded length is 30 bits). That by itself cuts the total number of entries in half, if my math is correct.

Additionally, at least some intermediate tables don't need all 8 bits either. For example, all encodings that start with 11111110 are exactly 10 bits in length, so the second level table here can be exactly 4 entries instead of the 256 you'd need if you indexed with a full 8 bits.

Dealing with all this will complicate the code and isn't necessary to get decent perf results.

jbhensley commented 6 years ago

True, the storage could be optimized. Right now, however, the perf results are not looking very promising. Oddly enough, my test with a pre-computed codeMax is worse, which I cannot explain.

Test set consists of 792 lines of actual HTTP request header data. The file is read and Huffman encoded into bytes as part of the setup, and then each decoding method gets its own crack at it, while verifying that correct values are being decoded. I kind of expected something more dramatic than this.

|                                   Method |          Mean |         Error |        StdDev |   Scaled | ScaledSD |   Gen 0 |   Gen 1 | Allocated |
|----------------------------------------- |--------------:|--------------:|--------------:|---------:|---------:|--------:|--------:|----------:|
|                   BuildHuffmanDictionary | 112,665.66 ns | 2,241.0968 ns | 2,096.3232 ns | 6,667.28 |   119.88 | 31.8604 | 10.6201 |  201624 B |
|                   ProcessHeadersWithLoop |      16.90 ns |     0.0089 ns |     0.0074 ns |     1.00 |     0.00 |       - |       - |       0 B |
| ProcessHeadersWithLoopAndComputedCodeMax |      17.39 ns |     0.3766 ns |     0.5637 ns |     1.03 |     0.03 |       - |       - |       0 B |
|             ProcessHeadersWithDictionary |      16.88 ns |     0.3689 ns |     0.3451 ns |     1.00 |     0.02 |       - |       - |       0 B |

For reference, my version of the pre-computed codeMax treats it as a complete constant:

public static readonly (int codeLength, int codeMax, int[] codes)[] s_decodingTable2 = new[]
        {
            (5, 10, new[] { 48, 49, 50, 97, 99, 101, 105, 111, 115, 116 }),
            (6, 46, new[] { 32, 37, 45, 46, 47, 51, 52, 53, 54, 55, 56, 57, 61, 65, 95, 98, 100, 102, 103, 104, 108, 109, 110, 112, 114, 117 }),
            (7, 124, new[] { 58, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 89, 106, 107, 113, 118, 119, 120, 121, 122 }),
            (8, 254,  new[] { 38, 42, 44, 59, 88, 90 }),
            (10, 1021,  new[] { 33, 34, 40, 41, 63 }),
            (11, 2045, new[] { 39, 43, 124 }),
            (12, 4092, new[] { 35, 62 }),
            (13, 8190, new[] { 0, 36, 64, 91, 93, 126 }),
            (14, 16382, new[] { 94, 125 }),
            (15, 32767, new[] { 60, 96, 123 }),
            (19, 524275, new[] { 92, 195, 208 }),
            (20, 1048558, new[] { 128, 130, 131, 162, 184, 194, 224, 226 }),
            (21, 2097129, new[] { 153, 161, 167, 172, 176, 177, 179, 209, 216, 217, 227, 229, 230 }),
            (22, 4194284, new[] { 129, 132, 133, 134, 136, 146, 154, 156, 160, 163, 164, 169, 170, 173, 178, 181, 185, 186, 187, 189, 190, 196, 198, 228, 232, 233 }),
            (23, 8388597, new[] { 1, 135, 137, 138, 139, 140, 141, 143, 147, 149, 150, 151, 152, 155, 157, 158, 165, 166, 168, 174, 175, 180, 182, 183, 188, 191, 197, 231, 239 }),
            (24, 16777206, new[] { 9, 142, 144, 145, 148, 159, 171, 206, 215, 225, 236, 237 }),
            (25, 33554416, new[] { 199, 207, 234, 235 }),
            (26, 67108847, new[] { 192, 193, 200, 201, 202, 205, 210, 213, 218, 219, 238, 240, 242, 243, 255 }),
            (27, 134217713, new[] { 203, 204, 211, 212, 214, 221, 222, 223, 241, 244, 245, 246, 247, 248, 250, 251, 252, 253, 254 }),
            (28, 268435455, new[] { 2, 3, 4, 5, 6, 7, 8, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 127, 220, 249 }),
            (30, 1073741824, new[] { 10, 13, 22, 256 })
        };

And the loop decoder:

public static int DecodeWithLoopAndCodeMax(uint data, int validBits, out int decodedBits)
{
    for (int i = 0; i < s_decodingTable2.Length && s_decodingTable2[i].codeLength <= validBits; i++)
    {
        (int codeLength, int codeMax, int[] codes) = s_decodingTable2[i];

        int mask = int.MinValue >> (codeLength - 1);
        long masked = (data & mask) >> (32 - codeLength);

        if (masked < codeMax)
        {
            decodedBits = codeLength;
            return codes[codes.Length - (codeMax - masked)];
        }
    }

    decodedBits = 0;
    return -1;
}
geoffkizer commented 6 years ago

Can you link to the full code?

For one thing, you really should run more iterations. All your measured times are < 20ns, which is super small. I'd run enough iterations to have times above 1s, if possible.

jbhensley commented 6 years ago

I'll get the test project up and link, but that will probably be tomorrow.

grant-d commented 6 years ago

Good idea @jbhensley; amongst other tricks, that codeMax calc got it ~35% faster than the original. I am also interested in your test cases; mine are trivial. Does ProcessHeadersWithLoop in your tests reflect the original, unchanged code?

                Method |     Mean |     Error |   StdDev | Scaled | ScaledSD | Allocated |
---------------------- |---------:|----------:|---------:|-------:|---------:|----------:|
 OptimizedCodeAndTable | 49.29 ns | 0.9742 ns | 1.628 ns |   0.65 |     0.02 |       0 B |
          OriginalCode | 75.45 ns | 1.1256 ns | 1.053 ns |   1.00 |     0.00 |       0 B |
jbhensley commented 6 years ago

Test project is up

I'm new to corefx, so I'm struggling a little with the benchmarking. I'll take any pointers you have regarding that or the code in general.

jbhensley commented 6 years ago

Also, I trimmed duplicate line items out of my headers.txt file so it's a bit smaller now.

grant-d commented 6 years ago

@jbhensley I have created a fork of your project, added my code to it, then cleaned it all up a bit. I have changed the layout of your benchmark, since the enum approach did not allow you to optimize the outer loop. So each algorithm is just a duplicated Huffman<foo>.cs file in its entirety now. I have also added your headers.txt data to all the benchmarks.

Here's the latest results (see the README for details on algorithms).

You are right that the IDictionary approach is slower; I think it's probably due to the overhead of hashing and branching. Perhaps a hard-coded switch for a small subset of high-frequency values might work? (A rough sketch follows the results table.)

  Method |       Mean |    Error |   StdDev | Scaled | ScaledSD | Allocated |
-------- |-----------:|---------:|---------:|-------:|---------:|----------:|
 DictOpt | 1,281.3 ns | 24.27 ns | 22.70 ns |   1.41 |     0.03 |       0 B |
    Dict | 1,311.6 ns | 24.41 ns | 22.83 ns |   1.45 |     0.03 |       0 B |
 OrigOpt |   822.8 ns | 16.08 ns | 15.79 ns |   0.91 |     0.02 |       0 B |
    Orig |   907.3 ns | 11.78 ns | 10.44 ns |   1.00 |     0.00 |       0 B |
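
A rough sketch of what that hard-coded switch could look like for the ten 5-bit codes (values per RFC 7541 Appendix B; this is illustrative, not benchmarked code), with everything else falling through to the general decoder:

// caller must guarantee at least 5 valid bits in 'data'
static bool TryDecodeFiveBitSymbol(uint data, out int value)
{
    switch (data >> 27)     // top 5 bits
    {
        case 0b00000: value = '0'; return true;
        case 0b00001: value = '1'; return true;
        case 0b00010: value = '2'; return true;
        case 0b00011: value = 'a'; return true;
        case 0b00100: value = 'c'; return true;
        case 0b00101: value = 'e'; return true;
        case 0b00110: value = 'i'; return true;
        case 0b00111: value = 'o'; return true;
        case 0b01000: value = 's'; return true;
        case 0b01001: value = 't'; return true;
        default:      value = -1;  return false;   // not a 5-bit code; use the full decoder
    }
}
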
jbhensley commented 6 years ago

Thanks. I'll take a look at what you've done with this.

My benchmarking probably wasn't as valid, but I was pretty much coming up with the same result. The looping implementation (either original or with codeMax pre-calculated) is able to return the top 10 most frequently used values on the first pass through the loop with only a few bit shifts and numeric comparisons, and the second pass gets you 26 more. I'm betting that hashing and keying can't keep up with how quickly the loop returns the first several levels of most frequently used values. Maybe the dictionary is faster for the less frequently used values, where the loop has to make more iterations, but it doesn't gain enough on those to make up for what it loses on the others.

geoffkizer commented 6 years ago

Thanks for posting the code.

Don't use a dictionary for the table lookup. Just use an array. It's much cheaper.

jbhensley commented 6 years ago

I think that might be the ticket. I'll give it a shot tomorrow. Should simplify the code a bit as well.

geoffkizer commented 6 years ago

Also -- In case you're not already doing this, you should do a warmup run before the real run, in the same process. Otherwise you may be including things like JIT costs in the run.

And I still think you should do many more iterations. The times above are in the 1ms range. Do 1000x more iterations and get it in the 1s range.

geoffkizer commented 6 years ago

Or maybe you are already doing lots of iterations? I'm not familiar with the benchmark framework here, so not quite sure how it's executing the code.

jbhensley commented 6 years ago

Looks like the framework does all of the warmup for you. I tried giving it a large iteration count and it appears to be running each benchmark method more times, but the average runtime of any given iteration is still small. I could probably bring the numbers up if I added iterations in the method itself with a for loop instead of telling the framework to run it more.

https://github.com/dotnet/BenchmarkDotNet
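
If it helps, one way to raise per-invocation time with BenchmarkDotNet is an explicit inner loop plus OperationsPerInvoke, so the reported per-operation numbers stay comparable (DecodeAllHeaders below is just a stand-in for the real decode work):

using BenchmarkDotNet.Attributes;

public class HuffmanDecodeBenchmark
{
    private const int InnerIterations = 10_000;

    [Benchmark(OperationsPerInvoke = InnerIterations)]
    public void DecodeHeaders()
    {
        for (int i = 0; i < InnerIterations; i++)
        {
            DecodeAllHeaders();     // hypothetical: decode the headers.txt corpus once
        }
    }

    private void DecodeAllHeaders() { /* ... */ }
}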

geoffkizer commented 6 years ago

Unrelated thought:

The fact that the per-symbol Decode is a separate method isn't ideal. It's only called in one place, and it's probably large enough that the JIT is not inlining it. MethodImplOptions.AggressiveInlining might help here. Eventually we should probably just fold this into the main Decode routine.
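
Something like the following; it's only a hint to the JIT (it can still decline), so a profile is the real test:

using System.Runtime.CompilerServices;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int Decode(uint data, int validBits, out int decodedBits)
{
    // existing per-symbol decode body unchanged (elided)
    decodedBits = 0;
    return -1;
}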

geoffkizer commented 6 years ago

Re this code that @grant-d mentioned above:

                var next = (uint)(src[i] << 24 + lastDecodedBits);
                next |= (i + 1 < src.Length ? (uint)(src[i + 1] << 16 + lastDecodedBits) : 0);
                next |= (i + 2 < src.Length ? (uint)(src[i + 2] << 8 + lastDecodedBits) : 0);
                next |= (i + 3 < src.Length ? (uint)(src[i + 3] << lastDecodedBits) : 0);

I think we can improve this. We are re-reading all 4 bytes from the source buffer each time. We shouldn't need to do this. We should just shift the current value (in next) by the number of decoded bits, and then read byte by byte from the source buffer into next until we've got at least 30 bits.

So initially, we would read the first 4 bytes into next and set validBits = 32. Then at the end of each loop iteration, we shift next by decodedBits, subtract decodedBits from validBits, and read from the source buffer until validBits is >= 30.

This gets tricky because the smallest encoded value is 5 bits, so it's possible that decodedBits is 5 and validBits is 32, meaning validBits becomes 27 and we need to read another byte to get it over 30. Unfortunately this means we need 33 bits. Changing next from a uint to a ulong is a quick fix for this, and shouldn't actually matter in practice on a 64-bit machine.

In a table-based lookup scheme, we don't actually need validBits to be >= 30; we just need it to be >= whatever the next table lookup size is (commonly 8). So that would simplify this as well.
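
A rough sketch of that refill loop, assuming the per-symbol Decode(uint, int, out int) above; names are illustrative and padding validation is omitted:

static int DecodeBuffered(byte[] src, byte[] dst)
{
    ulong next = 0;         // undecoded bits live in the top bits of a 64-bit value
    int validBits = 0;
    int srcIndex = 0;
    int dstIndex = 0;

    while (true)
    {
        // refill until we have at least 30 valid bits or the source is exhausted
        while (validBits < 30 && srcIndex < src.Length)
        {
            next |= (ulong)src[srcIndex++] << (56 - validBits);
            validBits += 8;
        }

        if (validBits == 0)
            break;          // everything consumed

        int ch = Decode((uint)(next >> 32), Math.Min(validBits, 32), out int decodedBits);
        if (ch == -1)
            break;          // remaining bits are padding (validation omitted) or invalid input

        dst[dstIndex++] = (byte)ch;

        // consume the decoded bits; the next refill tops us back up
        next <<= decodedBits;
        validBits -= decodedBits;
    }

    return dstIndex;
}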

geoffkizer commented 6 years ago

Actually, looking at this code, I don't quite understand how it works in certain cases.

Imagine that lastDecodedBits is 7 after the last loop iteration. Then we are only actually going to read (32-7)=25 bits from the source buffer. If the next encoded value has more than 25 bits, it seems like this logic is broken. Am I missing something here?

geoffkizer commented 6 years ago

This points out that we need much better tests for this code :)

We actually don't currently have any tests. At a minimum we should steal the ASP.NET tests. And probably add a bunch of new cases as well. This is all pretty tricky logic and we need thorough tests to ensure we're doing the right thing here.

We'll need these tests before we can check in any perf improvements here.

grant-d commented 6 years ago

On my fork linked above I already included the ASP/Kestrel units. I tried inlining the second method, but it made little difference; I will have another look. The buffered read is worth trying too.

geoffkizer commented 6 years ago

It might not matter much. Or it might already be inlined. A profile would tell for sure.

geoffkizer commented 6 years ago

> On my PR linked above I already included the ASP/Kestrel units.

I saw that. We need to add these to corefx.

geoffkizer commented 6 years ago

There is an issue with lastDecodedBits as I described above.

Fixed in dotnet/corefx#32043. I also added a bunch of tests in that PR. Clearly the existing ones weren't sufficient...

jcdickinson commented 6 years ago

@grant-d and I have been messing about with this, and we have arrived at a completely different algorithm that should be around O(N) (where N = bits). It's probably close to the limit of optimization unless a vectorized algorithm is used, which is likely overkill.

 Method |     Mean |    Error |   StdDev | Scaled | Allocated |
------- |---------:|---------:|---------:|-------:|----------:|
   Jump | 606.3 ns | 2.460 ns | 1.920 ns |   0.73 |       0 B |
OrigOpt | 755.7 ns | 6.260 ns | 4.887 ns |   0.91 |       0 B |
   Orig | 833.2 ns | 2.639 ns | 2.203 ns |   1.00 |       0 B |

The idea is to make a binary tree, conceptually this:

class Node {
  public Node Zero;
  public Node One;
  public byte? ZeroByte;
  public byte? OneByte;
}

You traverse the tree, feeding it one bit at a time until either ZeroByte or OneByte contain a value - at which point you emit the byte and reset to the root of the tree. The actual implementation compresses the tree into an array (as binary trees can be represented in flat arrays) of ushort tuples of length two (zero, one). If the MSB of the ushort is set it signifies a terminal/byte (terminate, emit and return to root); otherwise, it indicates the next index in the array to visit.
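
In code, the flattened traversal looks roughly like this (table layout and builder are assumptions about the approach, not the actual implementation):

// s_jumpTable[node, bit]: if the MSB is set, the low 8 bits are the decoded byte
// (emit and reset to the root); otherwise the value is the index of the next node.
static readonly ushort[,] s_jumpTable = BuildJumpTable();   // built from the HPACK code table (omitted)

static int DecodeJump(byte[] src, byte[] dst)
{
    int node = 0;       // root
    int dstIndex = 0;

    foreach (byte b in src)
    {
        for (int bit = 7; bit >= 0; bit--)      // feed one bit at a time, MSB first
        {
            ushort next = s_jumpTable[node, (b >> bit) & 1];
            if ((next & 0x8000) != 0)
            {
                dst[dstIndex++] = (byte)next;   // terminal: emit the symbol, return to root
                node = 0;
            }
            else
            {
                node = next;                    // interior node: keep walking
            }
        }
    }

    // trailing padding bits and invalid sequences are not validated in this sketch
    return dstIndex;
}

static ushort[,] BuildJumpTable() => throw new NotImplementedException();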

geoffkizer commented 6 years ago

Interesting. I still suspect the table lookup approach will be faster, though. It's similar to your algorithm, except that you basically feed it a byte at a time instead of a single bit.

jcdickinson commented 6 years ago

I included the dictionary/lookup table in the benchmark; it (the Dict rows below) takes longer than the original:

  Method |       Mean |    Error |   StdDev |     Median | Scaled | ScaledSD | Allocated |
-------- |-----------:|---------:|---------:|-----------:|-------:|---------:|----------:|
    Jump |   759.1 ns | 15.12 ns | 36.81 ns |   744.6 ns |   0.70 |     0.03 |       0 B |
 DictOpt | 1,439.8 ns | 28.75 ns | 42.15 ns | 1,423.1 ns |   1.32 |     0.04 |       0 B |
    Dict | 1,426.7 ns | 18.98 ns | 15.85 ns | 1,425.6 ns |   1.31 |     0.02 |       0 B |
 OrigOpt |   921.9 ns | 18.28 ns | 41.26 ns |   908.7 ns |   0.85 |     0.04 |       0 B |
    Orig | 1,087.5 ns | 18.37 ns | 15.34 ns | 1,083.4 ns |   1.00 |     0.00 |       0 B |

The problem is that the dictionary is hit (including negative tests) for every bit in a symbol after the 4th (5th?) bit. The DFA/jump mechanism avoids this: every test in the tree is a positive test unless an invalid/terminal symbol is encountered.

jcdickinson commented 6 years ago

@jbhensley unless you can find a faster algorithm, feel free to take the jump table and get your first netfx PR.

jbhensley commented 6 years ago

I just added the array implementation to the test project. In particular this commit.

Benchmark puts it very close to jump, but slightly behind.

|  Method |       Mean |      Error |     StdDev |     Median | Scaled | ScaledSD | Allocated |
|-------- |-----------:|-----------:|-----------:|-----------:|-------:|---------:|----------:|
|    Jump |   617.8 ns |  9.2048 ns |  8.1598 ns |   613.9 ns |   0.65 |     0.03 |       0 B |
| OrigOpt |   796.7 ns |  5.5610 ns |  4.9297 ns |   798.0 ns |   0.84 |     0.03 |       0 B |
| DictOpt | 1,174.0 ns |  1.4705 ns |  1.3035 ns | 1,173.4 ns |   1.24 |     0.05 |       0 B |
|    Dict | 1,213.7 ns | 24.2705 ns | 43.7647 ns | 1,207.8 ns |   1.28 |     0.07 |       0 B |
|    Orig |   950.6 ns | 18.8892 ns | 37.7237 ns |   925.2 ns |   1.00 |     0.00 |       0 B |
|   Array |   650.4 ns |  0.5128 ns |  0.4546 ns |   650.5 ns |   0.69 |     0.03 |       0 B |
jbhensley commented 6 years ago

I haven't touched the overload public static int Decode(byte[] src, int offset, int count, byte[] dst) yet. Looking at that now. The array method processes one byte at a time, so pushing 4 bytes together with

                uint next = (uint)(src[i] << 24 + lastDecodedBits);
                next |= (i + 1 < src.Length ? (uint)(src[i + 1] << 16 + lastDecodedBits) : 0);
                next |= (i + 2 < src.Length ? (uint)(src[i + 2] << 8 + lastDecodedBits) : 0);
                next |= (i + 3 < src.Length ? (uint)(src[i + 3] << lastDecodedBits) : 0);

and then splitting them back up again seems like extra work. I'll see if the array implementation can be optimized by refactoring the calling overload.

jcdickinson commented 6 years ago

@jbhensley sent you a PR with micro-optimizations; Array is now way faster. Nicely done!

I'm not sure if the decode method with ints is used at all (it is public, so I assumed it was) - the jump mechanism ignores it completely for the byte array overload. Maybe leave it as is and look at the byte array overload independently.

grant-d commented 6 years ago

Nice job. With some optimizations on Array I get the following results:

  Method |     Mean |     Error |    StdDev | Scaled | Allocated |
-------- |---------:|----------:|----------:|-------:|----------:|
    Jump | 605.8 ns |  4.494 ns |  3.752 ns |   0.68 |       0 B |
 OrigOpt | 739.6 ns | 10.508 ns |  9.830 ns |   0.84 |       0 B |
    Orig | 884.4 ns | 12.059 ns | 10.070 ns |   1.00 |       0 B |
   Array | 562.3 ns |  4.022 ns |  3.565 ns |   0.64 |       0 B |
jbhensley commented 6 years ago

Nice. I'll take a look at your changes.

grant-d commented 6 years ago

Correction, even better :-)

  Method |     Mean |     Error |    StdDev | Scaled | ScaledSD | Allocated |
-------- |---------:|----------:|----------:|-------:|---------:|----------:|
    Jump | 643.3 ns | 12.645 ns | 12.986 ns |   0.72 |     0.02 |       0 B |
 OrigOpt | 739.5 ns | 15.968 ns | 15.683 ns |   0.83 |     0.02 |       0 B |
    Orig | 889.0 ns |  8.995 ns |  8.414 ns |   1.00 |     0.00 |       0 B |
   Array | 498.6 ns |  3.865 ns |  3.615 ns |   0.56 |     0.01 |       0 B |

My changes are in my fork. Please pull my latest (if you want), then I can start working on yours directly.

geoffkizer commented 6 years ago

Nice work. Great to see the results here!

Re this code:

                int value = s_decodingArray[arrayIndex, workingByte];
                // if the value is positive then we have a pointer into the encoding table
                if (value > -1)
                {
                    var entry = s_encodingTable[value];
                    if (entry.bitLength > validBits)
                        break;  // we only found a value by incorporating bits beyond the valid remaining length of the data stream
                    decodedBits = s_encodingTable[value].bitLength;
                    return value;   // the index is also the value
                }

You don't need to use s_encodingTable here at all. Just put the bitLength into the decoding table and you can just do one lookup on it.
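
For illustration (the entry layout here is assumed, not the actual table), each slot could carry both pieces so the fragment above becomes a single lookup:

struct DecodingEntry
{
    public short Value;         // decoded symbol when BitLength > 0, otherwise a next-table index
    public byte BitLength;
}

// inside the decode loop, replacing the s_encodingTable access:
DecodingEntry entry = s_decodingArray[arrayIndex, workingByte];
if (entry.BitLength > 0)
{
    if (entry.BitLength > validBits)
        break;      // the match used bits beyond the valid remaining length
    decodedBits = entry.BitLength;
    return entry.Value;
}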

jbhensley commented 6 years ago

Good point. I'll make that change.

I still want to take a crack at integrating public static int Decode(byte[] src, int offset, int count, byte[] dst) and public static int Decode(uint data, int validBits, out int decodedBits) or at least rewiring the former to take advantage of how the latter is now processing.

geoffkizer commented 6 years ago

Re the construction of next on every loop iteration: I agree, there's probably a win here. See my comment above for suggestions: https://github.com/dotnet/corefx/issues/31751#issuecomment-417535191

jbhensley commented 6 years ago

@grant-d I think I bungled the merge from your fork, since I had pulled @jcdickinson's changes earlier and that created conflicts with yours. I accidentally pushed to you while trying to resolve them. I assume you can just roll that back. Sorry.