geoffkizer opened this issue 6 years ago
What were your thoughts regarding how this might work? I took a look at the code, but it wasn't immediately apparent that the current implementation could be improved much. Here are my observations:
The only thing I saw in the code that I thought could be improved was the repeated calculation of codeMax
even though that value is essentially a constant for each entry in the decoding table. Expanding the table to be a 3-tuple that included codeMax
would avoid that, but the cost there is likely very insignificant.
Mind you, I'm new to the code so I could be way off base here. Looking to see if you could clarify this a bit.
@geoffkizer thoughts?
IMHO one of the best implementations of a Huffman en/decoder is available here: https://github.com/Cyan4973/FiniteStateEntropy
It's written in C, but it's one of the best examples of a Huffman implementation. It ships in a library together with a Finite State Entropy
en/decoder written using Asymmetric Numeral Systems (Huffman coding is a special case of ANS). ANS is a separate story; I plan to submit an API proposal to include zstd
compression in .NET Core and will write more then.
The other hash table option would be to generate a massive "rainbow table"-like structure where all combinations of bits beyond the significant ones, up to your key length, are also hashed.
It doesn't need to be a single, massive hash table. It can be a multi-level table. Take the first 8 bits (or 10, or whatever) and use it as the key to the first-level table. This results in either a decoded value and a bit length, or a reference to a second-level table -- if so, take the next 8 bits and index into the second table, etc.
Choosing the best key size for each table involves a tradeoff -- smaller key sizes mean smaller tables but more lookups, and vice-versa. You can tune this as appropriate. With 8 bits, your main table is only 512 bytes (256 entries, 1 byte for decoded value, 1 byte for either bit length or next table id), and the most common values (including all alphanumerics) will be decoded in a single table lookup.
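The multi-level scheme described above can be sketched as follows. This is a minimal illustration, not the real HPACK table: it uses a toy 4-bit key and a made-up seven-symbol prefix code (a=0, b=10, c=110, plus four 5-bit codes), and the `Entry` layout and names are assumptions for the sketch.

```csharp
using System;

static class MultiLevelHuffman
{
    const int KeyBits = 4; // toy key size; the comment above suggests 8 for the real table

    // BitLength > 0 => decoded symbol (total bits consumed);
    // BitLength == 0 => descend into s_tables[NextTable].
    struct Entry { public byte Value; public byte BitLength; public byte NextTable; }

    // Toy prefix code: a=0, b=10, c=110, d=11100, e=11101, f=11110, g=11111.
    static readonly Entry[][] s_tables = BuildTables();

    static Entry[][] BuildTables()
    {
        var root = new Entry[16];
        for (int i = 0; i < 8; i++) root[i] = new Entry { Value = (byte)'a', BitLength = 1 };  // 0xxx
        for (int i = 8; i < 12; i++) root[i] = new Entry { Value = (byte)'b', BitLength = 2 }; // 10xx
        root[12] = root[13] = new Entry { Value = (byte)'c', BitLength = 3 };                  // 110x
        root[14] = new Entry { NextTable = 1 };                                               // 1110...
        root[15] = new Entry { NextTable = 2 };                                               // 1111...

        var t1 = new Entry[16]; // prefix 1110: the fifth bit picks d or e
        var t2 = new Entry[16]; // prefix 1111: the fifth bit picks f or g
        for (int i = 0; i < 16; i++)
        {
            t1[i] = new Entry { Value = (byte)(i < 8 ? 'd' : 'e'), BitLength = 5 };
            t2[i] = new Entry { Value = (byte)(i < 8 ? 'f' : 'g'), BitLength = 5 };
        }
        return new[] { root, t1, t2 };
    }

    // data holds the encoded bits left-aligned, matching the existing Decode signature.
    public static int Decode(uint data, int validBits, out int decodedBits)
    {
        int table = 0, consumed = 0;
        while (consumed + KeyBits <= 32)
        {
            uint key = (data >> (32 - KeyBits - consumed)) & ((1u << KeyBits) - 1);
            Entry e = s_tables[table][key];
            if (e.BitLength > 0)
            {
                if (e.BitLength > validBits) break; // match needs bits beyond the stream
                decodedBits = e.BitLength;
                return e.Value;
            }
            table = e.NextTable; // descend one level and take the next key's worth of bits
            consumed += KeyBits;
        }
        decodedBits = 0;
        return -1;
    }
}
```

All codes no longer than the key size resolve in a single lookup; only the rarer long codes pay for a second-level table.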
I explored that approach briefly as well, but it just seemed like we were back to looping again. A mock-up to see if there is any real perf differential would be the way to go here. I might be able to take a crack at that next week.
There's no looping if the value is encoded in 8 bits or less. (Or whatever key size you choose.) And this covers the most common values, including all alphanumerics and common punctuation.
Agreed, encoded values up to the chosen key size would be found immediately. A couple more questions:
Is there a compiled list of common request headers you guys have that can be used for perf testing? If not, I can put together my own.
Not really.
I think a collection initializer is fine. If necessary, it's easy to switch this to a Lazy<T>
later.
The codeMax optimization that @jbhensley suggested has an immediate ~10% benefit in micro-benchmark testing:
Method | Mean | Error | StdDev | Scaled | ScaledSD |
------- |---------:|---------:|---------:|-------:|---------:|
New | 86.32 ns | 1.864 ns | 1.915 ns | 0.89 | 0.05 |
Core | 97.49 ns | 1.950 ns | 5.069 ns | 1.00 | 0.00 |
Here's the core change; note the additional, manually computed deltaLength
member:
private static readonly (int codeLength, int deltaLength, int[] codes)[] s_decodingTable = new[]
{
(05, 0, new[] { 48, 49, 50, 97, 99, 101, 105, 111, 115, 116 }), // deltaLength must be 0 for related optimizations to work
(06, 1, new[] { 32, 37, 45, 46, 47, 51, 52, 53, 54, 55, 56, 57, 61, 65, 95, 98, 100, 102, 103, 104, 108, 109, 110, 112, 114, 117 }),
(07, 1, new[] { 58, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 89, 106, 107, 113, 118, 119, 120, 121, 122 }),
(08, 1, new[] { 38, 42, 44, 59, 88, 90 }),
(10, 2, new[] { 33, 34, 40, 41, 63 }),
(11, 1, new[] { 39, 43, 124 }),
(12, 1, new[] { 35, 62 }),
(13, 1, new[] { 0, 36, 64, 91, 93, 126 }),
(14, 1, new[] { 94, 125 }),
(15, 1, new[] { 60, 96, 123 }),
(19, 4, new[] { 92, 195, 208 }),
(20, 1, new[] { 128, 130, 131, 162, 184, 194, 224, 226 }),
(21, 1, new[] { 153, 161, 167, 172, 176, 177, 179, 209, 216, 217, 227, 229, 230 }),
(22, 1, new[] { 129, 132, 133, 134, 136, 146, 154, 156, 160, 163, 164, 169, 170, 173, 178, 181, 185, 186, 187, 189, 190, 196, 198, 228, 232, 233 }),
(23, 1, new[] { 1, 135, 137, 138, 139, 140, 141, 143, 147, 149, 150, 151, 152, 155, 157, 158, 165, 166, 168, 174, 175, 180, 182, 183, 188, 191, 197, 231, 239 }),
(24, 1, new[] { 9, 142, 144, 145, 148, 159, 171, 206, 215, 225, 236, 237 }),
(25, 1, new[] { 199, 207, 234, 235 }),
(26, 1, new[] { 192, 193, 200, 201, 202, 205, 210, 213, 218, 219, 238, 240, 242, 243, 255 }),
(27, 1, new[] { 203, 204, 211, 212, 214, 221, 222, 223, 241, 244, 245, 246, 247, 248, 250, 251, 252, 253, 254 }),
(28, 1, new[] { 2, 3, 4, 5, 6, 7, 8, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 127, 220, 249 }),
(30, 2, new[] { 10, 13, 22, 256 })
};
@grant-d What code are you using to test this?
Code optimization, without changing the algorithm, gets us to ~23% improvement in the micro-bench. (This needs real-world testing to get a measure of any true improvement)
Method | Mean | Error | StdDev | Scaled | ScaledSD | Allocated |
---------- |---------:|----------:|----------:|-------:|---------:|----------:|
Optimized | 71.61 ns | 0.8184 ns | 0.7656 ns | 0.77 | 0.02 | 0 B |
Legacy | 93.17 ns | 1.7961 ns | 2.3977 ns | 1.00 | 0.00 | 0 B |
For example, changed this:
for (var i = 0; i < s_decodingTable.Length && s_decodingTable[i].codeLength <= validBits; i++)
{
(var codeLength, var codes) = s_decodingTable[i];
if (i > 0)
{
codeMax <<= codeLength - s_decodingTable[i - 1].codeLength;
}
codeMax += codes.Length;
To this:
for (var i = 0; i < s_decodingTable.Length; i++)
{
// deltaLength is precomputed codeMax delta
var (codeLength, deltaLength, codes) = s_decodingTable[i];
// Move check out of for-loop to leverage cached value for codeLength
if (codeLength > validBits)
break;
// Mitigate the if (i > 0) branch by ensuring s_decodingTable[0].deltaLength==0
codeMax = (codeMax << deltaLength) + codes.Length;
Another example. This:
var next = (uint)(src[i] << 24 + lastDecodedBits);
next |= (i + 1 < src.Length ? (uint)(src[i + 1] << 16 + lastDecodedBits) : 0);
next |= (i + 2 < src.Length ? (uint)(src[i + 2] << 8 + lastDecodedBits) : 0);
next |= (i + 3 < src.Length ? (uint)(src[i + 3] << lastDecodedBits) : 0)
Can be optimized as follows, since if the first src.Length
branch is false, then there's no way the 2nd & 3rd branches can be true:
var next = (uint)(src[i] << 24 + lastDecodedBits);
if (i + 1 < src.Length)
{
next |= (uint)(src[i + 1] << 16 + lastDecodedBits);
if (i + 2 < src.Length)
{
next |= (uint)(src[i + 2] << 8 + lastDecodedBits);
if (i + 3 < src.Length)
next |= (uint)(src[i + 3] << lastDecodedBits);
}
}
I have code for using the multi-level table and am currently working on cleaning it up a bit and testing. Decoding becomes simple:
public static int Decode(uint data, int validBits, out int decodedBits)
{
int byteNumber = 3; // grab the most significant byte
uint virtualDictionaryMask = 0; // used to key into different "virtual" dictionaries
DecodingTableEntry entry;
do
{
// extract the working byte
uint workingByte = data >> (byteNumber * 8) & 0xFF;
// apply virtual dictionary bitmask
workingByte |= virtualDictionaryMask;
// key into the dictionary
if (!s_decodingDictionary.TryGetValue(workingByte, out entry))
break; // we should either get an entry or bitmask for the next virtual dictionary. if we get neither then
// the bit pattern is not consistent with any entry we have
// if we get a length back then we have found the decoded value
if (entry.BitLength > 0)
{
if (entry.BitLength > validBits)
break; // we only found a value by incorporating bits beyond the valid remaining length of the data stream
decodedBits = entry.BitLength;
return (int)entry.DecodedValue;
}
// otherwise, we have found a mask that lets us key into the next virtual dictionary
else
{
virtualDictionaryMask = entry.DecodedValue;
byteNumber--;
}
} while (entry.BitLength == 0);
// no luck. signal to caller that we could not decode
decodedBits = 0;
return -1;
}
Building the table is a bit ugly and will likely need to be done with a Lazy<T>
that calls a method to construct it since the total size is 3840 entries. We could still use a collection initializer and surround it with a #region
I suppose. Either way, I'm a little concerned about the memory footprint.
Since there would be 15 tables in total, I took the approach of using bitmasks on the key value of a single table to create several "virtual" ones within it. The "root" table has no mask, so you initially key in with the value you want to decode. If you get a bit length back then you have the decoded value. Otherwise, you have a bitmask to apply to your next byte when keying in again. The bitmask represents the next "virtual" table.
You have more total entries (3840) than necessary.
The fourth-level tables only need 6 bits to index them (max encoded length is 30 bits). That by itself cuts the total number of entries in half, if my math is correct.
Additionally, at least some intermediate tables don't need all 8 bits either. For example, all encodings that start with 11111110 are exactly 10 bits in length, so the second level table here can be exactly 4 entries instead of the 256 you'd need if you indexed with a full 8 bits.
Dealing with all this will complicate the code and isn't necessary to get decent perf results.
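The table-size arithmetic above can be sanity-checked with a small helper. This is purely illustrative; `LevelKeyBits` is not part of the real code, just a sketch of the bit accounting:

```csharp
using System;
using System.Collections.Generic;

static class TableSizing
{
    // For a given maximum code length and per-level key size, return how many
    // bits each level of a multi-level lookup table actually needs to index.
    public static int[] LevelKeyBits(int maxCodeLength, int keyBits)
    {
        var levels = new List<int>();
        for (int remaining = maxCodeLength; remaining > 0; remaining -= keyBits)
            levels.Add(Math.Min(keyBits, remaining)); // last level gets the leftovers
        return levels.ToArray();
    }
}
```

`LevelKeyBits(30, 8)` yields `{ 8, 8, 8, 6 }`: with a 30-bit maximum code length, a fourth-level table only ever sees 6 leftover bits, so 2^6 = 64 entries suffice instead of 256.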
True, the storage could be optimized. Right now, however, the perf results are not looking very promising. Oddly enough, my test with a pre-computed codeMax
is worse, which I cannot explain.
Test set consists of 792 lines of actual HTTP request header data. The file is read and Huffman encoded into bytes as part of the setup, and then each decoding method gets its own crack at it, while verifying that correct values are being decoded. I kind of expected something more dramatic than this.
| Method | Mean | Error | StdDev | Scaled | ScaledSD | Gen 0 | Gen 1 | Allocated |
|----------------------------------------- |--------------:|--------------:|--------------:|---------:|---------:|--------:|--------:|----------:|
| BuildHuffmanDictionary | 112,665.66 ns | 2,241.0968 ns | 2,096.3232 ns | 6,667.28 | 119.88 | 31.8604 | 10.6201 | 201624 B |
| ProcessHeadersWithLoop | 16.90 ns | 0.0089 ns | 0.0074 ns | 1.00 | 0.00 | - | - | 0 B |
| ProcessHeadersWithLoopAndComputedCodeMax | 17.39 ns | 0.3766 ns | 0.5637 ns | 1.03 | 0.03 | - | - | 0 B |
| ProcessHeadersWithDictionary | 16.88 ns | 0.3689 ns | 0.3451 ns | 1.00 | 0.02 | - | - | 0 B |
For reference, my version of the pre-computed codeMax
treats it as a complete constant:
public static readonly (int codeLength, int codeMax, int[] codes)[] s_decodingTable2 = new[]
{
(5, 10, new[] { 48, 49, 50, 97, 99, 101, 105, 111, 115, 116 }),
(6, 46, new[] { 32, 37, 45, 46, 47, 51, 52, 53, 54, 55, 56, 57, 61, 65, 95, 98, 100, 102, 103, 104, 108, 109, 110, 112, 114, 117 }),
(7, 124, new[] { 58, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 89, 106, 107, 113, 118, 119, 120, 121, 122 }),
(8, 254, new[] { 38, 42, 44, 59, 88, 90 }),
(10, 1021, new[] { 33, 34, 40, 41, 63 }),
(11, 2045, new[] { 39, 43, 124 }),
(12, 4092, new[] { 35, 62 }),
(13, 8190, new[] { 0, 36, 64, 91, 93, 126 }),
(14, 16382, new[] { 94, 125 }),
(15, 32767, new[] { 60, 96, 123 }),
(19, 524275, new[] { 92, 195, 208 }),
(20, 1048558, new[] { 128, 130, 131, 162, 184, 194, 224, 226 }),
(21, 2097129, new[] { 153, 161, 167, 172, 176, 177, 179, 209, 216, 217, 227, 229, 230 }),
(22, 4194284, new[] { 129, 132, 133, 134, 136, 146, 154, 156, 160, 163, 164, 169, 170, 173, 178, 181, 185, 186, 187, 189, 190, 196, 198, 228, 232, 233 }),
(23, 8388597, new[] { 1, 135, 137, 138, 139, 140, 141, 143, 147, 149, 150, 151, 152, 155, 157, 158, 165, 166, 168, 174, 175, 180, 182, 183, 188, 191, 197, 231, 239 }),
(24, 16777206, new[] { 9, 142, 144, 145, 148, 159, 171, 206, 215, 225, 236, 237 }),
(25, 33554416, new[] { 199, 207, 234, 235 }),
(26, 67108847, new[] { 192, 193, 200, 201, 202, 205, 210, 213, 218, 219, 238, 240, 242, 243, 255 }),
(27, 134217713, new[] { 203, 204, 211, 212, 214, 221, 222, 223, 241, 244, 245, 246, 247, 248, 250, 251, 252, 253, 254 }),
(28, 268435455, new[] { 2, 3, 4, 5, 6, 7, 8, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 127, 220, 249 }),
(30, 1073741824, new[] { 10, 13, 22, 256 })
};
And the loop decoder:
public static int DecodeWithLoopAndCodeMax(uint data, int validBits, out int decodedBits)
{
for (int i = 0; i < s_decodingTable2.Length && s_decodingTable2[i].codeLength <= validBits; i++)
{
(int codeLength, int codeMax, int[] codes) = s_decodingTable2[i];
int mask = int.MinValue >> (codeLength - 1);
long masked = (data & mask) >> (32 - codeLength);
if (masked < codeMax)
{
decodedBits = codeLength;
return codes[codes.Length - (codeMax - masked)];
}
}
decodedBits = 0;
return -1;
}
Can you link to the full code?
For one thing, you really should run more iterations. All your measured times are < 20ns, which is super small. I'd run enough iterations to have times above 1s, if possible.
I'll get the test project up and link, but that will probably be tomorrow.
Good idea @jbhensley, amongst other tricks that codeMax
calc got it ~35% faster than the original.
I am also interested in your test cases, mine are trivial.
Does ProcessHeadersWithLoop
in your tests reflect the original, unchanged, code?
Method | Mean | Error | StdDev | Scaled | ScaledSD | Allocated |
---------------------- |---------:|----------:|---------:|-------:|---------:|----------:|
OptimizedCodeAndTable | 49.29 ns | 0.9742 ns | 1.628 ns | 0.65 | 0.02 | 0 B |
OriginalCode | 75.45 ns | 1.1256 ns | 1.053 ns | 1.00 | 0.00 | 0 B |
I'm new to corefx, so I'm struggling a little with the benchmarking. I'll take any pointers you have in regards to that or the code in general.
Also, I trimmed duplicate line items out of my headers.txt file so it's a bit smaller now.
@jbhensley I have created a fork of your project, added my code to it, then cleaned it all up a bit. I have changed the layout of your benchmark, since the enum
approach did not allow you to optimize the outer loop. So each algorithm is just a duplicated Huffman<foo>.cs
file in its entirety now. I have also added your headers.txt
data to all the benchmarks.
Here's the latest results (see the README for details on algorithms).
You are right that the IDictionary
approach is slower, I think it's probably due to the overhead of hashing and branching. Perhaps a hard-coded switch
for a small subset of high-frequency values might work?
Method | Mean | Error | StdDev | Scaled | ScaledSD | Allocated |
-------- |-----------:|---------:|---------:|-------:|---------:|----------:|
DictOpt | 1,281.3 ns | 24.27 ns | 22.70 ns | 1.41 | 0.03 | 0 B |
Dict | 1,311.6 ns | 24.41 ns | 22.83 ns | 1.45 | 0.03 | 0 B |
OrigOpt | 822.8 ns | 16.08 ns | 15.79 ns | 0.91 | 0.02 | 0 B |
Orig | 907.3 ns | 11.78 ns | 10.44 ns | 1.00 | 0.00 | 0 B |
Thanks. I'll take a look at what you've done with this.
My benchmarking probably wasn't as valid, but I was pretty much coming up with the same result. The looping implementation (either original or with codeMax
pre-calculated) is able to return the top 10 most frequently used values on the first pass through the loop with only a few bit shift operations and numeric calcs/comparisons. The second pass gets you 26 more. I'm betting that hashing and keying isn't able to keep up with how quickly the loop is able to return the first several levels of most frequently used values. Maybe dictionary is faster for the less frequently used where the loop has to make more iterations, but it's not able to gain enough on those to make up for what it loses on the others.
Thanks for posting the code.
Don't use a dictionary for the table lookup. Just use an array. It's much cheaper.
I think that might be the ticket. I'll give it a shot tomorrow. Should simplify the code a bit as well.
Also -- In case you're not already doing this, you should do a warmup run before the real run, in the same process. Otherwise you may be including things like JIT costs in the run.
And I still think you should do many more iterations. The times above are in the 1ms range. Do 1000x more iterations and get it in the 1s range.
Or maybe you are already doing lots of iterations? I'm not familiar with the benchmark framework here, so not quite sure how it's executing the code.
Looks like the framework does all of the warmup for you. I tried giving it a large iteration count and it appears to be running each benchmark method more times, but the average runtime of any given iteration is still small. I could probably bring the numbers up if I added iterations in the method itself with a for loop instead of telling the framework to run it more.
Unrelated thought:
The per-symbol Decode method being separate isn't ideal. It's only called in one place, and it's probably large enough that the JIT is not inlining it. MethodImplOptions.AggressiveInlining might help here. Eventually we should probably just fold this into the main Decode routine.
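For reference, applying the hint looks like this; the attribute is advisory rather than a guarantee, and the method body here is a stub standing in for the real decoder shown elsewhere in the thread:

```csharp
using System.Runtime.CompilerServices;

static class Huffman
{
    // Hint the JIT to inline the per-symbol decoder into its single call site.
    // AggressiveInlining is only a hint; a profile is the way to confirm it helped.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int Decode(uint data, int validBits, out int decodedBits)
    {
        decodedBits = 0; // body elided -- see the implementations earlier in the thread
        return -1;
    }
}
```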
Re this code that @grant-d mentioned above:
var next = (uint)(src[i] << 24 + lastDecodedBits);
next |= (i + 1 < src.Length ? (uint)(src[i + 1] << 16 + lastDecodedBits) : 0);
next |= (i + 2 < src.Length ? (uint)(src[i + 2] << 8 + lastDecodedBits) : 0);
next |= (i + 3 < src.Length ? (uint)(src[i + 3] << lastDecodedBits) : 0)
I think we can improve this. We are re-reading all 4 bytes from the source buffer each time. We shouldn't need to do this. We should just shift the current value (in next
) by the number of decoded bits, and then read byte by byte from the source buffer into next
until we've got at least 30 bits.
So initially, we would read the first 4 bytes into next
, and set validBits = 32
. Then at the end of each loop iteration, we shift next by decodedBits
, subtract decodedBits
from validBits
, and read from the source buffer until validBits
is >= 30.
This gets tricky because the smallest encoded value is 5 bits, so it's possible that decodedBits
is 5 and validBits
is 32, meaning validBits
becomes 27 and we need to read another byte to get it over 30. Unfortunately this means we need more than 32 bits. Changing next
from a uint to a ulong
is a quick fix for this, and shouldn't actually matter in practice on a 64 bit machine.
In a table-based lookup scheme, we don't actually need validBits
to be >= 30; we just need it to be >= whatever the next table lookup size is (commonly 8). So that would simplify this as well.
Actually, looking at this code, I don't quite understand how it works in certain cases.
Imagine that lastDecodedBits
is 7 after the last loop iteration. Then we are only actually going to read (32-7)=25 bits from the source buffer. If the next encoded value has more than 25 bits, it seems like this logic is broken. Am I missing something here?
This points out that we need much better tests for this code :)
We actually don't currently have any tests. At a minimum we should steal the ASP.NET tests. And probably add a bunch of new cases as well. This is all pretty tricky logic and we need thorough tests to ensure we're doing the right thing here.
We'll need these tests before we can check in any perf improvements here.
On my fork linked above I already included the ASP/Kestrel units. I tried inlining the second method but it made little difference; I will have another look. The buffered read is worth trying too
It might not matter much. Or it might already be inlined. A profile would tell for sure.
On my PR linked above I already included the ASP/Kestrel units.
I saw that. We need to add these to corefx.
There is an issue with lastDecodedBits
as I described above.
Fixed in dotnet/corefx#32043. I also added a bunch of tests in that PR. Clearly the existing ones weren't sufficient...
@grant-d and I have been messing about with this, and have arrived at a completely different algorithm that should be around O(N) (where N = bits). It's probably close to the limit of optimization unless a vectorized algorithm is used - which is likely overkill.
Method | Mean | Error | StdDev | Scaled | Allocated |
---|---|---|---|---|---|
Jump | 606.3 ns | 2.460 ns | 1.920 ns | 0.73 | 0 B |
OrigOpt | 755.7 ns | 6.260 ns | 4.887 ns | 0.91 | 0 B |
Orig | 833.2 ns | 2.639 ns | 2.203 ns | 1.00 | 0 B |
The idea is to make a binary tree, conceptually this:
class Node {
public Node Zero;
public Node One;
public byte? ZeroByte;
public byte? OneByte;
}
You traverse the tree, feeding it one bit at a time until either ZeroByte or OneByte contain a value - at which point you emit the byte and reset to the root of the tree. The actual implementation compresses the tree into an array (as binary trees can be represented in flat arrays) of ushort tuples of length two (zero, one). If the MSB of the ushort is set it signifies a terminal/byte (terminate, emit and return to root); otherwise, it indicates the next index in the array to visit.
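Concretely, the flattened form might look like the following sketch, again using a toy three-symbol code (a=0, b=10, c=11) rather than the 257-symbol HPACK table; the MSB convention is the one described above, and the names are illustrative:

```csharp
static class JumpDecoder
{
    const ushort Terminal = 0x8000; // MSB set => emit the symbol and reset to the root

    // s_jump[node, bit]: either a terminal (symbol in the low bits) or the
    // index of the next node to visit. Toy code: a=0, b=10, c=11.
    static readonly ushort[,] s_jump =
    {
        { Terminal | 'a', 1 },              // root: bit 0 => 'a', bit 1 => node 1
        { Terminal | 'b', Terminal | 'c' }, // node 1: 10 => 'b', 11 => 'c'
    };

    public static int DecodeAll(byte[] src, byte[] dst)
    {
        int node = 0, j = 0;
        foreach (byte b in src)
        {
            for (int bit = 7; bit >= 0; bit--) // feed one bit at a time, MSB first
            {
                ushort next = s_jump[node, (b >> bit) & 1];
                if ((next & Terminal) != 0)
                {
                    dst[j++] = (byte)(next & 0x7FFF); // emit and return to the root
                    node = 0;
                }
                else
                {
                    node = next;
                }
            }
        }
        return j;
    }
}
```

Every bit costs exactly one array load and one branch, which is why the per-symbol cost stays proportional to the code length.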
Interesting. I still suspect the table lookup approach will be faster, though. It's similar to your algorithm, except that you basically feed it a byte at a time instead of a single bit.
I included the dictionary/lookup table in the benchmark; it takes longer than the original:
Method | Mean | Error | StdDev | Median | Scaled | ScaledSD | Allocated |
---|---|---|---|---|---|---|---|
Jump | 759.1 ns | 15.12 ns | 36.81 ns | 744.6 ns | 0.70 | 0.03 | 0 B |
DictOpt | 1,439.8 ns | 28.75 ns | 42.15 ns | 1,423.1 ns | 1.32 | 0.04 | 0 B |
Dict | 1,426.7 ns | 18.98 ns | 15.85 ns | 1,425.6 ns | 1.31 | 0.02 | 0 B |
OrigOpt | 921.9 ns | 18.28 ns | 41.26 ns | 908.7 ns | 0.85 | 0.04 | 0 B |
Orig | 1,087.5 ns | 18.37 ns | 15.34 ns | 1,083.4 ns | 1.00 | 0.00 | 0 B |
The problem is that the dictionary is hit (including negative tests) for every bit in a symbol after the 4th (5th?) bit. The DFA/jump mechanism avoids this - every test in the tree is a positive test unless an invalid/terminal symbol is encountered.
@jbhensley unless you can find a faster algorithm, feel free to take the jump table and get your first netfx PR.
I just added the array implementation to the test project. In particular this commit.
Benchmark puts it very close to jump, but slightly behind.
| Method | Mean | Error | StdDev | Median | Scaled | ScaledSD | Allocated |
|-------- |-----------:|-----------:|-----------:|-----------:|-------:|---------:|----------:|
| Jump | 617.8 ns | 9.2048 ns | 8.1598 ns | 613.9 ns | 0.65 | 0.03 | 0 B |
| OrigOpt | 796.7 ns | 5.5610 ns | 4.9297 ns | 798.0 ns | 0.84 | 0.03 | 0 B |
| DictOpt | 1,174.0 ns | 1.4705 ns | 1.3035 ns | 1,173.4 ns | 1.24 | 0.05 | 0 B |
| Dict | 1,213.7 ns | 24.2705 ns | 43.7647 ns | 1,207.8 ns | 1.28 | 0.07 | 0 B |
| Orig | 950.6 ns | 18.8892 ns | 37.7237 ns | 925.2 ns | 1.00 | 0.00 | 0 B |
| Array | 650.4 ns | 0.5128 ns | 0.4546 ns | 650.5 ns | 0.69 | 0.03 | 0 B |
I haven't touched the overload public static int Decode(byte[] src, int offset, int count, byte[] dst)
yet. Looking at that now. The array method processes one byte at a time, so pushing 4 bytes together with
uint next = (uint)(src[i] << 24 + lastDecodedBits);
next |= (i + 1 < src.Length ? (uint)(src[i + 1] << 16 + lastDecodedBits) : 0);
next |= (i + 2 < src.Length ? (uint)(src[i + 2] << 8 + lastDecodedBits) : 0);
next |= (i + 3 < src.Length ? (uint)(src[i + 3] << lastDecodedBits) : 0);
and then splitting them back up again seems like extra work. I'll see if the array implementation can be optimized by refactoring the calling overload.
@jbhensley sent you a PR with micro-optimizations, Array is now way faster - nicely done!
I'm not sure if the decode method with ints is used at all (it is public, so I assumed it was) - the jump mechanism ignores it completely for the byte array overload. Maybe leave it as is and look at the byte array overload independently.
Nice job. With some optimizations on Array
I get the following results:
Method | Mean | Error | StdDev | Scaled | Allocated |
-------- |---------:|----------:|----------:|-------:|----------:|
Jump | 605.8 ns | 4.494 ns | 3.752 ns | 0.68 | 0 B |
OrigOpt | 739.6 ns | 10.508 ns | 9.830 ns | 0.84 | 0 B |
Orig | 884.4 ns | 12.059 ns | 10.070 ns | 1.00 | 0 B |
Array | 562.3 ns | 4.022 ns | 3.565 ns | 0.64 | 0 B |
Nice. I'll take a look at your changes.
Correction, even better :-)
Method | Mean | Error | StdDev | Scaled | ScaledSD | Allocated |
-------- |---------:|----------:|----------:|-------:|---------:|----------:|
Jump | 643.3 ns | 12.645 ns | 12.986 ns | 0.72 | 0.02 | 0 B |
OrigOpt | 739.5 ns | 15.968 ns | 15.683 ns | 0.83 | 0.02 | 0 B |
Orig | 889.0 ns | 8.995 ns | 8.414 ns | 1.00 | 0.00 | 0 B |
Array | 498.6 ns | 3.865 ns | 3.615 ns | 0.56 | 0.01 | 0 B |
My changes are in the my fork. Please pull my latest (if you want), then I can start working on yours directly.
Nice work. Great to see the results here!
Re this code:
int value = s_decodingArray[arrayIndex, workingByte];
// if the value is positive then we have a pointer into the encoding table
if (value > -1)
{
var entry = s_encodingTable[value];
if (entry.bitLength > validBits)
break; // we only found a value by incorporating bits beyond the valid remaining length of the data stream
decodedBits = s_encodingTable[value].bitLength;
return value; // the index is also the value
}
You don't need to use s_encodingTable here at all. Just put the bitLength into the decoding table and you can just do one lookup on it.
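In other words, something like the following hypothetical packing; the layout and names are illustrative, not the actual PR code:

```csharp
static class PackedEntry
{
    // Pack the decoded value and its bit length into a single table entry so
    // the decode loop needs only one array lookup. Assumed layout: bit length
    // in bits 16..23, decoded value in bits 0..15, and -1 still marks "no match".
    public static int Pack(int value, int bitLength) => (bitLength << 16) | value;

    public static bool TryUnpack(int entry, int validBits, out int value, out int decodedBits)
    {
        value = -1;
        decodedBits = 0;
        if (entry < 0) return false;             // no symbol at this index
        int bitLength = entry >> 16;
        if (bitLength > validBits) return false; // match ran past the end of the stream
        value = entry & 0xFFFF;
        decodedBits = bitLength;
        return true;
    }
}
```

This removes the dependent load through s_encodingTable from the hot path.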
Good point. I'll make that change.
I still want to take a crack at integrating public static int Decode(byte[] src, int offset, int count, byte[] dst)
and public static int Decode(uint data, int validBits, out int decodedBits)
or at least rewiring the former to take advantage of how the latter is now processing.
Re the construction of next
on every loop iteration: I agree, there's probably a win here. See my comment above for suggestions: https://github.com/dotnet/corefx/issues/31751#issuecomment-417535191
@grant-d I think I bungled the merge from your fork, since I had pulled @jcdickinson's changes earlier and that created conflicts with yours. Accidentally pushed to you while trying to resolve. I assume you can just roll that back. Sorry.
We are currently implementing Huffman decoding by looping over a table with a list of encoded values for each length. (See Huffman.cs)
This doesn't seem particularly efficient.
Instead, we should implement Huffman decoding by doing table lookups, using keys of 8 bits or something along those lines.
Note that Huffman encoding from servers seems to be pretty common, so this may have real-world perf impact. (Or maybe not... as always, measuring is the best way to know.)