aklomp / base64

Fast Base64 stream encoder/decoder in C99, with SIMD acceleration
BSD 2-Clause "Simplified" License
865 stars 162 forks source link

Decoding data containing <NUL> values #114

Closed bruchar1 closed 1 year ago

bruchar1 commented 1 year ago

It seems that if I encode then decode data containing <NUL> characters, the data is correctly encoded and decoded, but base64_decode returns 0 instead of 1. Is that the expected behavior?

aklomp commented 1 year ago

The return value of base64_decode should be independent of the data being processed (obviously). If the data is decoded successfully, then base64_decode should always return 1. If what you state is correct, then this is a pretty big bug.

~What susprises me is that roundtrip encoding and decoding of 0x00 is part of the test suite.~ Edit: that part of the test suite does not include the null terminator in its test.

I also cannot reproduce this bug with the following command line:

echo -ne 'hello\0world' | bin/base64 | bin/base64 -d >/dev/null; echo $?

This prints 0 (success) as it should.

So either this is an obscure bug hiding in one of the optimized SIMD decoders, or something else is going on.

Which codec are you using? Does the test suite run correctly on your machine?

Can you post any code to reproduce the issue?

aklomp commented 1 year ago

I added a very quick and dirty test to do roundtrip encoding on data with \0x00 in it. All tests in CI pass.

bruchar1 commented 1 year ago

I figured it out. My allocated buffer for encoding was not big enough, and it caused the last padding = to be missing.

I guess the readme is missleading:

The buffer in out has been allocated by the caller and is at least 4/3 the size of the input.

The right buffer size is probably (n + 2) / 3 * 4 in integer arithmetic.

BurningEnlightenment commented 1 year ago

TBH I think it would make sense to add two macros or inline functions to do buffer size calculations. As an added bonus it'd be way more readable than to repeat the size calculation everywhere.

it caused the last padding = to be missing.

While we are at it, I think a flag to encode/decode without padding would make sense, too. See also RFC4648 section 3.2.

aklomp commented 1 year ago

Agreed that it makes sense for the library to expose a macro to do the buffer size calculation.

I've been hesitant to add one (or to publish the "ultimate buffer sizing formula" in the sense of the post above) because I'm not positive that all encoders will always stay within the bounds of such a buffer. The SIMD encoders write 16 bytes of output at a time.

Then again, those SIMD encoders always operate on a full, guaranteed 12 bytes of input, so their output should always fit into the output buffer if it's sized according to something like the formula above.

Regarding the suggestion of adding a flag to disable padding: the suggestion makes sense, but I'm wary of adding more behavioral flags at this point. I'd first like to do a comprehensive rewrite to sort out various pain points, and rethink the architecture of this library first.

aklomp commented 1 year ago

I created #116 to discuss the addition of size macros.