crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0

Refactor base64 encoding #14611

Open BlobCodes opened 1 month ago

BlobCodes commented 1 month ago

This PR rewrites the entire base64 encoding logic. While doing so, it adds methods to encode (normal, urlsafe and strict) base64 from one IO into another. Also, `urlsafe_encode(data, io)` now receives an optional `padding = true` parameter to reach feature parity between all ways to encode base64 (buffer to buffer, buffer to IO, IO to IO).
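For illustration, here are the existing call shapes next to the ones described above (the PR-specific signatures are sketched from this description, not copied from the diff):

```crystal
require "base64"

data = "Hello, world!"

Base64.encode(data)                          # buffer to buffer, line-wrapped output
Base64.strict_encode(data)                   # buffer to buffer, no newlines: "SGVsbG8sIHdvcmxkIQ=="
Base64.urlsafe_encode(data, padding: false)  # buffer to buffer, URL-safe alphabet

# Sketched per this PR's description (signatures are assumptions):
#   Base64.encode(data, io)                          # buffer to IO
#   Base64.urlsafe_encode(data, io, padding: false)  # buffer to IO, padding now optional
#   Base64.encode(input_io, output_io)               # IO to IO (new)
```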

Encoding from buffer to buffer only saw small performance improvements, but everything related to IOs saw very significant performance improvements.

The specs from #14604 have been copied over to this PR, although more should probably be added.

<details>
<summary>Benchmark Code</summary>

For this benchmark, I copied the `base64.cr` file from this repo into `base64blob.cr` and renamed the module inside to `Base64Blob`. This way, both variants can be used in parallel.

### Direct Comparison

```cr
require "benchmark"
require "base64"
require "./base64blob"

asma = "a" * 10
amed = "a" * 1000
abig = "a" * 1_000_000

dev_zero = File.new("/dev/zero", "w+")

macro bench(name, input, *output)
  puts "\nBenchmark: #{ {{name}} }"
  Benchmark.ips do |x|
    x.report("old") { Base64.encode({{input}}, {{output.splat}}) }
    x.report("new") { Base64Blob.encode({{input}}, {{output.splat}}) }
  end
end

bench("10 byte string to string", asma)
bench("1000 byte string to string", amed)
bench("1_000_000 byte string to string", abig)
bench("10 byte string to IO", asma, dev_zero)
bench("1000 byte string to IO", amed, dev_zero)
bench("1_000_000 byte string to IO", abig, dev_zero)

# For comparison with #14604
#bench("10 byte IO to IO", IO::Sized.new(dev_zero, 10), dev_zero)
#bench("1000 bytes IO to IO", IO::Sized.new(dev_zero, 1000), dev_zero)
#bench("1_000_000 bytes IO to IO", IO::Sized.new(dev_zero, 1_000_000), dev_zero)
```

### Throughput

```cr
require "benchmark"
require "./base64blob.cr"

input = File.new("/dev/zero", "r")
output = File.new("/dev/zero", "w")
input_size = 2**30 # 1GiB

Benchmark.ips do |x|
  x.report("throughput (1GiB)") { Base64Blob.strict_encode(IO::Sized.new(input, input_size), output) }
end
```

</details>
<details>
<summary>My Benchmark Results (Fedora 40, Ryzen 3600)</summary>

### Direct comparisons

```yaml
Benchmark: 10 byte string to string
old  34.96M ( 28.61ns) (± 0.34%)  32.0B/op  1.11× slower
new  38.97M ( 25.66ns) (± 0.33%)  32.0B/op       fastest

Benchmark: 1000 byte string to string
old   1.14M (874.90ns) (± 1.20%)  2.0kB/op  1.38× slower
new   1.58M (632.77ns) (± 0.48%)  2.0kB/op       fastest

Benchmark: 1_000_000 byte string to string
old   1.43k (697.02µs) (± 0.59%)  1.3MB/op  1.52× slower
new   2.19k (457.18µs) (± 1.19%)  1.3MB/op       fastest

Benchmark: 10 byte string to IO
old   2.92M (342.82ns) (± 1.04%)  0.0B/op  1.14× slower
new   3.32M (301.40ns) (± 0.51%)  0.0B/op       fastest

Benchmark: 1000 byte string to IO
old  215.74k (  4.64µs) (± 0.73%)  0.0B/op  6.21× slower
new    1.34M (746.07ns) (± 0.64%)  0.0B/op       fastest

Benchmark: 1_000_000 byte string to IO
old  229.40 (  4.36ms) (± 0.98%)  0.0B/op  9.29× slower
new    2.13k (469.31µs) (± 0.97%)  0.0B/op       fastest

Benchmark: 10 byte IO to IO
old   2.30M (435.64ns) (± 0.23%)  96.0B/op  8.59× slower
new  19.72M ( 50.70ns) (± 0.99%)  96.0B/op       fastest

Benchmark: 1000 bytes IO to IO
old  211.53k (  4.73µs) (± 1.49%)  96.0B/op  8.10× slower
new    1.71M (583.84ns) (± 0.49%)  96.0B/op       fastest

Benchmark: 1_000_000 bytes IO to IO
old  231.63 (  4.32ms) (± 0.59%)  93.0B/op  8.17× slower
new    1.89k (528.53µs) (± 0.57%)  96.0B/op       fastest
```

> **NOTE: The Buffer-to-IO encoding methods call `#flush` on the output IO, which explains why encoding 10 bytes to IO takes `> 300ns`. Without the explicit flush, this number goes down to `18.63ns`.**

> **NOTE: The IO-to-IO encoding methods were compared against #14604.**

### Throughput

Reading from and writing to `/dev/zero`, the code needed `490.62ms` per `1 GiB` (-> 2.04 GiB / 2.19 GB per second) in strict mode and `565.33ms` using the default `#encode` (-> 1.77 GiB / 1.90 GB per second).

</details>
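As a quick sanity check on the throughput arithmetic above (1 GiB read in 490.62 ms, strict mode):

```crystal
gib = 2.0 ** 30          # bytes in one GiB
secs = 0.49062           # measured time to encode 1 GiB in strict mode

puts gib / secs / gib    # GiB per second, ~2.04
puts gib / secs / 1e9    # GB per second, ~2.19
```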

Closes https://github.com/crystal-lang/crystal/pull/14604

straight-shoota commented 1 month ago

So just to clarify, this replaces https://github.com/crystal-lang/crystal/pull/14604 and supplements #13574 which covers the decoding part. Right?

kostya commented 1 month ago

Looks good. Checked with https://github.com/kostya/crystal-metric:

1.12.1:

```
Base64Encode: ok in 0.930s, award 1.0
Base64Decode: ok in 0.942s, award 1.0
1.8721s, 2.00/2, 1.00
```

this PR:

```
Base64Encode: ok in 0.918s, award 1.0
Base64Decode: ok in 0.937s, award 1.0
1.8552s, 2.00/2, 1.00
```

BlobCodes commented 1 month ago

> So just to clarify, this replaces #14604 and supplements #13574 which covers the decoding part. Right?

Yes, although it is completely independent of #13574

BlobCodes commented 1 month ago

As a note to anyone reviewing this:

In the bulk encoding part (`#encode_base64_full_pairs_internal`), `memcpy` is used to copy each three-byte group of input data into a local variable:

https://github.com/crystal-lang/crystal/blob/b662f346c90c4d54d78317d249f48e12d7fcec32/src/base64.cr#L297

This was adapted from the previous encoding method: https://github.com/crystal-lang/crystal/blob/f9ffa357069e435b8fb020ade9aff11f4be4d667/src/base64.cr#L236

The old method used an unaligned U32 access through Pointer (which reports natural alignment to LLVM). This could have caused a crash on strict-alignment architectures like RISC-V and older ARM versions, where unaligned accesses are forbidden.

This can also be seen in the generated IR:

```llvm
body:                                             ; preds = %while
  %122 = load ptr, ptr %cstr, align 8, !dbg !17334
  %123 = load i32, ptr %122, align 4, !dbg !17334 ; <- Notice the align 4
  %124 = call i32 @"*UInt32#byte_swap:UInt32"(i32 %123), !dbg !17335
```
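The difference between the two access patterns can be sketched as follows (a hypothetical illustration, not the PR's exact code):

```crystal
bytes = Bytes[0x12, 0x34, 0x56, 0x78, 0x9A]

# Old pattern: reinterpret a byte pointer as UInt32* and dereference.
# LLVM assumes natural 4-byte alignment for this load, which is wrong
# for odd offsets and can trap on strict-alignment targets.
unaligned = (bytes.to_unsafe + 1).as(UInt32*).value

# New pattern: memcpy the bytes into a local; the compiler knows the
# source may be unaligned and emits byte-safe loads.
safe = uninitialized UInt32
pointerof(safe).as(UInt8*).copy_from(bytes.to_unsafe + 1, 4)
```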

The new variant should now always work correctly, but may be less performant than reading only 3 bytes on strict-alignment architectures. Hopefully we'll find an even more efficient method for the bulk encoding work once Crystal gets SIMD support.
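For context, the bulk work maps each 3-byte group to four 6-bit indices into the base64 alphabet; a minimal standalone sketch of that mapping (not the PR's implementation):

```crystal
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

# Pack three bytes into a 24-bit value, then slice it into four
# 6-bit indices into the base64 alphabet.
def encode_triple(a : UInt8, b : UInt8, c : UInt8) : String
  n = (a.to_u32 << 16) | (b.to_u32 << 8) | c.to_u32
  String.build(4) do |io|
    io << ALPHABET[(n >> 18) & 0x3F]
    io << ALPHABET[(n >> 12) & 0x3F]
    io << ALPHABET[(n >> 6) & 0x3F]
    io << ALPHABET[n & 0x3F]
  end
end

encode_triple('M'.ord.to_u8, 'a'.ord.to_u8, 'n'.ord.to_u8) # => "TWFu"
```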

jgaskins commented 1 month ago

This looks great! Thanks for improving on my PR!