Nullus157 / bs58-rs

Another Rust Base58 codec implementation
Apache License 2.0

RFC Decode/Encode with multiple bytes per chunk #84

Open lwus opened 2 years ago

lwus commented 2 years ago

I didn't really spend any time thinking about an API, so I'm looking for guidance on where the library wants to go with that. This is still an O(n^2) algorithm, but with a better constant factor (see https://gmplib.org/manual/Radix-to-Binary). We could potentially implement a better algorithm for very large numbers, but that's overkill for my use case.
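For concreteness, here's a rough standalone sketch of the chunked idea (my own illustrative code, not the PR's): fold up to 5 base58 digits into one value (58^5 < 2^32) and do one big-number multiply-add per chunk instead of one per digit, as the GMP notes describe.

```rust
// Illustrative sketch of multi-digit chunking (not the PR's actual code):
// fold up to 5 base58 digits into one value (58^5 = 656_356_768 < 2^32) and
// do a single big-number multiply-add per chunk instead of one per digit.

const ALPHABET: &[u8; 58] = b"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

fn digit_value(c: u8) -> Option<u64> {
    // A real implementation would use a 256-entry lookup table instead.
    ALPHABET.iter().position(|&a| a == c).map(|i| i as u64)
}

/// Decode base58 `input` into big-endian bytes, 5 digits per big-number pass.
/// Still O(n^2) overall, but roughly 5x fewer passes over the limbs.
fn decode_chunked(input: &[u8]) -> Option<Vec<u8>> {
    let mut limbs: Vec<u32> = Vec::new(); // little-endian u32 limbs
    for chunk in input.chunks(5) {
        // Fold the chunk's digits into one value and compute 58^chunk_len.
        let mut chunk_val: u64 = 0;
        let mut base: u64 = 1;
        for &c in chunk {
            chunk_val = chunk_val * 58 + digit_value(c)?;
            base *= 58;
        }
        // limbs = limbs * base + chunk_val, carried through the limbs.
        let mut carry = chunk_val;
        for limb in limbs.iter_mut() {
            let v = (*limb as u64) * base + carry;
            *limb = v as u32;
            carry = v >> 32;
        }
        while carry > 0 {
            limbs.push(carry as u32);
            carry >>= 32;
        }
    }
    // Serialize limbs little-endian, drop high zero bytes, then restore the
    // leading zero bytes that base58 encodes as leading '1' digits.
    let zeros = input.iter().take_while(|&&c| c == b'1').count();
    let mut out: Vec<u8> = limbs.iter().flat_map(|l| l.to_le_bytes()).collect();
    while out.last() == Some(&0) {
        out.pop();
    }
    out.extend(std::iter::repeat(0).take(zeros));
    out.reverse();
    Some(out)
}
```

The chunk width of 5 is the largest that keeps 58^k below 2^32, so the per-chunk multiplier always fits in a single u32 limb operation.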

Benchmarks on my M1 laptop are roughly:

  32_bytes/decode_bs58          time:   [414.08 ns 415.13 ns 416.33 ns]
  32_bytes/decode_bs58_unsafe   time:   [99.003 ns 99.456 ns 99.953 ns]
  256_bytes/decode_bs58         time:   [26.940 us 27.022 us 27.118 us]
  256_bytes/decode_bs58_unsafe  time:   [1.3279 us 1.3282 us 1.3286 us]
...
  256_bytes/encode_bs58_vec         time:   [86.649 us 86.698 us 86.763 us]
  256_bytes/encode_bs58_vec_unsafe  time:   [3.5615 us 3.5681 us 3.5762 us]
  32_bytes/encode_bs58_vec          time:   [1.1578 us 1.1612 us 1.1649 us]
  32_bytes/encode_bs58_vec_unsafe   time:   [236.90 ns 237.01 ns 237.13 ns]

Nemo157 commented 2 years ago

The speedup is nice, but I wouldn't want to introduce any additional unsafe code to get it. I wonder what the speedup would be like using just bytemuck::pod_align_to_mut and not changing the allocation size. That would allow applying the faster algorithm for the inner portion of a longer run (or the whole thing if the allocation happens to be aligned) while maintaining the current API.
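For reference, this is the shape of the split being described (a std-only sketch of mine; bytemuck::pod_align_to_mut would return the middle as &mut [u32] directly, here it stays &mut [u8] so the example runs without external crates or unsafe):

```rust
// Std-only sketch of the head/middle/tail split that
// bytemuck::pod_align_to_mut (or the unsafe slice::align_to_mut) performs:
// the aligned middle can be fed to a word-at-a-time routine while the
// unaligned edges stay byte-wise.

fn split_for_words(buf: &mut [u8]) -> (&mut [u8], &mut [u8], &mut [u8]) {
    let align = std::mem::align_of::<u32>();
    // Bytes until the first 4-byte boundary (capped at the buffer length).
    let head_len = buf.as_ptr().align_offset(align).min(buf.len());
    let (head, rest) = buf.split_at_mut(head_len);
    // Middle is the largest whole-word region; tail is the leftover slop.
    let mid_len = rest.len() - rest.len() % align;
    let (mid, tail) = rest.split_at_mut(mid_len);
    (head, mid, tail)
}
```

If the allocation happens to be aligned, the head is empty and the whole buffer (minus at most 3 tail bytes) gets the fast path, which matches the "whole thing if the allocation happens to be aligned" case above.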

lwus commented 2 years ago

I removed the UB from_raw_parts in 13823085bc83a50a16f20d9050311ac6788dfa22. bytemuck seems to just defer to unsafe { vals.align_to_mut... } anyway, so I'm not sure there's much advantage (since we're just going between u8 and u32).

My machine shows no practical difference in the speedup. I imagine that almost all allocation schemes ultimately hand out memory aligned at least to word size, but maybe the Rust runtime does something more specific...

Nemo157 commented 2 years ago

The reason I'd prefer to use bytemuck is so that there's one place proving the safety of the cast; I would like to eventually make this crate forbid(unsafe_code) by pulling the &mut str helper out into another crate.

lwus commented 2 years ago

Sounds good, I made the change. Any thoughts on the API for exposing this?

lwus commented 2 years ago

@Nemo157 any thoughts?

Also removed the manual unrolling; that seemed to improve perf and is simpler codewise anyway.

Nemo157 commented 2 years ago

My thought was that, by not relying on the alignment of the allocation, this algorithm could replace the existing single-byte algorithm and give speedups in all cases.

One additional requirement for doing that would be adding tests with a misaligned buffer to ensure this doesn't get broken.
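Such a test could deliberately construct a misaligned output slice; one possible sketch (helper name is mine, not the crate's):

```rust
// Sketch of the kind of misaligned-buffer test being asked for:
// over-allocate, find the next 4-byte boundary, and slice one byte past it,
// so the resulting &mut [u8] is guaranteed NOT to be word-aligned regardless
// of what the allocator returned.

fn misaligned_slice(storage: &mut Vec<u8>, len: usize) -> &mut [u8] {
    storage.clear();
    storage.resize(len + 4, 0);
    // Offset to the next 4-byte boundary, then step one past it.
    let start = storage.as_ptr().align_offset(4) + 1;
    &mut storage[start..start + len]
}
```

Running the decode/encode round-trip tests against such a slice would exercise the byte-wise edge handling that an aligned Vec never hits.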

lwus commented 2 years ago

I think the current API of allowing arbitrary slices as output buffers makes that tricky. Specifically, it's significantly faster for short buffers to decode everything in word-sized chunks, but we would need to switch to byte-sized chunks for the slop. We could add some extra checks and do a special-case dispatch in that case, but that also sounds awkward.

mina86 commented 2 years ago

I haven’t looked at the code, so this may be a dumb comment, but alignment could be solved by copying data to and from a properly-aligned buffer on the stack. Hopefully no one uses base58 for large inputs, so a fixed-size stack buffer should handle all sane cases.
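One possible shape of that suggestion (a hypothetical helper of mine, not proposed API; the size cap is made up): run the word-sized algorithm on an always-aligned stack scratch array and copy the bytes out afterwards.

```rust
// Hypothetical sketch of the stack-bounce idea (names and sizes made up):
// run a word-sized routine on an always-aligned [u32; N] on the stack, then
// copy the result into the caller's possibly-misaligned byte slice.

const SCRATCH_LIMBS: usize = 64; // 256 output bytes; plenty for sane base58 use

fn with_aligned_scratch<R>(out: &mut [u8], f: impl FnOnce(&mut [u32]) -> R) -> R {
    assert!(out.len() <= SCRATCH_LIMBS * 4, "output too large for stack scratch");
    let mut scratch = [0u32; SCRATCH_LIMBS]; // [u32; N] is always u32-aligned
    let limbs = (out.len() + 3) / 4;
    let result = f(&mut scratch[..limbs]); // word-sized algorithm runs here
    // Copy back little-endian, truncating the final partial word.
    for (dst, src) in out.chunks_mut(4).zip(&scratch[..limbs]) {
        dst.copy_from_slice(&src.to_le_bytes()[..dst.len()]);
    }
    result
}
```

The trade-off is one extra memcpy per call, in exchange for never having to special-case alignment inside the hot loop.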