lwus opened this issue 2 years ago
The speedup is nice, but I wouldn't want to introduce any additional unsafe code to get it. I wonder what the speedup would be like using just `bytemuck::pod_align_to_mut` and not changing the allocation size. That would allow applying the faster algorithm to the inner portion of a longer run (or to the whole thing, if the allocation happens to be aligned) while maintaining the current API.
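The head/middle/tail split that `pod_align_to_mut` performs can be sketched with safe std APIs alone. This is an illustrative stand-in, not the crate's code: `sum_with_aligned_middle` and the summing "work" are hypothetical, and `align_offset` is used just to locate the word boundary that `pod_align_to_mut` would find.

```rust
// Sketch of the head/mid/tail split behind `bytemuck::pod_align_to_mut`,
// using only safe std. The aligned middle is processed in 4-byte words;
// a real decoder would run its word-at-a-time loop there instead.
fn sum_with_aligned_middle(buf: &[u8]) -> u64 {
    // Distance from buf's start to the next 4-byte boundary
    // (capped at the buffer length for very short inputs).
    let head_len = buf.as_ptr().align_offset(4).min(buf.len());
    let (head, rest) = buf.split_at(head_len);
    let mid_len = rest.len() / 4 * 4;
    let (mid, tail) = rest.split_at(mid_len);

    let mut total: u64 = head.iter().map(|&b| b as u64).sum();
    // `mid` starts on a 4-byte boundary, so a word-based algorithm applies.
    for chunk in mid.chunks_exact(4) {
        let w = u32::from_ne_bytes(chunk.try_into().unwrap());
        // Byte order doesn't matter for a sum of the word's bytes.
        total += w.to_ne_bytes().iter().map(|&b| b as u64).sum::<u64>();
    }
    total += tail.iter().map(|&b| b as u64).sum::<u64>();
    total
}

fn main() {
    let data: Vec<u8> = (1..=10).collect();
    println!("{}", sum_with_aligned_middle(&data)); // 1 + 2 + ... + 10
}
```

The result is independent of where the split lands, which is also why `pod_align_to_mut` is safe to call on any slice: the head and tail simply fall back to byte-wise handling.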
I removed the UB `from_raw_parts` in 13823085bc83a50a16f20d9050311ac6788dfa22. Bytemuck seems to just defer to `unsafe { vals.align_to_mut... }` anyway, so I'm not sure there's much advantage (since we're just going between u8 and u32).
My machine shows no practical difference in the speedup. I imagine that almost all allocation schemes ultimately hand out memory aligned to at least word size, but maybe the Rust runtime does something more specific...
The reason I'd prefer to use bytemuck is so that there's one place proving the safety of the cast. I would like to eventually make this crate `forbid(unsafe_code)` by pulling the `&mut str` helper out into another crate.
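For context, the attribute referred to above is a crate-level lint setting; a minimal sketch of what it would look like at the crate root (the function is a placeholder):

```rust
#![forbid(unsafe_code)]
// With the attribute above at the crate root, any `unsafe` block or fn
// anywhere in the crate is a hard compile error, and unlike `deny` it
// cannot be re-allowed further down with `#[allow(unsafe_code)]`.

fn main() {
    println!("no unsafe allowed");
}
```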
Sounds good, I made the change. Any thoughts on the API for exposing this?
@Nemo157 any thoughts?
I also removed the manual unrolling; that seemed to improve perf as well, and it's simpler code-wise anyway.
My thought was that, by not relying on the alignment of the allocation, this algorithm could replace the existing single-byte algorithm and give speedups in all cases.
One additional requirement for that would be adding tests with a misaligned buffer to ensure that case does not get broken.
I think the current API of allowing arbitrary slices as output buffers makes that tricky. Specifically, for short buffers it's significantly faster to decode everything in word-sized chunks, but we would need to switch to byte-sized chunks for the slop. We could add some extra checks and do a special-case dispatch there, but that also sounds awkward.
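The word/byte dispatch described above can be sketched with `chunks_exact` and its `remainder()`. This is an illustrative stand-in (the summing "work" and the function name are hypothetical), showing only the control-flow shape of a word-sized fast path plus a byte-sized slop path:

```rust
// Sketch of the word/byte split: run 4 bytes at a time where possible,
// then fall back to a byte loop for the trailing slop.
fn process(buf: &[u8]) -> u64 {
    let mut total = 0u64;
    let mut words = buf.chunks_exact(4);
    for chunk in words.by_ref() {
        // Word-sized fast path: one u32 load per 4 input bytes.
        let w = u32::from_le_bytes(chunk.try_into().unwrap());
        total = total.wrapping_add(w as u64);
    }
    for &b in words.remainder() {
        // Byte-sized path for the 0..=3 leftover bytes.
        total = total.wrapping_add(b as u64);
    }
    total
}

fn main() {
    // 7 bytes: one 4-byte word (little-endian value 1) + 3 slop bytes.
    let buf = [1u8, 0, 0, 0, 5, 6, 7];
    println!("{}", process(&buf));
}
```

The awkwardness the comment points at is visible even here: the two loops duplicate the per-element logic, and for a decoder the word and byte variants would differ by more than a cast.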
I haven't looked at the code, so this may be a dumb comment, but alignment could be solved by copying data to and from a properly aligned buffer on the stack. Hopefully no one uses base58 for large inputs, so a fixed-size stack buffer should handle all sane cases.
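That stack-buffer idea could look roughly like the following hypothetical sketch. The 128-byte cap, the `Aligned` wrapper, and the byte-doubling stand-in "work" are all illustrative, not the crate's actual API:

```rust
// Copy the caller's (possibly misaligned) bytes into a word-aligned
// fixed-size stack buffer, run the word-based pass there, copy back.
#[repr(align(4))]
struct Aligned([u8; 128]);

fn with_aligned_scratch(out: &mut [u8]) -> Result<(), ()> {
    if out.len() > 128 {
        return Err(()); // real code would fall back to the byte-wise path
    }
    let mut scratch = Aligned([0u8; 128]);
    scratch.0[..out.len()].copy_from_slice(out);
    // scratch.0 starts on a 4-byte boundary, so a u32-based algorithm
    // could run here with no head/tail fixups. Stand-in work:
    for b in scratch.0[..out.len()].iter_mut() {
        *b = b.wrapping_mul(2);
    }
    out.copy_from_slice(&scratch.0[..out.len()]);
    Ok(())
}

fn main() {
    let mut buf = [1u8, 2, 3];
    with_aligned_scratch(&mut buf).unwrap();
    println!("{:?}", buf);
}
```

The trade-off is the two extra `memcpy`s and the hard size cap, which is why it only pays off if inputs are known to be small.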
I didn't really spend any time thinking about an API, so I'm looking for guidance on where the library wants to go with that. This is still an O(n^2) algorithm, but with a better constant factor (https://gmplib.org/manual/Radix-to-Binary). We could potentially implement a better algorithm for very large numbers, but that's overkill for my use case. Benchmark on my M1 laptop is roughly
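For reference, the classic O(n^2) radix-to-binary scheme the GMP manual describes is a digit-by-digit multiply-accumulate; a minimal byte-limb sketch (assuming digit values 0..58 rather than ASCII, with illustrative names) looks like:

```rust
// O(n^2) radix-to-binary: for each base-58 digit, multiply the
// accumulated value by 58 and add the digit, carrying across limbs.
// Here the limbs are single bytes; the word-based variant in the PR
// uses u32 limbs for a better constant factor.
fn base58_digits_to_bytes(digits: &[u8]) -> Vec<u8> {
    let mut out: Vec<u8> = vec![]; // little-endian limbs of the result
    for &d in digits {
        let mut carry = d as u32;
        for byte in out.iter_mut() {
            carry += *byte as u32 * 58;
            *byte = carry as u8;
            carry >>= 8;
        }
        while carry > 0 {
            out.push(carry as u8);
            carry >>= 8;
        }
    }
    out.reverse(); // big-endian byte order
    out
}

fn main() {
    // Digits [2, 1] encode 2 * 58 + 1 = 117.
    println!("{:?}", base58_digits_to_bytes(&[2, 1]));
}
```

The inner loop touches every limb produced so far for every input digit, which is where the quadratic cost comes from; widening the limbs from u8 to u32 cuts the iteration count by roughly 4x without changing the asymptotics.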