Hi,
Generally it's possible, but currently more space is allocated than needed to allow more aggressive optimizations. That means the slice size would not reflect the actual length of the uncompressed output. The de/compress methods, which `push` into the `Vec`, would need to be changed to copy at the current output position instead.
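For illustration, a minimal std-only sketch of the over-allocate-then-shrink pattern being described; the "compression" here is a stand-in copy, not real LZ4, and the `input.len() / 255 + 16` term is the usual LZ4 worst-case bound:

```rust
// Sketch only: reserve the worst case, track how much is actually written,
// then shrink so len() reflects the real output size.
fn compress_like(input: &[u8]) -> Vec<u8> {
    let worst_case = input.len() + input.len() / 255 + 16; // LZ4 worst-case bound
    let mut out = vec![0u8; worst_case];
    // A real compressor would write `written <= worst_case` bytes here.
    let written = input.len();
    out[..written].copy_from_slice(input);
    out.truncate(written); // shrink to the actual output size
    out
}
```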
This may be too specific to my use case.
However, while I agree that using the existing API is the optimal and ergonomic way of dealing with things in Rust, it turns out that making a single (over-)allocation in Python, doing the compression op, then resizing the Python bytearray to the actual size is quite a bit faster than going through two allocations: one in Rust, then another to create and copy that buffer into a Python object.
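A rough sketch of that single-allocation pattern with pyo3; treat the `PyByteArray` calls as assumptions about pyo3's API, and `fill_compressed` is a hypothetical stand-in for the slice-based compression call being requested here:

```rust
use pyo3::prelude::*;
use pyo3::types::PyByteArray;

// Assumed sketch: allocate the worst case directly as a Python bytearray,
// compress into its buffer, then resize it to the actual length.
fn compress_to_pyobj<'py>(py: Python<'py>, input: &[u8]) -> PyResult<&'py PyByteArray> {
    let worst_case = input.len() + input.len() / 255 + 16;
    let out = PyByteArray::new_with(py, worst_case, |_buf| Ok(()))?;
    // SAFETY: no other Python code touches `out` while we hold this slice.
    let written = unsafe { fill_compressed(out.as_bytes_mut(), input) };
    out.resize(written)?;
    Ok(out)
}

// Hypothetical stand-in for the real compression call.
unsafe fn fill_compressed(dst: &mut [u8], src: &[u8]) -> usize {
    dst[..src.len()].copy_from_slice(src);
    src.len()
}
```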
I understand if it's not worth it; it's already slightly faster than existing Python lz4 bindings, but having this ability would make it decisively faster. At least that's my experience when using snappy, because it has implemented the `Read` trait, giving me the option to pre-allocate and then resize the buffer (if there is a way to calculate the expected size), or read it into a regular `&mut Vec<u8>` if unknown (`read` vs `read_to_end`); but again, not sure it'd be worthwhile here for your project.
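For context, the two `Read`-based options mentioned above look roughly like this, using a generic `impl Read` decoder as a stand-in for rust-snappy's reader:

```rust
use std::io::{self, Read};

// Known size up front: pre-allocate exactly and fill it
// (read_exact is just a loop over `read`).
fn decode_known_size(mut decoder: impl Read, expected: usize) -> io::Result<Vec<u8>> {
    let mut out = vec![0u8; expected];
    decoder.read_exact(&mut out)?;
    Ok(out)
}

// Unknown size: let `read_to_end` grow the Vec as needed.
fn decode_unknown_size(mut decoder: impl Read) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    decoder.read_to_end(&mut out)?;
    Ok(out)
}
```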
I'd formally like to change my request to have `de/compress_into` take anything for output which implements `Write`. Looking briefly at the source code, I'm not sure how feasible such a request would be; but anyway, having the ability to pass `&mut [u8]` would still be helpful.
`Write` trait support would be part of the LZ4 frame format, which is currently missing; only the block format is implemented. From a performance point of view, the frame API would probably be a little slower, because of some additional overhead and allocations.
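To make the request concrete, a `Write`-based frame API might look something like the following; `FrameEncoder` is purely hypothetical here, since the frame format is not implemented yet:

```rust
use std::io::{self, Write};

// Hypothetical: an encoder wrapping any `Write` sink, compressing frames
// as data is written through it (mirroring how rust-snappy structures its API).
struct FrameEncoder<W: Write> {
    inner: W,
}

impl<W: Write> Write for FrameEncoder<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // A real implementation would compress `buf` into a frame first.
        self.inner.write(buf)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```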
Right, that makes sense. I suppose I could still very much make use of the original request here, and would adopt the framed variant when/if that comes about. Thanks for the clarification between the frame and block formats here. :+1:
I'd also like having this kind of API. I'd be totally okay with having to over-allocate the buffer, then having the function return how many bytes were actually written. As long as the padding/alignment requirements are documented – or, perhaps even better, a function for calculating the required buffer size is present – I don't think that's a problem.
My more concrete use case is placing the LZ4 outputs into a bump allocator. In that sense, generalizing `compress_into` to be

```rust
pub fn compress_into<A: std::alloc::Allocator>(input: &[u8], compressed: &mut Vec<u8, A>)
```

would work as well; however, with `allocator_api` currently still being nightly-only, that's not really a reasonable thing to request.
@athre0z thanks, good to know there would be multiple use cases. Internally that's already how it works: it over-allocates with `get_maximum_output_size`, then afterwards sets the actual size of the vector. I thought about it, and this API would make sense on the block format even before the frame format, since it could be utilized from the frame format to avoid double allocations.
On the mut_slice branch, `compress_into` now accepts a `Sink` as its output parameter. A `Sink` is a `&mut [u8]` slice with additional position information. Not sure if the API design will stay like this, or if just a `&mut [u8]` is fine.
You can give it a try if you like. It would also be interesting to know if this improves performance for you.
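Based on the snippets later in this thread, usage on that branch looks roughly like this; the `Sink` conversion and the returned byte count are inferred from the compressor code below, so treat the details as assumptions:

```rust
use lz4_flex::block::compress::get_maximum_output_size;

// Inferred sketch: build a Sink from a mutable slice, compress into it,
// and use the returned count to shrink the buffer to the real output size.
fn compress_via_sink(input: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8; get_maximum_output_size(input.len())];
    let mut sink: lz4_flex::block::Sink = (&mut out[..]).into();
    let written = lz4_flex::compress_into(input, &mut sink);
    out.truncate(written);
    out
}
```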
Thanks for implementing this! I can't really compare against previous performance, because having to copy the buffer was a reason for not trying this library before, but I can compare against the `lz4` crate which I used previously!
Compression:

- LZ4 flex safe compression: 1.32 GiB/s
- LZ4 flex unsafe compression: 1.32 GiB/s (8.3% exec time spent doing compression)
- `lz4` crate compression (level = 1): 1.00 GiB/s (33.8% exec time spent doing compression)

Decompression:

- LZ4 flex safe decompression: 348 MiB/s (70% exec time)
- LZ4 flex unsafe decompression: 455 MiB/s (61% exec time)
- `lz4` crate decompression (level = 1): 420 MiB/s (64% exec time)
These throughput rates include not only de/compression, but also processing in my application (a sort of message queue geared towards high throughput). I profiled and annotated the amount of CPU time spent within the respective LZ4 implementation. It's worth noting that this is an unfair comparison, because I had been running the `lz4` library in framed mode (without noticing it) and it spends quite a bit of execution time in `XXH32_update`, which I suspect is part of the frame header. Nevertheless, the results look very promising!
Do you plan on also reworking the decompression API in the same manner?
Yes, I updated the master branch to also include decompression with the sink. It's even around 10% faster for compressible data with `safe-decoding`, which makes sense, since `push` has additional overhead to check the capacity, which is not required here.
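A minimal illustration of the overhead being described (not lz4_flex's actual code): appending through `push` pays a capacity check per element, while writing into a pre-sized slice does not:

```rust
// Appending via push: Vec must check (and potentially grow) its capacity
// on every single byte.
fn copy_via_push(src: &[u8], dst: &mut Vec<u8>) {
    for &b in src {
        dst.push(b);
    }
}

// Writing into a pre-sized buffer: one bounds check, then a straight copy.
fn copy_via_slice(src: &[u8], dst: &mut [u8]) {
    dst[..src.len()].copy_from_slice(src);
}
```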
I think you should either use `checked-decode` or hash and check the hash (which is part of the frame format, which is not yet implemented); otherwise it could lead to memory corruption if the data is corrupted or manipulated. Memory corruption would not be a big issue if the program is sandboxed like in wasm, but I don't think that's the case for Python.
Nice, thanks! Yes -- since my application is networked, I already had `checked-decode` enabled. :) It seems to be working great in general, but I noticed that the `pos` field in the `Sink` object didn't get updated properly. My compressor looks as follows:
```rust
// Imports assumed from context: `byteorder` for the little-endian length
// prefix, and the nightly `allocator_api` for `Vec<u8, A>`. `Compressor`,
// `CompressResult`, and `CompressError` are application-local types.
use std::alloc::Allocator;
use std::convert::TryInto;

use byteorder::{ReadBytesExt, WriteBytesExt, LE};

struct LZ4Compressor(i32);

impl Compressor for LZ4Compressor {
    fn compress<A: Allocator + Copy>(&self, input: Vec<u8, A>) -> CompressResult<Vec<u8, A>> {
        // Write uncompressed payload size.
        let len: u32 = input
            .len()
            .try_into()
            .map_err(|_| CompressError::PayloadTooLarge(input.len()))?;
        let cap = lz4_flex::block::compress::get_maximum_output_size(input.len()) + 4;
        let mut out = Vec::with_capacity_in(cap, *input.allocator());
        out.write_u32::<LE>(len).unwrap();

        // Write compressed data after the 4-byte length prefix.
        unsafe { out.set_len(cap) };
        let actual = lz4_flex::compress_into(&input, &mut (&mut out[4..]).into());
        assert!(actual + 4 <= cap);
        unsafe { out.set_len(actual + 4) };

        Ok(out)
    }

    fn decompress<A: Allocator + Copy>(&self, input: Vec<u8, A>) -> CompressResult<Vec<u8, A>> {
        // Read uncompressed payload size.
        let mut read_slice = &input[..];
        let uncompressed_len = read_slice.read_u32::<LE>().unwrap();

        // Decompress into an exactly-sized buffer.
        let mut decompressed = Vec::with_capacity_in(uncompressed_len as _, *input.allocator());
        unsafe { decompressed.set_len(uncompressed_len as _) };
        let mut sink: lz4_flex::block::Sink = (&mut decompressed[..]).into();
        lz4_flex::decompress_into(read_slice, &mut sink).unwrap();
        assert_eq!(sink.pos(), uncompressed_len as _);

        Ok(decompressed)
    }
}
```
And the `assert!` in `decompress` fails because `pos()` is always 0. Perhaps I'm just misunderstanding the field.
Ah yes, the `pos` parameter is not currently updated in the decompression case. It's probably a very niche issue, where decompression succeeds but is smaller than `uncompressed_len`, but it could be covered by checking `pos`.
There is currently an issue I'm looking into where over-allocation is required in the decompression case, but it should not be. I think the logic needs to be changed a little in how the fast loop is exited and the last bytes are then copied. This was previously not an issue, because of the owned vec.
Just like with the compression case, I wouldn't mind having to over-allocate the decompression buffer in the presence of a function calculating the required size.
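One standard way to make that size knowable up front, and what the compressor code above already does with byteorder, is to store the uncompressed length next to the block. A std-only sketch:

```rust
use std::convert::TryInto;

// Prefix the compressed block with its uncompressed length.
fn prepend_len(uncompressed_len: u32, compressed: &[u8]) -> Vec<u8> {
    let mut out = uncompressed_len.to_le_bytes().to_vec();
    out.extend_from_slice(compressed);
    out
}

// Read the length back, yielding the exact decompression buffer size
// and the remaining compressed payload.
fn read_len(framed: &[u8]) -> (u32, &[u8]) {
    let (len_bytes, rest) = framed.split_at(4);
    (u32::from_le_bytes(len_bytes.try_into().unwrap()), rest)
}
```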
On a more general note, I think it would be best to keep `Sink` out of the public interface. I do understand the purpose, but feel that the public API would preferably abstract this implementation detail away from the user. I also wonder if `std::io::Cursor` could be used internally (it's pretty much the same thing as `Sink`, implementing `Write` on `Cursor<&mut [u8]>`). It might make unsafe writes more complicated, so if you had previously known about this and it simply isn't a good fit, feel very much free to disregard this comment. Sometimes, reusing existing things just isn't a good solution.
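To illustrate the suggestion: `Cursor<&mut [u8]>` already tracks a write position through std's `Write` impl:

```rust
use std::io::{Cursor, Write};

fn main() {
    let mut buf = [0u8; 16];
    let mut cursor = Cursor::new(&mut buf[..]);

    // Writing advances the cursor, much like Sink's `pos`.
    cursor.write_all(b"hello").unwrap();
    assert_eq!(cursor.position(), 5);
}
```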
```rust
#[test]
fn lz4_test() {
    let data = vec![4u8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
    // LZ4Compressor is a tuple struct, so it needs a level argument here.
    let compressor = LZ4Compressor(1);
    let encoded = compressor.compress(data.clone()).unwrap();
    let decoded = compressor.decompress(encoded).unwrap();
    assert_eq!(data, decoded);
}
```
Currently this fails with `OutputTooSmall { expected_size: 0, actual_size: 20 }`. I suppose that's one of the "There is currently an issue I'm looking into where overallocation is required in the decompression case" cases? It looks like the `expected_size` is wrong here.
Yes, that's the issue I meant. I agree `Sink` should be replaced. It's mainly there to have a smooth transition from the vec methods, and `Cursor` would have been the better choice. I will consider it in the refactoring.
Over-allocation for decompression is not compatible with the frame format, because it could overwrite parts of other blocks.
@athre0z This should work now, can you retest? I replaced the `Sink` API with `&mut [u8]` + pos.
Sorry for the late response! Thanks, everything seems to be working as expected now, and the interface is a lot less cumbersome. :) The `output_pos` argument is slightly redundant (because you can also just do `&mut output[output_pos..]`), but I also don't mind it.
Hi!
Thanks for taking the time to put this together, I think it's great! :rocket:
I'm in a situation where I tell some other lib what size slice I want (pre-allocating it in Python), and then I get a mutable reference to it. However, it seems all de/compress APIs in lz4_flex take or result in `Vec<u8>`. Is it worthwhile to add something like mentioned in the title?
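For what it's worth, the usage shape being asked for is roughly the following; `compress_into_slice` is a hypothetical placeholder (with a stand-in body so the sketch compiles), not an existing lz4_flex function:

```rust
// Hypothetical shape of the requested API: the caller owns the buffer
// (standing in for memory pre-allocated on the Python side), and the
// library reports how many bytes it wrote.
fn compress_into_slice(input: &[u8], output: &mut [u8]) -> usize {
    // Stand-in body; the real call would perform LZ4 block compression.
    output[..input.len()].copy_from_slice(input);
    input.len()
}

fn main() {
    let input = b"data handed over from Python";
    let mut preallocated = vec![0u8; 64]; // the externally provided slice
    let written = compress_into_slice(input, &mut preallocated[..]);
    assert_eq!(written, input.len());
}
```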