PSeitz / lz4_flex

Fastest pure Rust implementation of LZ4 compression/decompression.
MIT License

Implement `pub fn de/compress_into_slice(input: &[u8], output: &mut [u8]) -> usize` #11

Closed milesgranger closed 3 years ago

milesgranger commented 3 years ago

Hi!

Thanks for taking the time to put this together, I think it's great! :rocket:

I'm in a situation where I tell some other lib what size slice I want (pre-allocating it in Python), and then I get a mutable reference to it. However, it seems all de/compress APIs in lz4_flex take or return Vec<u8>.

Is it worth-while to add something like mentioned in the title?

PSeitz commented 3 years ago

Hi,

Generally it's possible, but currently more space is allocated than needed to allow more aggressive optimizations. That means the slice size would not reflect the actual length of the uncompressed output.

The de/compress methods, which currently push into the Vec, would need to be changed to copy at the current output position instead.
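To illustrate the change being discussed (a sketch, not lz4_flex's actual internals): appending to an owned Vec tracks the length implicitly, while writing into a caller-provided slice requires tracking an output position explicitly.

```rust
// Vec-based: the Vec tracks its own length.
fn emit_vec(out: &mut Vec<u8>, bytes: &[u8]) {
    out.extend_from_slice(bytes);
}

// Slice-based: the caller pre-allocates; we copy at `pos` and advance it.
fn emit_slice(out: &mut [u8], pos: &mut usize, bytes: &[u8]) {
    out[*pos..*pos + bytes.len()].copy_from_slice(bytes);
    *pos += bytes.len();
}

fn main() {
    let mut v = Vec::new();
    emit_vec(&mut v, b"abc");

    let mut buf = [0u8; 8]; // over-allocated by the caller
    let mut pos = 0;
    emit_slice(&mut buf, &mut pos, b"abc");

    // Both strategies produce the same bytes; only ownership differs.
    assert_eq!(&buf[..pos], v.as_slice());
}
```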

milesgranger commented 3 years ago

This may be too specific to my use case.

However, while I agree that using the existing API is the optimal and most ergonomic choice when staying within Rust, it turns out that making a single (over-)allocation in Python, doing the compression op, then resizing the Python bytearray to the actual size is quite a bit faster than going through two allocations: one in Rust, then another to create a Python object and copy the buffer into it.

I understand if it's not worth it; the crate is already slightly faster than the existing Python lz4 bindings. But having this ability would make it decisively faster. At least, that's my experience with snappy, which implements the Read trait: it gives me the option to pre-allocate and then resize the buffer (if there is a way to calculate the expected size), or to read into a regular &mut Vec<u8> if the size is unknown (read vs read_to_end). Again, not sure it'd be worthwhile for your project.

milesgranger commented 3 years ago

I'd formally like to change my request: have de/compress_into take anything for output that implements Write. Looking briefly at the source code, I'm not sure how feasible that would be; but in any case, the ability to pass &mut [u8] would still be helpful.

PSeitz commented 3 years ago

Write trait support would be part of the LZ4 frame format, which is missing currently; only the block format is implemented.

From a performance point of view, the frame API would probably be a little slower, because of some additional overhead and allocations.

milesgranger commented 3 years ago

Right, that makes sense. I suppose I could still very much make use of the original request here, and would make use of the framed format when/if that comes about. Thanks for the clarification on framed vs. block format here. :+1:

athre0z commented 3 years ago

I'd also like having this kind of API. I'd be totally okay with having to over-allocate the buffer, then having the function return how many bytes were actually written. As long as the padding/alignment requirements are documented – or, perhaps even better, a function for calculating the required buffer size is present – I don't think that's a problem.
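A function calculating the required buffer size would look something like the following sketch. The classic LZ4 worst-case output bound (LZ4_compressBound in the C reference implementation) is input_len + input_len/255 + 16; lz4_flex's get_maximum_output_size computes a bound of this shape, though the exact constants here are an assumption, not the crate's code.

```rust
// Worst-case compressed size for an LZ4 block: incompressible input can
// expand slightly, so the output buffer must be a bit larger than the input.
fn maximum_output_size(input_len: usize) -> usize {
    input_len + input_len / 255 + 16
}

fn main() {
    assert_eq!(maximum_output_size(0), 16);
    assert_eq!(maximum_output_size(1024), 1024 + 4 + 16);
    println!("bound for 1 MiB input: {}", maximum_output_size(1 << 20));
}
```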

My more concrete use case is placing the LZ4 outputs into a bump allocator. In that sense, generalizing compress_into to be

pub fn compress_into<A: std::alloc::Allocator>(input: &[u8], compressed: &mut Vec<u8, A>)

would work as well, however with allocator_api currently still being nightly-only, that's not really a reasonable thing to request.

PSeitz commented 3 years ago

@athre0z thanks, good to know there would be multiple use cases. Internally that's already how it works: it over-allocates with get_maximum_output_size, then afterwards sets the actual size of the vector. I thought about it, and this API would make sense on the block format even before the frame format, since the frame format could use it to avoid double allocations.
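The over-allocate-then-shrink pattern described here can be sketched as follows, with a stand-in compress function (it just copies) in place of the real codec:

```rust
// Stand-in for a real compressor: writes into the caller's buffer and
// returns the number of bytes actually written.
fn compress(input: &[u8], output: &mut [u8]) -> usize {
    output[..input.len()].copy_from_slice(input);
    input.len()
}

// LZ4-style worst-case bound (assumed shape, see LZ4_compressBound).
fn maximum_output_size(input_len: usize) -> usize {
    input_len + input_len / 255 + 16
}

fn main() {
    let input = b"hello world";
    // Allocate once, at the worst case.
    let mut out = vec![0u8; maximum_output_size(input.len())];
    let written = compress(input, &mut out);
    // Shrink to the real size; capacity stays, no reallocation or copy.
    out.truncate(written);
    assert_eq!(&out[..], &input[..]);
}
```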

PSeitz commented 3 years ago

On the mut_slice branch, compress_into now accepts a Sink as parameter. Sink is a &mut [u8] slice with additional position information. Not sure if the API design will stay like this or if just a &mut [u8] is fine. You can give it a try if you like. It would also be interesting to know if this improves performance for you.
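A Sink in the sense described here, i.e. a mutable slice plus a write position, might look roughly like this (a minimal sketch, not the branch's actual type):

```rust
// A minimal Sink-like type: a caller-provided buffer plus a write position.
struct Sink<'a> {
    output: &'a mut [u8],
    pos: usize,
}

impl<'a> Sink<'a> {
    // Copy bytes at the current position and advance it.
    fn push_slice(&mut self, bytes: &[u8]) {
        self.output[self.pos..self.pos + bytes.len()].copy_from_slice(bytes);
        self.pos += bytes.len();
    }

    fn pos(&self) -> usize {
        self.pos
    }
}

fn main() {
    let mut buf = [0u8; 8];
    let mut sink = Sink { output: &mut buf, pos: 0 };
    sink.push_slice(b"lz4");
    assert_eq!(sink.pos(), 3);
    assert_eq!(&sink.output[..3], b"lz4");
}
```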

athre0z commented 3 years ago

Thanks for implementing this! Can't really compare against previous performance because having to copy the buffer was a reason for not trying to use the library previously, but I can compare against the lz4 crate which I used previously!

Trivially compressible data

LZ4 flex safe compression: 1.32 GiB/s
LZ4 flex unsafe compression: 1.32 GiB/s (8.3% exec time spent doing compression)
lz4 crate compression (level = 1): 1.00 GiB/s (33.8% exec time spent doing compression)

Mostly incompressible data

LZ4 flex safe compression: 348 MiB/s (70% exec time)
LZ4 flex unsafe compression: 455 MiB/s (61% exec time)
lz4 crate compression (level = 1): 420 MiB/s (64% exec time)

These throughput rates include not only compression, but also processing in my application (a sort of message queue tuned for high throughput). I profiled and annotated the amount of CPU time spent within the respective LZ4 implementation. It's worth noting that this is an unfair comparison, because I had been running the lz4 library in framed mode (without noticing it), and it spends quite a bit of execution time in XXH32_update, which I suspect is part of the frame header. Nevertheless, the results look very promising!

Do you plan on also reworking the decompression API in the same manner?

PSeitz commented 3 years ago

Yes, I updated the master branch to also include decompression with the sink. It's even around 10% faster for compressible data with safe-decoding, which makes sense, since push has additional overhead to check the capacity, which is not required here. I think you should either use checked-decode, or hash the data and verify the hash (which is part of the frame format, which is not yet implemented); otherwise corrupted or manipulated data could lead to memory corruption. Memory corruption would not be a big issue if the program is sandboxed like in wasm, but I don't think that's the case for Python.

athre0z commented 3 years ago

Nice, thanks! Yes -- since my application is networked, I already had checked-decode enabled. :) It seems to be working great in general, but I noticed that the pos field in the Sink object didn't get updated properly. My compressor looks as follows:

struct LZ4Compressor(i32);

impl Compressor for LZ4Compressor {
    fn compress<A: Allocator + Copy>(&self, input: Vec<u8, A>) -> CompressResult<Vec<u8, A>> {
        // Write uncompressed payload size.
        let len: u32 = input
            .len()
            .try_into()
            .map_err(|_| CompressError::PayloadTooLarge(input.len()))?;

        let cap = lz4_flex::block::compress::get_maximum_output_size(input.len()) + 4;
        let mut out = Vec::with_capacity_in(cap, *input.allocator());
        out.write_u32::<LE>(len).unwrap();

        // Write compressed data.
        unsafe { out.set_len(cap) };
        let actual = lz4_flex::compress_into(&input, &mut (&mut out[4..]).into());
        assert!(actual + 4 <= cap);
        unsafe { out.set_len(actual + 4) };

        Ok(out)
    }

    fn decompress<A: Allocator + Copy>(&self, input: Vec<u8, A>) -> CompressResult<Vec<u8, A>> {
        // Read uncompressed payload size.
        let mut read_slice = &input[..];
        let uncompressed_len = read_slice.read_u32::<LE>().unwrap();

        // Decompress buffer.
        let mut decompressed = Vec::with_capacity_in(uncompressed_len as _, *input.allocator());
        unsafe { decompressed.set_len(uncompressed_len as _) };
        let mut sink: lz4_flex::block::Sink = (&mut decompressed[..]).into();
        lz4_flex::decompress_into(read_slice, &mut sink).unwrap();
        assert_eq!(sink.pos(), uncompressed_len as _);

        Ok(decompressed)
    }
}

And the assert! in decompress fails because pos() is always 0. Perhaps I'm just misunderstanding the field.

PSeitz commented 3 years ago

Ah yes, the pos parameter is currently not updated in the decompression case. It's probably a very niche issue, where decompression succeeds but produces less than uncompressed_len bytes, but it could be caught by checking pos.

There is currently an issue I'm looking into where over-allocation is required in the decompression case, even though it shouldn't be. I think the logic for how the fast loop is exited and the last bytes are copied needs to change a little. This was previously not an issue because of the owned Vec.

athre0z commented 3 years ago

Just like with the compression case, I wouldn't mind having to over-allocate the decompression buffer in presence of a function calculating the required size.

On a more general note, I think it would be best to keep Sink out of the public interface. I do understand the purpose, but feel that the public API would preferably abstract this implementation detail away from the user.

I also wonder if std::io::Cursor could be used internally (which is pretty much the same thing as Sink, implementing Write on Cursor<&mut [u8]>). It might make unsafe writes more complicated, so if you had previously known about this and it simply isn't a good fit, feel very much free to disregard this comment. Sometimes, reusing existing things just isn't a good solution.
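For reference, std::io::Cursor over a mutable slice does behave much like the Sink described above: it implements Write and tracks a position. A minimal std-only illustration:

```rust
use std::io::{Cursor, Write};

fn main() {
    let mut buf = [0u8; 16];
    // Cursor<&mut [u8]> implements Write and tracks how far it has written,
    // much like a slice-plus-position Sink.
    let mut cur = Cursor::new(&mut buf[..]);
    cur.write_all(b"hello").unwrap();
    cur.write_all(b" lz4").unwrap();
    let pos = cur.position() as usize; // bytes written so far
    assert_eq!(pos, 9);
    assert_eq!(&buf[..pos], b"hello lz4");
}
```

The trade-off mentioned above is real, though: Write goes through checked, fallible writes, so unchecked fast-path copies would still need separate handling.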

athre0z commented 3 years ago

#[test]
fn lz4_test() {
    let data = vec![4u8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
    let encoded = LZ4Compressor.compress(data.clone()).unwrap();
    let decoded = LZ4Compressor.decompress(encoded).unwrap();
    assert_eq!(data, decoded);
}

Currently fails with OutputTooSmall { expected_size: 0, actual_size: 20 }. I suppose that's one of the cases where "overallocation is required in the decompression case"? It looks like the expected_size is wrong here.

PSeitz commented 3 years ago

Yes, that's the issue I meant. I agree Sink should be replaced. It's mainly there to have a smooth transition from the vec methods and Cursor would have been the better choice. I will consider it in the refactoring.

Over-allocation for decompression is not compatible with the frame format, because it could overwrite parts of other blocks.

PSeitz commented 3 years ago

@athre0z This should work now, can you retest? I replaced the Sink api with &mut [u8] + pos.

athre0z commented 3 years ago

Sorry for the late response! Thanks, everything seems to be working as expected now, and the interface is a lot less cumbersome. :) The output_pos argument is slightly redundant (because you can also just do &mut output[output_pos..]), but I also don't mind it.