dropbox / lepton

Lepton is a tool and file format for losslessly compressing JPEGs by an average of 22%.
https://blogs.dropbox.com/tech/2016/07/lepton-image-compression-saving-22-losslessly-from-images-at-15mbs/
Apache License 2.0

Question around streaming #79

Closed Melozzola closed 7 years ago

Melozzola commented 7 years ago

The blog post states:

Lepton is a fully streamable format, meaning the decompression can be applied to any file as that file is being transferred over the network. Hence, streaming overlaps the computational work of the decompression with the file transfer itself, hiding latency from the user.

How can I decompress on the fly while the file is being transferred over the network? I have my file stored on Amazon S3 and I would like to decompress it while downloading, without first storing the entire file on disk. Is this possible with Lepton?

PS: I tried to run lepton in TCP mode, but it looks like I have to upload the whole compressed file before I start getting a response with the decompressed file.

danielrh commented 7 years ago

You are correct that right now the codec doesn't start decompressing until it has the whole file. It would be a pretty straightforward change to make it stream the units of work to each processing thread. The file format is designed for it, but we didn't implement the decoder that way, for simplicity.

Basically, instead of calling MuxReader::read_all in https://github.com/dropbox/lepton/blob/master/src/io/MuxReader.hh it would have to change to using MuxReader::read() for each thread and handing the work units to each thread on the respective data pipes. There might need to be a packet protocol between the main thread, which decodes from the network, and each worker thread.
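To make the idea concrete, here is a minimal Python sketch of a main thread that reads framed packets from a stream and hands each work unit to a per-worker pipe. The framing (1-byte worker id, 2-byte length) and the names `frame`, `dispatch`, `worker` and `run` are assumptions made up for illustration; this is not Lepton's actual mux format or the MuxReader API.

```python
import queue
import struct
import threading

# Hypothetical framing, NOT Lepton's real format:
# 1-byte worker id followed by a 2-byte big-endian payload length.
HEADER = struct.Struct(">BH")


def frame(worker_id, payload):
    """Wrap one work unit in the hypothetical packet header."""
    return HEADER.pack(worker_id, len(payload)) + payload


def dispatch(stream, pipes):
    """Main thread: read packets as they arrive and route them to workers."""
    while True:
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            break  # end of stream (or truncated header)
        worker_id, length = HEADER.unpack(header)
        pipes[worker_id].put(stream.read(length))
    for pipe in pipes:
        pipe.put(None)  # sentinel: no more work for this worker


def worker(pipe, out):
    """Worker thread: consume work units until the sentinel arrives."""
    while True:
        unit = pipe.get()
        if unit is None:
            break
        out.append(unit)  # stand-in for the real decoding work


def run(stream, n_workers):
    """Start the workers, dispatch the whole stream, return per-worker units."""
    pipes = [queue.Queue() for _ in range(n_workers)]
    results = [[] for _ in range(n_workers)]
    threads = [
        threading.Thread(target=worker, args=(pipe, out))
        for pipe, out in zip(pipes, results)
    ]
    for t in threads:
        t.start()
    dispatch(stream, pipes)
    for t in threads:
        t.join()
    return results
```

Because `dispatch` forwards each unit as soon as it is read, workers can start before the rest of the stream has arrived, which is the property the change above is after. Any object with a blocking `read(n)` works as `stream`, for example `sock.makefile("rb")`.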

Melozzola commented 7 years ago

Thank you for the response. I will see if I have time to learn a bit of C++ (my background is java) and make the change you suggested ;-) Anyway, the fact that the decompression doesn't start until the whole file is sent raises a few more questions:

  1. Is the uploaded file kept in memory or stored on disk?
  2. If it is stored on disk, will it be deleted automatically?
  3. With the current implementation, is there any other way to decompress while streaming? What is the suggested set up to have a streamable decompression during the download?

Some questions might be obvious from the code, but given my limited experience in C++ I prefer to have confirmation from the experts, so apologies if some of the questions have obvious answers.

danielrh commented 7 years ago

Hi Melozzola: I've actually implemented the feature to stream the file during download on this branch here: https://github.com/dropbox/lepton/tree/fully_streaming_decompress please let me know if it suits your workload. This was a very difficult refactor

  1. If you run lepton with the - argument, it only uses pipes and never hits the disk. In fact, it has no access to the filesystem after SECCOMP is turned on (before user data is read from the file descriptor).
  2. No file is stored on the disk unless you specify a file output as the second argument.
  3. Yes: use the new branch https://github.com/dropbox/lepton/tree/fully_streaming_decompress
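For the pipe-only mode, a rough Python sketch of feeding a child process chunk by chunk while reading its output concurrently could look like the following. It assumes that `./lepton -` reads from stdin and writes to stdout, as described above; the helper name `stream_through` is made up for this example, and the chunk source (for instance an HTTP download from S3) is left to the caller.

```python
import subprocess
import threading


def stream_through(cmd, chunks, read_size=65536):
    """Feed `chunks` to cmd's stdin and yield its stdout as it becomes available.

    Nothing is written to disk: input and output both travel over pipes.
    """
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def feed():
        try:
            for chunk in chunks:
                proc.stdin.write(chunk)
                proc.stdin.flush()  # push each chunk to the child right away
            proc.stdin.close()      # EOF tells the child the input is complete
        except BrokenPipeError:
            pass  # child exited early; the exit code is checked below

    feeder = threading.Thread(target=feed, daemon=True)
    feeder.start()

    while True:
        # read1 returns whatever is available instead of waiting for read_size
        out = proc.stdout.read1(read_size)
        if not out:
            break
        yield out

    feeder.join()
    returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, cmd)
```

Usage would be along the lines of `for piece in stream_through(["./lepton", "-"], download_chunks): sink.write(piece)`, where `download_chunks` and `sink` are placeholders for your own download iterator and output. Writing on a separate thread matters here: writing everything before reading anything can deadlock once the pipe buffers fill up.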

let me know if that works! If so I may merge it into master soon

Melozzola commented 7 years ago

Hi Daniel, that's great thank you! I had a first try but I'm not sure I understood how it's working and if the streaming is also working in TCP mode. What I did is:

  1. Compiled lepton from the fully_streaming_decompress branch
  2. Ran lepton with ./lepton -listen=2020 -
  3. Used a Java TCP client that sends the compressed file to lepton.

The Java client uses 2 threads, one for sending and one for receiving the data. What I'm observing is that I need to upload the whole file and shut down the write side, and only then do I start getting the response with the decompressed data. What I was expecting is that, as I write to the socket, chunks of decompressed data would start coming back from lepton straight away. Am I doing something wrong?
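For reference, the client pattern described here (one thread sending, another receiving, then a half-close of the write side) can be sketched in a few lines of Python. This is only an illustration of the pattern, not the Java client itself; the function name `exchange` and its parameters are invented for the example.

```python
import socket
import threading


def exchange(host, port, chunks, on_data, recv_size=65536):
    """Send `chunks` on a background thread while this thread receives the reply.

    `on_data` is called with every piece of response data as soon as it arrives.
    """
    with socket.create_connection((host, port)) as sock:

        def send_all():
            for chunk in chunks:
                sock.sendall(chunk)
            # Half-close: tell the server the input is complete while
            # keeping the read side open for the rest of the response.
            sock.shutdown(socket.SHUT_WR)

        sender = threading.Thread(target=send_all)
        sender.start()

        while True:
            data = sock.recv(recv_size)
            if not data:
                break  # server closed its side: response finished
            on_data(data)

        sender.join()
```

With a server listening on port 2020 as in step 2, a call might look like `exchange("localhost", 2020, compressed_chunks, out_file.write)`, where `compressed_chunks` and `out_file` are placeholders. If the server streams, `on_data` fires before `send_all` has finished.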

danielrh commented 7 years ago

Hmm, that should work with the design right now. The image does need to be big enough to activate the streaming mode, e.g. 2-3 megabytes. Can you send me your test image and .lep files? I will construct a similar test over here and report back. Another thing you can try is to send about 300KB of data, sleep for ten seconds, then send the rest, and see if that's significantly faster than sending 1KB of data and then sleeping for 10 seconds.
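That experiment can be scripted. The Python sketch below sends a prefix of the file, pauses, sends the rest, and reports how long it took for the first response byte to come back. The function name `time_to_first_byte` and its parameters are made up for this example, and the host and port are whatever your server listens on.

```python
import socket
import threading
import time


def time_to_first_byte(host, port, data, first_chunk, pause):
    """Send `first_chunk` bytes, sleep `pause` seconds, then send the rest.

    Returns the seconds elapsed until the first response byte arrived,
    or None if the server never sent anything.
    """
    with socket.create_connection((host, port)) as sock:
        start = time.monotonic()

        def send():
            sock.sendall(data[:first_chunk])
            time.sleep(pause)
            sock.sendall(data[first_chunk:])
            sock.shutdown(socket.SHUT_WR)  # signal end of input

        sender = threading.Thread(target=send)
        sender.start()

        first = None
        while sock.recv(65536):
            if first is None:
                first = time.monotonic() - start
        sender.join()
        return first
```

Run it twice on the same file, once with `first_chunk=300 * 1024` and once with `first_chunk=1024`, both with `pause=10`. If streaming is active, the first run should report a time well below the pause, while the second should report a time at or above it.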

Also: it will only stream properly on baseline JPEGs. On progressives, it will do the computation but won't start producing bytes until after the full image is received, since the Lepton format stores the data in a different order (analogous to baseline) than the output file format.
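To check which kind of JPEG a test image is before compressing it, you can look at its start-of-frame marker: SOF0 (0xFFC0) means baseline and SOF2 (0xFFC2) means progressive. The sketch below is a generic header scan written for this thread, not code taken from Lepton; the function name `jpeg_mode` is made up, and it does not handle unusual files that place standalone markers before the frame header.

```python
import struct

# Start-of-frame markers and the encoding process they indicate.
SOF_NAMES = {
    0xC0: "baseline",
    0xC1: "extended sequential",
    0xC2: "progressive",
}


def jpeg_mode(data):
    """Return the encoding mode named by the first SOF marker, or None."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:
            pos += 1  # fill byte before a marker, skip it
            continue
        if marker in SOF_NAMES:
            return SOF_NAMES[marker]
        if marker == 0xDA:
            break  # start of scan reached without a recognised frame header
        # Every other header segment carries a 2-byte big-endian length
        # that counts the length field itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2 : pos + 4])
        pos += 2 + length
    return None
```

For example, `jpeg_mode(open("test.jpg", "rb").read())` (with `test.jpg` standing in for your own file) returning `"progressive"` would explain output that only appears after the full upload.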

Melozzola commented 7 years ago

Hi Daniel, I think you are right, I was using an image that was too small. I tried with a bigger one and I can see the receiver thread getting data while the sender thread is still uploading, which is great! I will do some more testing over the next few days and post an update, if that's OK with you.

danielrh commented 7 years ago

I really appreciate it! Thanks so much for testing this feature :-) I'll keep testing on my end, and if satisfied, intend to merge to master pretty soon, and probably cook up a new release before too long

danielrh commented 7 years ago

pushed to master in eca9b99c1433b2f5dc56076f000f9a53dfeac135; closing for now

Melozzola commented 7 years ago

Hi Daniel, I've noticed you merged the branch into master. Unfortunately I haven't had much time to test the functionality, mainly because I'm trying to write a NIO Java client that takes advantage of the new streaming functionality.

Currently I'm facing a strange situation where:

I implemented a dummy TCP server to validate the NIO TCP client, and the client should work.

Any idea why it is not working with NIO? Is there any log I can check to identify what's wrong? If you are a bit familiar with Java, I can provide a small project that lets you test the 2 different scenarios.

Thank you very much