JuliaIO / JpegTurbo.jl

Julia interface to libjpeg-turbo
MIT License

Non-allocating image loading #23

Open lorenzoh opened 2 years ago

lorenzoh commented 2 years ago

In my quest for ever-faster image data pipelines for training neural nets, I've been playing around with the source code to figure out how to reduce allocations when decoding images. I'm writing here to see if my assumptions about how one could go about this are correct and to clear up some questions.

It seems there are two allocations made: one for the output array `out` and one for the intermediate `buf` that libjpeg-turbo writes into.

Copying some code from `jpeg_decode`, I've managed to make a method that takes in an `out` array of the correct size and type and uses it instead of allocating a new one. There are some segfaults when the transposing isn't handled correctly or when the size and type of the buffer aren't right, but I assume these can be fixed. In any case, removing this allocation cuts memory usage in half. I assume something similar could be done for the `buf` allocation; for `CT`s that are backed by `UInt8`s anyway (like `RGB{N0f8}`), maybe even a view will do.

As for the API for buffered data loading, I was thinking it may be safest to have a `Buffer` struct that holds both `out` and `buf`. Since one often wants to reuse this buffer to load images of differing sizes, the buffers could be grown to the largest encountered image size, and images smaller than the current `out` buffer returned as views. This could be used something like:

```julia
buffer = JpegTurbo.Buffer(RGB{N0f8}, initialsize)
img = JpegTurbo.jpeg_decode!(buffer, file|data)
parent(img) === buffer.out
```
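To make the idea concrete, here is a rough sketch of what such a struct could look like. None of these names or fields exist in JpegTurbo.jl; this only illustrates the grow-and-view behavior:

```julia
using Colors, FixedPointNumbers  # for Colorant, RGB, N0f8

# Hypothetical Buffer type: `out` is stored transposed as (width, height) to
# match libjpeg-turbo's row-major layout, `buf` is the raw byte scratch space.
mutable struct Buffer{CT<:Colorant}
    out::Matrix{CT}
    buf::Vector{UInt8}
end

Buffer(::Type{CT}, sz::Dims{2}) where {CT<:Colorant} =
    Buffer{CT}(Matrix{CT}(undef, sz), Vector{UInt8}(undef, prod(sz) * sizeof(CT)))

# Grow the buffers if the incoming image is larger than anything seen so far,
# then hand back a view of exactly the region the new image will occupy.
function ensure_size!(buffer::Buffer{CT}, sz::Dims{2}) where {CT}
    if any(sz .> size(buffer.out))
        newsz = max.(sz, size(buffer.out))
        buffer.out = Matrix{CT}(undef, newsz)
        resize!(buffer.buf, prod(newsz) * sizeof(CT))
    end
    return view(buffer.out, 1:sz[1], 1:sz[2])
end
```

A hypothetical `jpeg_decode!(buffer, data)` would then call something like `ensure_size!` with the decoded image size and write into the returned view, so that `parent(img) === buffer.out` holds.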

Does this approach make sense? Is there a simpler way? Am I missing something when it comes to the transposing?

johnnychen94 commented 2 years ago

> In my quest for ever-faster image data pipelines for training neural nets

I'm glad to know that you're interested in this package 😆 Curious to ask: won't the JPEG compression artifacts make training the network harder? I thought we'd need a lossless compression format, e.g., HDF/PNG/QOI, to build a more robust pipeline.

> for `CT`s that are backed by `UInt8`s anyway (like `RGB{N0f8}`), maybe even a view will do.

I also feel this is doable. I left a TODO here for this option but didn't figure out how to do it when I did the initial implementation. https://github.com/johnnychen94/JpegTurbo.jl/blob/33d53e772cb3c11c0f49082d6493b06bec6cbaea/src/decode.jl#L131-L135
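As a standalone sanity check that a no-copy view works for `UInt8`-backed colorants (this is not JpegTurbo.jl code; the 3×4×2 array just stands in for whatever libjpeg-turbo wrote into `buf`):

```julia
using Colors, FixedPointNumbers

buf = rand(UInt8, 3, 4, 2)                    # pretend these are decoded RGB bytes of a 4x2 image
rawimg = reinterpret(reshape, RGB{N0f8}, buf) # 4x2 RGB{N0f8} array, no copy
size(rawimg) == (4, 2)                        # true; rawimg aliases buf
```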

> Am I missing something when it comes to the transposing?

I guess it's mainly because Julia uses column-major order while libjpeg-turbo uses row-major order, so when you preallocate `out`, its size actually has to be `(width, height)`. See also the permute step at the end: https://github.com/johnnychen94/JpegTurbo.jl/blob/33d53e772cb3c11c0f49082d6493b06bec6cbaea/src/decode.jl#L100-L104
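To illustrate the layout mismatch (sizes are made up; this is just a sketch, not what decode.jl does verbatim):

```julia
using Colors, FixedPointNumbers

height, width = 480, 640
# Preallocate in libjpeg-turbo's row-major orientation ...
out = Matrix{RGB{N0f8}}(undef, width, height)
# ... and expose the usual Julia (height, width) image as a lazy, copy-free permutation.
img = PermutedDimsArray(out, (2, 1))
size(img) == (480, 640)  # true
```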

> Does this approach make sense?

Yes, the `JpegTurbo.Buffer` idea sounds good to me. However, I won't be available to work on it in the coming semester; it may be one or two months before I can get to this. If you want to put up a PR, I'd be very glad to review and merge it.

lorenzoh commented 2 years ago

> Curious to ask: won't the JPEG compression artifacts make training the network harder?

Many image datasets come as .jpg, and that's enough information, especially if the images are stored at larger sizes, which can still be read quickly with JpegTurbo.jl using `preferred_size`. The most destructive thing is applying multiple resizes/affine transformations to the same image, since the image quality degrades every time. So it can actually help not to have to pre-resize the dataset.
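For example, something along these lines (filename and target size are placeholders, and the exact output size depends on the scale ratios libjpeg-turbo supports):

```julia
using JpegTurbo, Colors, FixedPointNumbers

# Decode directly to roughly the size the training pipeline needs; scaling
# during decoding is much cheaper than decoding full-size and resizing afterwards.
img = jpeg_decode(RGB{N0f8}, "some_image.jpg"; preferred_size=(224, 224))
```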

> I also feel this is doable. I left a TODO here for this option but didn't figure out how to do it when I did the initial implementation.

I'll see if I can figure this out, but if one really cares about performance, I guess `JpegTurbo.Buffer` will be the way to go anyway.

> If you want to put up a PR, I'd be very glad to review and merge it.

That was my plan! I just wanted to check whether there were any stumbling blocks I'd missed.