nh2 closed this 1 year ago
I'll trust you on that; I don't have the tooling to check it.
@Twinside Here's an easy way to measure, e.g. in GHCi:
```haskell
import qualified Data.ByteString.Lazy as L
import Data.Binary
import Data.Binary.Get
import Control.DeepSeq (deepseq)  -- needed for the `deepseq` below
import qualified Codec.Picture.Jpg.Internal.Types as JPG
:set -XTypeApplications
:set +s
L.readFile "large110MB.jpg" >>= \bs -> return $ case runGetOrFail (get @JPG.JpgImage) bs of { Left (_rest, offset, err) -> Left ("ERROR", offset, err) ; Right (_rest, offset, jpgImage) -> Right (offset, jpgImage `deepseq` ()) }
```
The previous implementation prints (by means of `:set +s`, which enables timing):

```
(806.22 secs, 121,282,712 bytes)
```
Doing the same with the new implementation (`parseECS`) from this PR prints:

```
(0.40 secs, 122,191,224 bytes)
```

So for this case, it is 2000x faster.
For `parseECS_simple`, I get:

```
(0.88 secs, 4,729,102,080 bytes)
```

This is still quite fast, but 2.5x slower than `parseECS`, and it does 20x more allocation.
For the claim

> only ~20% slower than a non-lazy ByteString based loop

I simply made a copy of the existing `extractScanContent` and switched the types from `.Lazy` to normal `ByteString`, like this:
```haskell
extractScanContentStrict :: L.ByteString -> (L.ByteString, L.ByteString)
extractScanContentStrict str_lazy = aux 0
  where
    !str = L.toStrict str_lazy
    !maxi = fromIntegral $ B.length str - 1
    aux !n
      | n >= maxi = (L.fromStrict str, L.empty)
      | v == 0xFF && vNext /= 0 && not isReset =
          let (a, b) = B.splitAt n str in (L.fromStrict a, L.fromStrict b)
      | otherwise = aux (n + 1)
      where
        v = {- (if n `mod` 1000000 == 0 then trace ("  n = " ++ show n) else id) -} str `B.index` n
        vNext = str `B.index` (n + 1)
        isReset = 0xD0 <= vNext && vNext <= 0xD7
```
For that I obtained:

```
(0.35 secs, 235,527,808 bytes)
```
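As a quick sanity check of the loop above (my own addition, not from the PR; the definition is repeated so the snippet is self-contained): it splits at the first `0xFF` that is followed by a byte which is neither `0x00` (byte stuffing) nor an RST marker:

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L

-- Identical to the loop above, repeated to make this snippet self-contained.
extractScanContentStrict :: L.ByteString -> (L.ByteString, L.ByteString)
extractScanContentStrict str_lazy = aux 0
  where
    !str = L.toStrict str_lazy
    !maxi = fromIntegral (B.length str - 1)
    aux !n
      | n >= maxi = (L.fromStrict str, L.empty)
      | v == 0xFF && vNext /= 0 && not isReset =
          let (a, b) = B.splitAt n str in (L.fromStrict a, L.fromStrict b)
      | otherwise = aux (n + 1)
      where
        v = str `B.index` n
        vNext = str `B.index` (n + 1)
        isReset = 0xD0 <= vNext && vNext <= 0xD7

main :: IO ()
main = do
  -- 0xFF 0x00 is a stuffed data byte (the scan continues);
  -- 0xFF 0xD9 (EOI, not an RST marker) ends it.
  let (ecs, rest) = extractScanContentStrict (L.pack [1, 2, 0xFF, 0x00, 3, 0xFF, 0xD9])
  print (L.unpack ecs, L.unpack rest)  -- ([1,2,255,0,3],[255,217])
```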
In the commit message I made the claim that `getRemainingLazyByteString` does not work well with `binary`'s incremental input interface. That can be checked with these commands, using the `binary-conduit` package as an example:
```haskell
import Conduit
import Data.Conduit.Serialization.Binary  -- from `binary-conduit`

-- Only reads a small part of the file:
runConduitRes $ sourceFile "bigfile.bin" .| sinkGet ((\a rest -> a) <$> getWord8 <*> getWord8)

-- This reads the entire file via the conduit (which is not lazy IO):
runConduitRes $ sourceFile "bigfile.bin" .| sinkGet ((\a rest -> a) <$> getWord8 <*> getRemainingLazyByteString)
```
@Twinside There are other usages of `L.index` and `Lb.index` in JuicyPixels that might also be quadratic and that I didn't fix. For example:

```
$ git grep '\.index' | grep -v '\bB\.index\b'
src/Codec/Picture/HDR.hs:    where at n = L.index str . fromIntegral $ idx + n
src/Codec/Picture/HDR.hs:    | otherwise = pure $ L.index inputData (fromIntegral idx)
src/Codec/Picture/Png.hs:    PixelRGBA8 r g b $ Lb.index transpBuffer (fromIntegral ix)
```
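To illustrate why repeated `L.index` calls in a loop are quadratic (a small sketch of my own, not code from this PR or from JuicyPixels): `L.index` must walk the chunk list from the start on every call, so indexing every position of a lazy ByteString does O(n · chunks) work, while a single fold over it is linear:

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L

-- A lazy ByteString made of many small chunks, mimicking a file read in
-- pieces (tiny chunks here to exaggerate the effect; the default is 32 KB).
manyChunks :: Int -> L.ByteString
manyChunks n = L.fromChunks (replicate n (B.replicate 64 1))

-- Accidentally quadratic: every L.index call re-walks the chunk list.
sumViaIndex :: L.ByteString -> Int
sumViaIndex bs = go 0 0
  where
    !len = L.length bs
    go !acc !i
      | i >= len  = acc
      | otherwise = go (acc + fromIntegral (L.index bs i)) (i + 1)

-- Linear: a single strict traversal over the chunks.
sumViaFold :: L.ByteString -> Int
sumViaFold = L.foldl' (\acc w -> acc + fromIntegral w) 0

main :: IO ()
main = do
  let bs = manyChunks 1000
  -- Same result, but the left side does O(n * chunks) work.
  print (sumViaIndex bs == sumViaFold bs)  -- True
```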
@Twinside It would be nice if you could tell me whether the semi-lazy behaviour of JPEG parsing was accidental or intentional. If it was accidental, we could consider it a bug and perhaps switch the JPEG parser's implementation from the quirky semi-lazy one to the strict one (which so far I haven't done).
This PR seems to introduce a bug. This code:

```haskell
-- Load the image
dynamicImage <- decodeImage contents
pure $ imageToJpg 100 dynamicImage
```

turns this image: *(original image)* into this image: *(corrupted output image)*
I will investigate.
This should fix it: PR #216
Merged! I'll push an update to Hackage "soon©".
Makes JPEG parsing 1000x faster for large pictures (I tried with a 110 MB one).

It also fixes incorrect usage of the `binary` package's `Get` parser API, which caused the `Get` based functions to not correctly maintain the parser offset (how many bytes were consumed).

Please see the individual commit messages, especially the one of the commit titled "Jpg: Fix quadratic JPEG parsing", for full details. Other preparatory refactoring commits are also included.
Copying the main commit's initial message here for easy reading:

> `Data.ByteString.Lazy`'s `index` is O(chunks), not O(1). The default chunk size is 32 KB. Thus, calling that `L.index` in a `+1` loop, as `extractScanContent` did, caused accidentally quadratic runtime. This can be observed by adding a `trace` of `n` into the loop and watching the printouts get slower over time.
>
> Further, the use of `getRemainingLazyBytes` from `binary`'s `Data.Binary.Get` module resulted in weird semi-lazy behaviour that was both incorrect (not setting the parser offset correctly, generating misleading error offsets and messages) and slow (encouraging this loop over lazy ByteStrings instead of just using the `Get` parser as intended). I documented this behaviour on the `parseFrames` function in its haddocks, and subsequently renamed it to `parseFramesSemiLazy`.
>
> This commit replaces `extractScanContent` by a normal `Get` based parser, in two variants: `parseECS` and `parseECS_simple`. (ECS is the proper name for the "scan content" bytes according to the spec.) The simple one is the straightforward translation, the other one a higher-performance implementation that is only ~20% slower than a non-lazy ByteString based loop. Both variants are faster than the original implementation because they are linear, not accidentally quadratic.
>
> `parseFrames` is replaced by a strict implementation that uses the new, correct `parseECS`. Compared to the previous `parseFrames` (and current `parseFramesSemiLazy`) it also fixes the remaining issues I found. `parseFrames` was previously unexported from this `.Internal` module, so this rename has no backwards incompatibility implications.
>
> `instance Binary JpgImage` continues to use `parseFramesSemiLazy`, for which I've taken care to preserve its existing quirky laziness semantics. I kept it this way for now because, due to a lack of comments, it is unclear to me whether this quirky lazy behaviour was intended or a complete accident.

Beyond that:

* It should now be easier to write alternative implementations of `parseFrames`, for example one that searches for the first scan header to determine the image dimensions without parsing the whole JPG.
* With the parser offset maintained correctly, the parsers can be used with `binary-conduit` or other uses of `binary`'s incremental parser input interface.
* A test comparing `parseECS` with `parseECS_simple` is added.
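For readers unfamiliar with `binary`, here is a minimal sketch of what scanning ECS bytes with a plain `Get` parser can look like, in the byte-by-byte style of `parseECS_simple`. This is my own illustration, not the PR's actual `parseECS`; `parseECSSketch` is a hypothetical name, and the sketch assumes the scan always ends with a marker (e.g. EOI), as valid JPEGs do:

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Binary.Get (Get, getWord8, lookAhead, runGetOrFail)
import qualified Data.ByteString.Lazy as L

-- Sketch (not the PR's parseECS): consume ECS bytes until a 0xFF is
-- followed by a byte that is neither 0x00 (byte stuffing) nor an RST
-- marker (0xD0-0xD7). The terminating marker itself stays unconsumed,
-- so the parser offset ends exactly at the start of the marker.
parseECSSketch :: Get L.ByteString
parseECSSketch = L.pack . reverse <$> go []
  where
    go !acc = do
      pair <- lookAhead ((,) <$> getWord8 <*> getWord8)
      case pair of
        (0xFF, next)
          | next /= 0x00 && not (0xD0 <= next && next <= 0xD7) ->
              pure acc              -- a real marker begins here; stop
        (b, _) -> do
          _ <- getWord8             -- consume one byte and continue
          go (b : acc)

main :: IO ()
main =
  case runGetOrFail parseECSSketch (L.pack [1, 2, 0xFF, 0x00, 3, 0xFF, 0xD9]) of
    Left (_rest, off, err)  -> error (show (off, err))
    Right (rest, off, ecs) -> print (L.unpack ecs, off, L.unpack rest)
    -- prints ([1,2,255,0,3],5,[255,217]): offset 5 points at the EOI marker
```

Because the offset is maintained by `Get` itself, such a parser composes correctly with `runGetOrFail` error offsets and with incremental input, which is the point of replacing the `getRemainingLazyByteString`-based loop.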