UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0
1.66k stars 235 forks

Support DCT Filter #532

Open aagubanov opened 1 year ago

aagubanov commented 1 year ago

This is an enhancement request. The DCT (Discrete Cosine Transform) filter is not supported, and as a result some embedded JPEG images cannot be extracted.

fnatzke commented 1 year ago

@aagubanov, @EliotJones,

I do have this well underway.

The DCT decode itself is well developed (for 1 of 4 modes); however, post-decode there is a lot of variation and complexity yet to be addressed, including:

  1. translating to (around 8 or so) “final” colorspaces
  2. (sub)sampling
  3. stretch
  4. masks (alpha/transparency)

Beyond implementation, there will be testing.

The test matrix is large, with many options and combinations. Beyond testing with hand-built test PDFs created from scratch, there is the task of finding “in the wild” examples of “good” and “bad” implementations.

Relates to #484: image export from the PDF provided there starts with the DCT filter, before the ColorSeparation colorspace is used.

The ColorSeparation colorspace itself makes use of a “Tint function”, which can be one of 4 function types: 0 (Sampled function), 2 (Exponential interpolation function), 3 (Stitching function) and 4 (PostScript calculator function).
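As an illustration of the simplest of these, the Type 2 (exponential interpolation) function maps a tint value x to each output component as C0 + x^N * (C1 - C0). The sketch below is a minimal Python version under that definition; the function name and the sample C0/C1/N values are made up for the example and are not PdfPig code.

```python
def exponential_interpolation(x, c0, c1, n):
    """PDF Type 2 function: y_j = C0_j + x**N * (C1_j - C0_j)."""
    # Clamp the tint to [0, 1], the usual domain for a Separation tint.
    x = min(max(x, 0.0), 1.0)
    return [a + (x ** n) * (b - a) for a, b in zip(c0, c1)]


# Hypothetical spot colour whose alternate space is CMYK.
c0 = [0.0, 0.0, 0.0, 0.0]   # output at tint 0 (no ink)
c1 = [0.0, 1.0, 0.5, 0.0]   # output at tint 1 (full ink)
print(exponential_interpolation(0.5, c0, c1, 1.0))
```

With N = 1 this is plain linear interpolation between C0 and C1; other exponents bend the curve.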

These are also well underway; however, testing them will again be a significant effort.

Have found 11 (public) “in the wild” example PDFs using separation colorspace [it's rare].

DCT (Discrete Cosine Transform), per ITU-T T.81 section 4.5, has four distinct modes of operation with various coding processes:

  1. sequential DCT-based,
  2. progressive DCT-based,
  3. lossless, and
  4. hierarchical.

Currently only mode 1 is implemented. The PDF spec calls for mode 2 support (but it is unlikely to be needed in practice for most PDFs). The decoder supports 8-bit grayscale and YCbCr images. Translation to the RGB colorspace is done; other colorspaces require work. Restart markers are supported.
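To see which of these modes a given DCT stream uses, one option is to walk the JPEG marker segments until the first start-of-frame (SOF) marker. The following is a rough Python sketch, not PdfPig code; the function name is hypothetical, it only covers the Huffman-coded SOF markers, and it treats a DHP marker ahead of the first frame as the sign of hierarchical mode.

```python
import struct

# Start-of-frame markers (second marker byte) from ITU-T T.81, Table B.1.
SOF_MODES = {
    0xC0: "sequential DCT (baseline)",
    0xC1: "sequential DCT (extended)",
    0xC2: "progressive DCT",
    0xC3: "lossless",
}
DHP = 0xDE  # "define hierarchical progression", precedes hierarchical frames


def jpeg_mode(data: bytes) -> str:
    """Return the mode named by the first SOF (or DHP) marker in `data`."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:          # fill byte before a marker, skip it
            pos += 1
            continue
        if marker == DHP:
            return "hierarchical"
        if marker in SOF_MODES:
            return SOF_MODES[marker]
        # Every other segment seen before the frame header carries a
        # big-endian length that includes the two length bytes themselves.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        pos += 2 + length
    raise ValueError("no SOF marker found")
```

Calling `jpeg_mode` on the raw bytes of a DCTDecode stream would then tell you whether a mode 1 decoder is enough for that image.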

Adobe Technical Note #5116 details additional decode handling (from inside a PDF), including support for the APP14 "Adobe" application segment, which hints at the colorspace transform to apply. The default is to use the YCC-to-RGB color transform. Byte 11 signals color translations of: 0 = CMYK, 1 = YCCK.
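For reference, the default YCC-to-RGB step can be sketched per pixel as below. This assumes the full-range JFIF/BT.601 coefficients for 8-bit samples; the function name is made up for the example, and this is not the code from the work in progress.

```python
def ycbcr_to_rgb(y: int, cb: int, cr: int) -> tuple:
    """Convert one 8-bit full-range YCbCr sample to 8-bit RGB (JFIF)."""
    def clamp(v: float) -> int:
        # Round to nearest, then keep the result inside 0..255.
        return max(0, min(255, int(round(v))))

    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp(r), clamp(g), clamp(b)


print(ycbcr_to_rgb(128, 128, 128))  # mid grey stays mid grey
```

A YCCK image can be handled by applying the same conversion to the first three components and passing K through, although how the result is then interpreted as CMYK depends on the encoder.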

8-bit only (16-bit or other bit depths require down- or up-sampling to 8-bit; yet to be implemented).
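One simple way to do that resampling is to rescale each sample linearly from its own range to 0..255. The sketch below assumes unsigned integer samples and a hypothetical helper name; it is an illustration, not the planned implementation.

```python
def to_8bit(sample: int, bits: int) -> int:
    """Rescale an unsigned `bits`-wide sample to 0..255, rounding to nearest."""
    max_in = (1 << bits) - 1          # largest value at the source depth
    if not 0 <= sample <= max_in:
        raise ValueError("sample out of range for the given bit depth")
    # Integer arithmetic: adding max_in // 2 rounds instead of truncating.
    return (sample * 255 + max_in // 2) // max_in


print(to_8bit(65535, 16), to_8bit(1, 1), to_8bit(2048, 12))
```

The same function covers both directions: 16-bit and 12-bit samples are scaled down, while 1-, 2- and 4-bit samples are scaled up.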

Once all post-decode image processing is implemented, the final step will be translating the result (a Device Independent Bitmap) to PNG for final export from the library.

So coming but not soon.

BobLd commented 1 year ago

@fnatzke I've implemented the 4 function types in https://github.com/UglyToad/PdfPig/pull/557, and the separation colorspace now loads the actual function.

Also, you can find a lot of strange PDFs (I'm pretty sure you'll find PDFs using the separation colorspace) here: https://github.com/pdf-association/pdf-corpora#safedocs-issue-tracker-corpus

fnatzke commented 1 year ago

Well that's a surprise. Think I've recovered. Is there a way we can coordinate contributions?

I've published your link and the URLs of the PDFs I've found (so about 30,000, roughly 50 GB) in https://github.com/UglyToad/PdfPig/discussions/302.

For the URLs I've found, I've provided a command line to download them in the discussion. Hope it can help someone.

BobLd commented 1 year ago

@fnatzke I'm going to create a discussion where we can coordinate

EliotJones commented 1 year ago

@BobLd do you happen to know the current state of this? Is DCT support now complete?

BobLd commented 1 year ago

@EliotJones as far as I know, it's not. I'm not sure whether @fnatzke is still working on it.