Open aagubanov opened 1 year ago
@aagubanov, @EliotJones,
I do have this well underway.
The DCT decode itself is well developed (for 1 of the 4 modes); however, the post-decode handling has a lot of variation and complexity yet to be addressed, including
Beyond implementation, there will be testing.
The test matrix is large, with many options and combinations. Beyond testing against “cold” hand-built test PDFs made from scratch, there is the work of finding “in the wild” examples of “good” and “bad” implementations.
Relates to #484: image export from the provided PDF starts with the DCT filter, before the ColorSeparation colorspace is applied.
The ColorSeparation colorspace itself makes use of a “Tint function”, which can be implemented in 4 modes: type 0 (sampled function), type 2 (exponential interpolation function), type 3 (stitching function), and type 4 (PostScript calculator function).
These are also well underway; however, testing them will again be significant.
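As an illustration of the simplest of these, here is a minimal sketch of a type 2 (exponential interpolation) tint function, which computes `y_j = C0_j + x^N * (C1_j - C0_j)` for each output component. The function name `exponential_tint` and the clipping of the input to [0, 1] are assumptions for this example; it is not the PdfPig implementation.

```python
def exponential_tint(x, c0, c1, n):
    """Evaluate a type 2 function: y_j = C0_j + x**N * (C1_j - C0_j).

    x  : tint value, assumed here to have the domain [0, 1]
    c0 : output components at x = 0
    c1 : output components at x = 1
    n  : interpolation exponent (N in the function dictionary)
    """
    x = min(max(x, 0.0), 1.0)  # clip to the assumed domain
    return [a + (x ** n) * (b - a) for a, b in zip(c0, c1)]


# Hypothetical separation whose alternate space is CMYK: full tint is pure magenta.
half_tint = exponential_tint(0.5, [0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], 1)
```

With `N = 1` this is plain linear interpolation between `C0` and `C1`, which seems to be the common case for spot colors.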
I have found 11 (public) “in the wild” example PDFs using the separation colorspace (it's rare).
DCT (Discrete Cosine Transform), based on ITU-T T.81 section 4.5, has four distinct modes of operation with various coding processes.
Adobe Technical Note #5116 details additional decode handling (from inside a PDF), including support for the APP14 "Adobe" application segment, which hints at the colorspace transform to use.
The default is to use the YCC-to-RGB color transform. Byte 11 of the segment signals the color transform:
// 0 = no transform (CMYK for four components, RGB for three)
// 1 = YCbCr
// 2 = YCCK
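A minimal sketch of reading that byte and choosing a color model from it. The layout assumed here is the identifier `Adobe` (5 bytes), version (2), flags0 (2), flags1 (2), then the transform byte at offset 11, and the value mapping follows the way libjpeg interprets the marker (0 = none, 1 = YCbCr, 2 = YCCK); both should be verified against the technical note. The function names are hypothetical.

```python
def parse_adobe_transform(payload: bytes):
    """Return the transform byte of an APP14 "Adobe" segment, or None.

    payload is the segment data after the 2-byte length field.
    """
    if len(payload) < 12 or payload[:5] != b"Adobe":
        return None
    return payload[11]


def choose_color_model(num_components: int, transform):
    """Pick the color model of the decoded samples (assumed mapping)."""
    if num_components == 1:
        return "Gray"
    if num_components == 3:
        # Without a marker, three components default to YCbCr.
        return "RGB" if transform == 0 else "YCbCr"
    if num_components == 4:
        # Without a marker, four components are assumed to be plain CMYK.
        return "YCCK" if transform == 2 else "CMYK"
    raise ValueError(f"unsupported component count: {num_components}")


# Synthetic APP14 payload: "Adobe", version 100, two zero flag words, transform 2.
sample = b"Adobe" + bytes([0, 100, 0, 0, 0, 0, 2])
model = choose_color_model(4, parse_adobe_transform(sample))
```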
Only 8-bit samples are supported (16-bit or other depths require down/up sampling to 8 bit; yet to be implemented).
Once all post-decode image processing is implemented, the final step will be translating the Device Independent Bitmap to PNG for final export from the library.
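Both the coding process and the sample precision can be read from the frame header (SOFn marker) before any decoding starts, which is a cheap way to reject streams that are not supported yet. The sketch below assumes a well-formed stream in which every segment before the frame header carries a length field; `read_frame_header` and the returned dictionary keys are hypothetical names.

```python
import struct

# The first four SOF markers; other SOFn values are differential or
# arithmetic-coded variants and are reported generically here.
SOF_PROCESSES = {
    0xC0: "baseline sequential DCT",
    0xC1: "extended sequential DCT",
    0xC2: "progressive DCT",
    0xC3: "lossless",
}
NOT_SOF = (0xC4, 0xC8, 0xCC)  # DHT, JPG, DAC share the 0xC0-0xCF range


def read_frame_header(data: bytes):
    """Return process, precision, size and component count, or None."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before the real marker
            pos += 1
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if 0xC0 <= marker <= 0xCF and marker not in NOT_SOF:
            precision, height, width, components = struct.unpack(
                ">BHHB", data[pos + 4:pos + 10])
            return {
                "process": SOF_PROCESSES.get(marker, "other SOF variant"),
                "precision": precision,
                "width": width,
                "height": height,
                "components": components,
            }
        if marker == 0xDA:  # start of scan: no frame header was found
            break
        pos += 2 + length  # skip the marker and its segment
    return None


# Synthetic stream: SOI, a dummy APP0 segment, then a progressive frame header.
stream = (b"\xff\xd8" + b"\xff\xe0\x00\x04\x00\x00"
          + b"\xff\xc2\x00\x0b" + bytes([8, 0, 16, 0, 32, 1, 1, 0x11, 0]))
header = read_frame_header(stream)
```

A caller could then refuse anything whose `precision` is not 8 or whose `process` is not yet handled, rather than failing partway through the decode.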
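For that last step, a PNG can be produced with nothing more than zlib and CRC32. The following is a minimal sketch, assuming the pixels are already top-down, tightly packed 8-bit RGB; a real DIB is usually bottom-up BGR with rows padded to 4 bytes, so it would need reordering first. `rgb_to_png` and `_chunk` are hypothetical helper names.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(tag: bytes, body: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type and data."""
    crc = zlib.crc32(tag + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + tag + body + struct.pack(">I", crc)


def rgb_to_png(width: int, height: int, rgb: bytes) -> bytes:
    """Encode top-down, 8-bit RGB samples as a non-interlaced PNG."""
    stride = width * 3
    if len(rgb) != stride * height:
        raise ValueError("pixel buffer does not match the given dimensions")
    # Each scanline is prefixed with filter type 0 (None).
    raw = b"".join(
        b"\x00" + rgb[y * stride:(y + 1) * stride] for y in range(height))
    # Bit depth 8, color type 2 (truecolor), default compression/filter, no interlace.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (PNG_SIGNATURE + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(raw)) + _chunk(b"IEND", b""))


# A 2x1 image: one red pixel, one blue pixel.
png = rgb_to_png(2, 1, bytes([255, 0, 0, 0, 0, 255]))
```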
So coming but not soon.
@fnatzke I've implemented the 4 function types in https://github.com/UglyToad/PdfPig/pull/557 and the separation colorspace now loads the actual function.
Also, you can find a lot of strange PDFs (I'm pretty sure you'll find some using the separation colorspace) here: https://github.com/pdf-association/pdf-corpora#safedocs-issue-tracker-corpus
Well that's a surprise. Think I've recovered. Is there a way we can coordinate contributions?
I've published your link and the URLs of the PDFs I've found (about 30,000 files, roughly 50 GB) in https://github.com/UglyToad/PdfPig/discussions/302.
For the URLs I've found, I've provided a command line to download them in the discussion. Hope it can help someone.
@fnatzke I'm going to create a discussion where we can coordinate
@BobLd do you happen to know the current state of this, is DCT support now complete?
@EliotJones as far as I know it's not. Not sure if @fnatzke is still working on that or not
This is the enhancement request. The filter DCT (Discrete Cosine Transform) is not supported, and as a result, some embedded JPG pictures cannot be extracted.