UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

Support DecodeDCT and separation colorspace #484

Open ilCosmico opened 2 years ago

ilCosmico commented 2 years ago

The attached PDF contains a picture that is read with inverted colors, as shown in the image below.

I started from your sample code and just added the code for saving a BMP file.

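using System;
using System.Drawing;
using System.IO;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.XObjects;
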
internal static class ExtractImages
{
    public static void Run(string filePath)
    {
        using (var document = PdfDocument.Open(filePath))
        {
            foreach (var page in document.GetPages())
            {
                foreach (var image in page.GetImages())
                {
                    if (!image.TryGetBytes(out var b))
                    {
                        b = image.RawBytes;
                    }

                    var type = string.Empty;
                    switch (image)
                    {
                        case XObjectImage ximg:
                            type = "XObject";
                            break;
                        case InlineImage inline:
                            type = "Inline";
                            break;
                    }

                    Console.WriteLine($"Image with {b.Count} bytes of type '{type}' on page {page.Number}. Location: {image.Bounds}.");

                    try
                    {
                        Image im = Image.FromStream(new MemoryStream(b.ToArray()));

                        string filename = "test.bmp";
                        if (File.Exists(filename))
                            File.Delete(filename);
                        im.Save(filename);
                    }
                    catch (Exception e)
                    {
                        Console.WriteLine(e);

                    }
                }
            }
        }
    }
}

test.pdf

fnatzke commented 2 years ago

Hello @ilCosmico, the inverted image in the PDF is an XObject with a 'Separation' colorspace, which is currently not supported. The associated RawBytes (which you have extracted) are a one-byte-per-pixel JPG. The Separation color is based on black, so the final image is inverted. I have working (but not pretty) code supporting this 'Separation' colorspace. The not-pretty part is the support for reading JPG (on .NET Standard). The extracted PNG image is attached.
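To illustrate why the picture comes out inverted, here is a minimal sketch, not PdfPig's actual code. It assumes 8 bits per component, a single black colorant, and a tint transform equivalent to gray = 1 - tint. In a Separation colorspace a tint of 0 means no ink and a tint of 1 means full ink, so displaying the raw samples directly as grayscale flips light and dark. The class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper for illustration only; not part of PdfPig.
public static class SeparationSketch
{
    // Converts 8-bit Separation tint samples to 8-bit DeviceGray samples.
    // Assumption: the colorant is black and the tint transform is simply
    // gray = 1 - tint (tint 0 = no ink = white, tint 255 = full ink = black).
    public static byte[] TintToGray(byte[] tintSamples)
    {
        if (tintSamples == null)
        {
            throw new ArgumentNullException(nameof(tintSamples));
        }

        var gray = new byte[tintSamples.Length];
        for (var i = 0; i < tintSamples.Length; i++)
        {
            gray[i] = (byte)(255 - tintSamples[i]);
        }

        return gray;
    }
}
```

Real files can use other colorants, alternate colorspaces and tint functions, so this only covers the simplest case described above.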

test1

fnatzke commented 1 year ago

@ilCosmico and @EliotJones, better-looking code is now complete. Testing is underway; however, it will still be a while before I am able to check in.

fnatzke commented 1 year ago

Examples of images using the Separation colorspace are rare. After accumulating 30,000 PDFs from public links, I found 202 PDFs with at least one example image. From these 202 PDFs I created a single PDF that copies just the pages containing at least one Separation image (some pages have several).

The single (563 page 278MB) PDF named PdfWithSeveralSeparationImages20230308.PDF

can be found in the ZIP at: https://www.dropbox.com/s/0ec0y5hrtmk78tt/PdfWithSeveralSeparationImagesPageSources20230308.zip?dl=1

In the ZIP is

  1. PdfWithSeveralSeparationImages20230308.PDF

  2. DescriptionOfPdfWithSeveralSeparationImages20230308.txt A comma-separated text file describing each page of the PDF. Columns:

    1. PageNumber (within PdfWithSeveralSeparationImages20230308.PDF)
    2. ImageNumber (the ordinal of the image on the page)
    3. BitsPerComponent
    4. Width
    5. Height
    6. RawByteSize
    7. AltColorSpaceName
    8. TintFunctionNumber

  3. PdfWithSeveralSeparationImagesPageSources20230308.txt A comma-separated text file describing each page of PdfWithSeveralSeparationImages20230308.PDF and where the page was copied from. Columns:

    1. PageNumber
    2. SourceFileNumber
    3. SourceFileNameWithoutExtention
    4. SourceFilePageNumber
    5. SourceImageNumber

    Use this together with the following and last file in the ZIP.

  4. PdfWithSeveralSeparationImagesPagePublicUrlSources20230308.txt A simple two-column, comma-separated text file that provides the full public URL for a given source file name (without extension). Columns:

    1. SourceFileNameWithoutExtension (column 3 from previous file)
    2. PublicURL

There are some 5640 example images using the Separation colorspace, which cover all the "Tint Functions" and many alternate colorspaces. Hope it helps someone.

fnatzke commented 1 year ago

Comments on this activity spilled over into #532. Copied here to put them into context.

The Separation colorspace itself makes use of a “Tint function”, which can be implemented as one of 4 function types: 0 = Sampled function, 2 = Exponential interpolation function, 3 = Stitching function, 4 = PostScript calculator function.
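As an illustration of the simplest of these, the sketch below evaluates a Type 2 (exponential interpolation) function, which the PDF specification defines as y_j = C0_j + x^N × (C1_j − C0_j). This is only a sketch under stated assumptions: the input domain is taken to be [0, 1], no Range clipping is applied, and the class and parameter names are hypothetical rather than PdfPig's API.

```csharp
using System;

// Hypothetical helper for illustration only; not PdfPig's implementation.
public static class ExponentialFunctionSketch
{
    // Evaluates y_j = C0_j + x^N * (C1_j - C0_j) for every output component j.
    // Assumption: the function Domain is [0, 1] and there is no Range entry.
    public static double[] Evaluate(double x, double[] c0, double[] c1, double n)
    {
        if (c0 == null || c1 == null || c0.Length != c1.Length)
        {
            throw new ArgumentException("C0 and C1 must have the same length.");
        }

        // Clip the input to the assumed domain.
        x = Math.Min(1.0, Math.Max(0.0, x));
        var factor = Math.Pow(x, n);

        var result = new double[c0.Length];
        for (var j = 0; j < c0.Length; j++)
        {
            result[j] = c0[j] + factor * (c1[j] - c0[j]);
        }

        return result;
    }
}
```

For a black Separation with a DeviceCMYK alternate, a call might look like `Evaluate(0.5, new[] { 0.0, 0.0, 0.0, 0.0 }, new[] { 0.0, 0.0, 0.0, 1.0 }, 1.0)`, which maps the tint linearly onto the K channel.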

These are also well underway; however, testing them will again be a significant effort.

Have found 11 (public) “in the wild” example PDFs using separation colorspace [it's rare].

DCT (Discrete Cosine Transform) decoding, based on ITU-T T.81 section 4.5, has four distinct modes of operation with various coding processes:

  1. sequential DCT-based,
  2. progressive DCT-based,
  3. lossless, and
  4. hierarchical.

Currently only mode 1 is implemented. PDF spec calls for mode 2 support (but unlikely to be needed in practice for most PDFs). Supports 8-bit grayscale and YCbCr images. Translation to RGB colorspace done. Other colorspaces require work. Supports restart markers.
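For reference, the YCbCr to RGB translation mentioned above can be sketched as follows. This uses the full-range conversion equations from the JFIF specification and is an assumption about the kind of transform involved, not the code from this work; the class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class YCbCrSketch
{
    // Converts one full-range 8-bit YCbCr sample to 8-bit RGB using the
    // JFIF equations, rounding and clamping each channel to 0..255.
    public static (byte R, byte G, byte B) ToRgb(byte y, byte cb, byte cr)
    {
        double cbShifted = cb - 128.0;
        double crShifted = cr - 128.0;

        var r = y + 1.402 * crShifted;
        var g = y - 0.344136 * cbShifted - 0.714136 * crShifted;
        var b = y + 1.772 * cbShifted;

        return (Clamp(r), Clamp(g), Clamp(b));
    }

    private static byte Clamp(double value)
    {
        var rounded = Math.Round(value);
        return (byte)Math.Max(0.0, Math.Min(255.0, rounded));
    }
}
```

With neutral chroma (Cb = Cr = 128) the output is a gray equal to Y, which is a quick sanity check for any implementation.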

Adobe Technical Note TN.5116 details additional decode handling (from inside a PDF), including support for the APP14 "Adobe" application segment, which carries a hint for the colorspace transform. The default is to use the YCC-to-RGB color transform. Byte 11 of the segment signals the color transform: 0 = unknown (CMYK for four-component images), 1 = YCbCr, 2 = YCCK.
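A minimal sketch of reading that hint is shown below. It assumes the commonly documented APP14 layout: the 5-byte "Adobe" identifier, a 2-byte version, two 2-byte flag fields, and then the 1-byte transform flag at offset 11, with values 0 = unknown, 1 = YCbCr and 2 = YCCK. The input is assumed to be the segment payload after the marker and length bytes. All type and method names are hypothetical.

```csharp
using System;

// Transform flag values as commonly documented for the Adobe APP14 segment.
public enum AdobeColorTransform
{
    Unknown = 0, // no transform: RGB for 3 components, CMYK for 4
    YCbCr = 1,
    YCCK = 2
}

// Hypothetical helper for illustration only.
public static class AdobeApp14Sketch
{
    private const string Identifier = "Adobe";
    private const int TransformOffset = 11;

    // Reads the transform flag from an APP14 payload (the bytes that follow
    // the 0xFFEE marker and the 2-byte segment length).
    public static bool TryReadTransform(byte[] payload, out AdobeColorTransform transform)
    {
        transform = AdobeColorTransform.Unknown;

        if (payload == null || payload.Length <= TransformOffset)
        {
            return false;
        }

        for (var i = 0; i < Identifier.Length; i++)
        {
            if (payload[i] != (byte)Identifier[i])
            {
                return false; // not an Adobe segment
            }
        }

        var flag = payload[TransformOffset];
        if (flag > 2)
        {
            return false; // value outside the documented range
        }

        transform = (AdobeColorTransform)flag;
        return true;
    }
}
```

A decoder would typically fall back to its default (YCbCr for three-component images) when this method returns false.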

8 bit only (16bit or others require down/up sampling to 8 bit; yet to be implemented).

After all image post-processing is implemented, the final step will be translating the Device Independent Bitmap to PNG for final export from the library.

@BobLd assigned writing the PDF Functions ("Tint functions") to himself silently and completed implementation and testing in secret. The Separation colorspace has been updated with the Tint function. Still not sure why. Separation use is so rare.

This issue (#484) as originally raised was a little broader: it was titled "Wrong color reading a picture" and was more about image export than just the Separation colorspace. So, if we go back to the original issue, there is still work to do in PngFromPdfImageFactory and ColorSpaceDetailsByteConverter to convert to RGB for PNG rendering on export.

@BobLd are you going to do that part?

For the image export from the supplied PDF (test.pdf) to succeed, DecodeDCT is also needed (this was later raised separately by someone else as #532, so that seems like the place to track progress).

Perhaps someone would be kind enough to rename this issue back, please (now that Separation is done).

fnatzke commented 1 year ago

@BobLd

thank you for renaming

rename doesn't match issue raised.

original issue raised was about image export.

although not mentioned, I believe the author's intention was to ask about image.TryGetPng

three things are required:

  1. DCTDecode filter (#532)
  2. Separation Colorspace
  3. Changes to PngFromPdfImageFactory and ColorSpaceDetailsByteConverter

For example: src\UglyToad.PdfPig\Images\Png\PngFromPdfImageFactory.cs Line 17

image

\src\UglyToad.PdfPig\Images\ColorSpaceDetailsByteConverter.cs

@BobLd I'm checking to see if you are going to complete the changes to PngFromPdfImageFactory and ColorSpaceDetailsByteConverter?

BobLd commented 1 year ago

@fnatzke

@BobLd assigned writing the PDF Functions ("Tint functions") to himself silently and completed implementation and testing in secret. The Separation colorspace has been updated with the Tint function. Still not sure why. Separation use is so rare.

Not sure I understand your comment above and how I should take it, there is nothing secret. Regarding the tint function in the Separation color space, this is the definition of a Separation color space. Now we can actually use the function.

@BobLd I'm checking to see if you are going to complete the changes to PngFromPdfImageFactory and ColorSpaceDetailsByteConverter?

As mentioned earlier, I've created a discussion here https://github.com/UglyToad/PdfPig/discussions/574 and a project here https://github.com/UglyToad/PdfPig/projects/5 where we can coordinate contributions, as per your request in #532. The short answer to the question is yes.

fnatzke commented 1 year ago

Please take it this way. I wrote that I was working on this issue. You did not. I spent many days working on this (on and off over 4 months). It was a very large amount of time wasted. For nothing. Perhaps this was code you had lying around and you just had to make some test cases and publish. Perhaps you had to write it from scratch. Either way, you were silent from August last year (and then spent time in secret creating a PR). The fact that it works is irrelevant. It's about being decent.

ilCosmico commented 1 year ago

@ilCosmico and @EliotJones, better-looking code is now complete. Testing is underway; however, it will still be a while before I am able to check in.

@fnatzke any ETA about the release of the build containing the fix? Thanks in advance!

ilCosmico commented 7 months ago

Hello @fnatzke, it's been a year since we last touched base about the testing phase. Any updates on that front? Time flies! 😅

ilCosmico commented 1 month ago

Any news on this?

BobLd commented 1 month ago

@ilCosmico I'm planning to add DCT support in a separate NuGet package shortly via JpegLibrary (hopefully in the next 2 weeks)

I already have a proof of concept here https://github.com/BobLd/PdfPig/tree/develop-caly

ilCosmico commented 1 month ago

@BobLd nice to hear that! Will it be a new NuGet package referenced by PdfPig?

BobLd commented 1 month ago

@ilCosmico the opposite, the new package references PdfPig.

I've released an initial version of the code here: https://github.com/BobLd/UglyToad.PdfPig.Filters.Dct.JpegLibrary. It's also available as a NuGet package (pre-release).

Have a look at the repo's README to understand how to use it. I'll simplify the use soon.

ilCosmico commented 1 month ago

@BobLd thanks for the clarification. I tried it, and it works quite well. Do you have an ETA for the official release?

BobLd commented 4 weeks ago

@ilCosmico thanks for the feedback. I first need to publish the official release of PdfPig, and the filter NuGet package will follow.

This will happen shortly