dlemstra / Magick.NET

The .NET library for ImageMagick
Apache License 2.0

ImageMagick CLI behaves differently than equivalent .NET code for PDF to TIF conversion #466

Closed 36PopTarts closed 5 years ago

36PopTarts commented 5 years ago

Question

Hello, and thanks for writing this library. I'm currently working on a .NET Core service application which processes PDF documents with the Tesseract OCR engine. To do this, it first uses ImageMagick to prep the PDF to a TIF with good quality for OCR processing. I recently replaced another developer, and that developer seemed to know about ImageMagick but not this library, so he was calling ImageMagick through the shell in C# code and passing command line arguments as you normally do.

The command line worked well enough, but a bug was recently brought to my attention where documents would sometimes be rendered with mostly pure-black color after processing -- this was due to the alpha layer being removed on documents which, as best as I can guess, had a pure black layer with an alpha layer on top which "punched holes" around the text, like popping a cardboard cutout from a sheet. That bug is probably not within the scope of this issue; I determined that it seemed to work more consistently across documents which are uploaded to our system if I don't use -alpha Off on images which have more than 2 samples per pixel. But it is ultimately why I made the change to this library.

Anyway, I figured the best way to implement that was to switch to the Magick.NET library, so that I can read the image attributes and decide within the application without double-processing an image. The only problem is that the images no longer come out at nearly the same resolution as before, even when I run the command line shown below manually and compare the results to the service. When I run the command line, the resulting .TIF image always has a resolution of 2550x3300, which is plenty high enough quality for OCRing. When I run my code, the .TIF image comes out at 612x792, which matches the dimensions (in points) of the MediaBox property in the PDF document, and is the size of a letter page rendered at a density of 72 DPI. That's not high enough because we're shooting for 300 DPI.
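To make the two sizes concrete, here is a small sketch of the arithmetic, assuming the usual PDF convention of 72 points per inch. `PdfRasterSize` and `ToPixels` are hypothetical names used only for illustration; they are not part of Magick.NET, and the renderer's own rounding may differ for page sizes that do not divide evenly.

```csharp
using System;

public static class PdfRasterSize
{
    // PDF page boxes are measured in points; 1 point = 1/72 inch.
    private const double PointsPerInch = 72.0;

    // Hypothetical helper: pixels = points * dpi / 72, rounded to the nearest pixel.
    public static int ToPixels(double points, double dpi)
    {
        return (int)Math.Round(points * dpi / PointsPerInch);
    }
}
```

For the letter-size MediaBox of 612x792 points, `ToPixels(612, 72)` and `ToPixels(792, 72)` give 612 and 792, while `ToPixels(612, 300)` and `ToPixels(792, 300)` give 2550 and 3300.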

Here's what that command line looked like:

convert -depth 16 -density 300x300 -compress lzw -background white -colorspace RGB -despeckle -flatten -alpha Off "input.pdf" "output.tif"

And my code in .NET for converting and saving the image:

using (MagickImageCollection img = new MagickImageCollection(inputPageFilePath))
{
    IMagickImage image;

    if (img.Count > 1)
        image = img.Flatten();
    else
        image = img[0];

    if (image.ColorType == ColorType.Bilevel)
    {
        image.Alpha(AlphaOption.Off);
    }
    else
        image.Alpha(AlphaOption.Remove);

    image.Density = RESOLUTION_DPI;
    image.ColorSpace = ColorSpace.RGB;
    image.BackgroundColor = MagickColor.FromRgb(255, 255, 255);
    image.Format = _processorConfig.CompressedImageFormat;
    image.Depth = _processorConfig.CompressedBitsPerSample;
    image.Despeckle();

    using (MemoryStream outStream = new MemoryStream())
    {
        ImageOptimizer optimizer = new ImageOptimizer();

        image.Write(outStream);
        outStream.Seek(0, SeekOrigin.Begin);
        //optimizer.LosslessCompress(outStream);
        //outStream.Seek(0, SeekOrigin.Begin);
        image = new MagickImage(outStream,
            new MagickReadSettings() { Format = _processorConfig.CompressedImageFormat });
        image.Write(outputImageFilePath);
    }
}

I commented out the compression call because at first I thought it was causing the loss in quality, but then I checked with tiffinfo and realized that the pixel dimensions were actually much lower. As far as I can tell, the only remaining difference between the CLI and my code is that I'm not using compression, and I'm leaving that out until I can figure out why the image still comes out at its native size. Tiffinfo also says the density is the same on both versions of the image:

~/sftp$ tiffinfo cli_converted.tif
TIFF Directory at offset 0x209a56 (2136662)
  Image Width: 2550 Image Length: 3300
  Resolution: 300, 300 pixels/inch
  Bits/Sample: 16
  Compression Scheme: LZW
  Photometric Interpretation: RGB color
  Extra Samples: 1<unassoc-alpha>
  FillOrder: msb-to-lsb
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 4
  Rows/Strip: 3300
  Planar Configuration: single image plane
  Page Number: 0-1
  Predictor: horizontal differencing 2 (0x2)

~/sftp$ tiffinfo net_converted.tif
TIFF Directory at offset 0x2c6048 (2908232)
  Image Width: 612 Image Length: 792
  Resolution: 300, 300 pixels/inch
  Bits/Sample: 16
  Compression Scheme: None
  Photometric Interpretation: RGB color
  FillOrder: msb-to-lsb
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 3
  Rows/Strip: 792
  Planar Configuration: single image plane
  Page Number: 0-1
  White Point: 0.3127-0.329
  PrimaryChromaticities: 0.640000,0.330000,0.300000,0.600000,0.150000,0.060000
  ICC Profile: <present>, 2576 bytes

I'm still relatively new to image manipulation in general, although I have worked with PDFs a lot in the past (not for image manipulation, however). Is there a resample or resize step I am missing here? I also tried resampling to 300 DPI through image.Resample(new PointD(300, 300)), but the image came out looking just as terrible. I would post the image files themselves, but unfortunately our firm does work for the education sector and I simply cannot post these documents. It doesn't help that there is only a limited set of documents which produce the issue I'm trying to fix. If this is not enough information, I can try to find a good sample image that does not have any personal information on it, but that will take time. Any help you could offer to get my .NET code outputting the same resolution images as the CLI tool would be appreciated.

dlemstra commented 5 years ago

Thanks so much for the detailed explanation!

Your code is not the same as the command, and the command that you have is an ImageMagick 6 command rather than an ImageMagick 7 command. ImageMagick 6 accepts the options in any order, but ImageMagick 7 is stricter about the order, so the command is not that easy to translate to code.

Your primary problem is that you will need to specify the Density before you read the image:

var readSettings = new MagickReadSettings()
{
    Density = RESOLUTION_DPI
};

using (MagickImageCollection img = new MagickImageCollection(inputPageFilePath, readSettings))
{
}

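When you read the image like this it will be read at 300 DPI and result in an image with the resolution that you expect. A PDF is rasterized at the moment it is read, so setting Density on the image afterwards only changes the stored resolution value and not the pixel dimensions.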

And if you want to compress the image with LZW compression you will need to do this:

image.Compression = CompressionMethod.LZW;

p.s. Don't use the ImageOptimizer here, it does nothing for TIFF images.

36PopTarts commented 5 years ago

Thank you for clarifying! I was able to produce the exact same images (even the tiffinfo profiles match) using your suggestion. Only one small correction though, it would appear that the Compression property on the IMagickImage interface is read-only, and I found a forum post which states that I should set it through the image.Settings.Compression property instead. That worked and the image was LZW-compressed. I'm very glad that I don't have to use calls to shell command lines in the middle of a C# application anymore!
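For anyone who finds this later, here is a minimal sketch of how the two fixes fit together. It is not the actual service code. The class and method names, the hard-coded 300 DPI and 16-bit depth (taken from the command line earlier in this thread), and the single-page assumption are all illustrative. It leaves out the alpha, despeckle, and colorspace steps, and reading PDFs also requires Ghostscript to be installed.

```csharp
using ImageMagick;

public static class PdfToTiffSketch
{
    // Hypothetical helper: converts the first page of a PDF to an LZW-compressed TIFF.
    public static void ConvertFirstPage(string inputPdfPath, string outputTiffPath)
    {
        var readSettings = new MagickReadSettings
        {
            // Must be set before reading: the PDF is rasterized at this density.
            Density = new Density(300, 300)
        };

        using (var pages = new MagickImageCollection(inputPdfPath, readSettings))
        {
            // Assumes a single-page input; the collection owns and disposes the image.
            var image = pages[0];

            image.Format = MagickFormat.Tiff;
            image.Depth = 16;

            // Compression on the image itself is read-only; set it through Settings.
            image.Settings.Compression = CompressionMethod.LZW;

            image.Write(outputTiffPath);
        }
    }
}
```

With the density applied at read time, a 612x792 point letter page should come out at 2550x3300 pixels, as in the CLI output above.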

dlemstra commented 5 years ago

My bad, I keep forgetting that 😄. And happy to hear that you got it working.