Other output image formats

Zanzacar commented 2 years ago

I was reviewing muPDF_explored and noticed on page 136 that the original C API supports TIFF output, which is what I am mainly looking for in my current project. However, per line 184 of muPDF.cs, it appears we only have an enum for a select set of these values, plus some that are not supported in the original C API. Is there a way to add the TIFF format, or other image formats, to the .NET library?

arklumpus commented 2 years ago

Hi! My understanding is that those are the input formats for images; in fact, that list at page 136 of MuPDF explored corresponds to the MuPDFCore.InputFileTypes enum.

The output formats are described in chapter 14 of MuPDF explored ("Rendered Output Formats"), page 87 and following. These are the same formats supported by MuPDFCore and mutool (one of the "official" programs using MuPDF created by Artifex).

If you need to produce TIFF images (or images in any other format), an option would be to use MuPDF to get the raw image pixel data (e.g. by using one of the overloads of the Render method that return a byte[] or take an IntPtr argument) and then use another library such as ImageSharp to create the TIFF file.

Zanzacar commented 2 years ago

After posting last night, I was afraid that was the case.

What I am currently doing, which might not be optimal, is calling document.SaveImage(page, zoom, color, PNG format) and then taking that file and processing it with Magick.Net.

It may be faster to process the image via a byte[] rather than writing it out to the hard drive and reading it back in.

arklumpus commented 2 years ago

I don't know about Magick.Net, but using ImageSharp I think the most efficient way to do it would be something like this:

using MuPDFCore;
using SixLabors.ImageSharp;

// ...

// Initialize MuPDF context.
using MuPDFContext ctx = new MuPDFContext();

// Open PDF document.
using MuPDFDocument doc = new MuPDFDocument(ctx, @"path/to/PDF/file.pdf");

// Page number.
int pageNumber = 0;

// Zoom level at which the page will be rendered.
double zoom = 1.0;

// Get the size of the rendered image (width and height).
RoundedRectangle pageSize = doc.Pages[pageNumber].Bounds.Round(zoom);

// Get the size in bytes of the rendered image (this should be width * height * 3).
int byteSize = doc.GetRenderedSize(pageNumber, zoom, PixelFormats.RGB);

// Allocate the required unmanaged memory.
IntPtr destination = System.Runtime.InteropServices.Marshal.AllocHGlobal(byteSize);

// Render and save inside try/finally, so that the unmanaged memory is released even if an exception is thrown.
try
{
    // Render the image to raw pixels in RGB format, saving the results in the memory that has been allocated.
    doc.Render(pageNumber, zoom, PixelFormats.RGB, destination);

    // We need an unsafe context in order to create a ReadOnlySpan from an IntPtr.
    unsafe
    {
        // Create the ImageSharp image from the data in unmanaged memory.
        using Image image = Image.LoadPixelData<SixLabors.ImageSharp.PixelFormats.Rgb24>(new ReadOnlySpan<byte>((void*)destination, byteSize), pageSize.Width, pageSize.Height);

        // Save the image as TIFF.
        image.SaveAsTiff(@"path/to/output/file.tiff");
    }
}
finally
{
    // Release the unmanaged memory.
    System.Runtime.InteropServices.Marshal.FreeHGlobal(destination);
}

You will need to compile this using /unsafe (or set AllowUnsafeBlocks to true in the project file).

I assume it will be similar with other graphics libraries; you just need to find a way to load the image data from an IntPtr in your library.

I would recommend using the overloads of the Render method that save the image to an IntPtr: these are faster, because the data is not marshaled. If you instead use the overloads that return a byte[], the library will first save the image to unmanaged memory, and then copy the unmanaged array into a managed byte array. This means that you need twice as much RAM (though only briefly), and you need some time to copy the data as well (though, depending on your use case, this might not make a big difference).
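To make the extra copy concrete, here is a minimal sketch of what going from unmanaged memory to a managed byte[] roughly involves. It does not use MuPDFCore at all: FillPixels is a hypothetical stand-in for the renderer, and the 4x2 image size is made up for illustration. The point is only that, while the copy is being made, both the unmanaged buffer and the managed array exist at the same time.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-in for a renderer that writes RGB pixels into unmanaged memory.
static void FillPixels(IntPtr destination, int byteSize)
{
    for (int i = 0; i < byteSize; i++)
    {
        Marshal.WriteByte(destination, i, (byte)(i % 256));
    }
}

// Made-up image size: 4x2 pixels, 3 bytes per pixel (RGB).
int width = 4;
int height = 2;
int byteSize = width * height * 3;

// First buffer: unmanaged memory that the renderer writes into.
IntPtr unmanaged = Marshal.AllocHGlobal(byteSize);

// Second buffer: the managed array that would be handed back to the caller.
byte[] managed = new byte[byteSize];

try
{
    FillPixels(unmanaged, byteSize);

    // The extra step: copy every byte from unmanaged to managed memory.
    // Both buffers are alive here, hence the (brief) doubled memory use.
    Marshal.Copy(unmanaged, managed, 0, byteSize);
}
finally
{
    // The unmanaged buffer is released as soon as the copy is done.
    Marshal.FreeHGlobal(unmanaged);
}

Console.WriteLine($"Copied {managed.Length} bytes");
```

Rendering straight to an IntPtr that you own skips the Marshal.Copy step and the second buffer, which is where the saving comes from.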

arklumpus / MuPDFCore

Other output image formats #15