UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

Extract Image #310

Closed: mind-ra closed this issue 3 years ago

mind-ra commented 3 years ago

Hi all.

I have a PDF with multiple pages like the sample image here: test

PdfPig lets me extract the text on the page without issue, but when I try to call GetImages() the result is empty. The page dictionary's ToString() output is as follows:

<Contents, 181 0>, 
<CropBox, [ 0, 0, 498, 708 ]>, 
<MediaBox, [ 0, 0, 498, 708 ]>, 
<Parent, 1765 0>, 
<Resources, <Font, <F1, 1333 0>, <F2, 1341 0>, <F3, 1338 0>, <F4, 1339 0>>,
<ProcSet, [ /PDF, /Text, /ImageB, /ImageC, /ImageI ]>, <XObject, <Xf5, 182 0>>>, 
<Rotate, 90>, 
<Type, /Page>

The only thing out of the ordinary, I think, is the <XObject, <Xf5, 182 0>> entry, but the Xf5 NameToken doesn't exist and I cannot extract the data from there.

Can someone give me some pointers?

InusualZ commented 3 years ago

I'm not 100% sure, but what you are seeing on that page may not be an image. It may be hundreds of paths (curves) that compose an "image", which is why you may not be able to extract it.

mind-ra commented 3 years ago

Sure, the "image" is vector-based, so it must be made of paths of some kind. I wonder if I can extract them and try to find out what kind of file they come from. I suspect the original file comes from some sort of CAD software.

InusualZ commented 3 years ago

Paths in a PDF are not in a special format. The PDF spec contains instructions specifically for drawing paths (curves).

If they are paths, you can go through them like so:

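using UglyToad.PdfPig;

// stream: a readable Stream containing the PDF
using (var document = PdfDocument.Open(stream))
{
    // Page numbers are 1-based
    var page = document.GetPage(1);
    // Later in this thread the paths were reached via page.ExperimentalAccess.Paths
    foreach (var path in page.Content.Paths)
    {
        // Do something with the path
    }
}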

If I wanted to extract the paths, I would convert them to SVG, since it's a very common and fairly simple format. From there you can open the result anywhere (browser, Paint, Photoshop, Illustrator, etc.).
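As a rough illustration of the SVG idea, here is a minimal sketch that writes straight line segments as SVG path elements. The `LineSegment` class and `SvgSketch.ToSvg` are hypothetical names made up for this example and are not PdfPig types; the sketch also assumes the coordinates are in PDF page space (y pointing up), so it flips y against the page height because SVG's y axis points down. Numbers are formatted with the invariant culture so the decimal separator is always a dot.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical stand-in for one straight line of a path; not a PdfPig type.
public sealed class LineSegment
{
    public LineSegment(double x1, double y1, double x2, double y2)
    {
        X1 = x1; Y1 = y1; X2 = x2; Y2 = y2;
    }

    public double X1 { get; }
    public double Y1 { get; }
    public double X2 { get; }
    public double Y2 { get; }
}

public static class SvgSketch
{
    // Builds a small SVG document with one <path> per segment.
    // PDF y grows upwards and SVG y grows downwards, so y is flipped.
    public static string ToSvg(IEnumerable<LineSegment> segments, double width, double height)
    {
        var inv = CultureInfo.InvariantCulture;
        var sb = new StringBuilder();
        sb.AppendFormat(inv, "<svg xmlns='http://www.w3.org/2000/svg' width='{0}' height='{1}'>", width, height);

        foreach (var s in segments)
        {
            sb.AppendFormat(
                inv,
                "<path d='M {0} {1} L {2} {3}' fill='none' stroke='rgb(0,0,0)'></path>",
                s.X1, height - s.Y1, s.X2, height - s.Y2);
        }

        sb.Append("</svg>");
        return sb.ToString();
    }

    public static void Main()
    {
        var segments = new[] { new LineSegment(0, 0, 10, 5) };
        Console.WriteLine(ToSvg(segments, 100, 50));
    }
}
```

Saving the returned string to a `.svg` file should be enough to open it in a browser. Curves would need the SVG `C` command in addition to `M` and `L`, which this sketch leaves out.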

mind-ra commented 3 years ago

Thanks for the tips @InusualZ.

Digging in the code, I found page.ExperimentalAccess.Paths and the SvgTextExporter. I'm testing them out to get what I need.

mind-ra commented 3 years ago

I successfully used page.ExperimentalAccess.Paths and the SvgTextExporter to extract all the data I needed.

I encountered unexpected behavior caused by the StringBuilder used to create the SVG path. Its number formatting depends on the system culture, so I had to set Thread.CurrentThread.CurrentCulture = System.Globalization.CultureInfo.InvariantCulture; otherwise the SVG data is written incorrectly (in my case with a comma as the decimal separator). The first path below is the incorrect output and the second is the correct one:

<path d='M 68,93027416592 57,73636789759996 L 67,85095909544 58,36281686479998' fill='none' stroke='rgb(0,0,0)' stroke-width='0,902778' stroke-linecap='round' stroke-linejoin='round'></path>

<path d='M 68.93027416592 57.73636789759996 L 67.85095909544 58.36281686479998' fill='none' stroke='rgb(0,0,0)' stroke-width='0.902778' stroke-linecap='round' stroke-linejoin='round'></path>
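The culture effect described above can be reproduced without PdfPig. The following is a minimal sketch, not the SvgTextExporter code: `CultureDemo` and its methods are hypothetical names, and the comma culture is built by cloning the invariant culture so the example does not depend on which cultures are installed. It contrasts `StringBuilder.Append(double)`, which formats with the current culture, with an explicit `ToString(CultureInfo.InvariantCulture)`.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CultureDemo
{
    // Formats an SVG "move to" command with whatever culture is current.
    public static string MoveToCurrentCulture(double x, double y)
    {
        var sb = new StringBuilder();
        sb.Append("M ").Append(x).Append(' ').Append(y);
        return sb.ToString();
    }

    // Formats the same command with the invariant culture,
    // so the decimal separator is always a dot.
    public static string MoveToInvariant(double x, double y)
    {
        var sb = new StringBuilder();
        sb.Append("M ")
          .Append(x.ToString(CultureInfo.InvariantCulture))
          .Append(' ')
          .Append(y.ToString(CultureInfo.InvariantCulture));
        return sb.ToString();
    }

    public static void Main()
    {
        // Simulate a system culture that uses a comma as the decimal separator.
        var commaCulture = (CultureInfo)CultureInfo.InvariantCulture.Clone();
        commaCulture.NumberFormat.NumberDecimalSeparator = ",";
        CultureInfo.CurrentCulture = commaCulture;

        Console.WriteLine(MoveToCurrentCulture(68.5, 57.25)); // M 68,5 57,25
        Console.WriteLine(MoveToInvariant(68.5, 57.25));      // M 68.5 57.25
    }
}
```

Setting the thread culture globally, as done in this thread, works around the problem from the caller's side; passing the invariant culture at each formatting call is the alternative when the formatting code itself can be changed.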

I think we can close this issue.

InusualZ commented 3 years ago

Yeah, I don't think that class was meant to be used in the general case, so it could contain errors. Happy that I could help.