mind-ra closed this issue 3 years ago
I'm not 100% sure, but what you are seeing on that page may not be an image. It may be hundreds of paths (curves) that together compose an "image". That's why you may not be able to extract it.
Sure, the "image" is vector-based, so it must be made of paths of some kind. I wonder if I can extract them and try to find out what kind of file they came from. I suspect the original file is from some sort of CAD software.
Paths in a PDF are not in a special format. The PDF specification contains instructions specifically for drawing paths (curves).
If they are paths, you can go through them like so:
    using (var document = PdfDocument.Open(stream))
    {
        var page = document.GetPage(1);
        foreach (var path in page.Content.Paths)
        {
            // Do something with the path
        }
    }
If I wanted to extract the paths, I would convert them to SVG, since it's a very common and fairly simple format. From there you can open the result almost anywhere (browser, Paint, Photoshop, Illustrator, etc.).
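The conversion suggested above can be sketched roughly as follows. This is a minimal, self-contained illustration that does not use PdfPig's own types: `PathOp` and `SvgPathSketch` are hypothetical names invented for the example. It relies only on the standard PDF path-construction operators (`m`, `l`, `c`, `h`) and their SVG counterparts (`M`, `L`, `C`, `Z`), and it assumes the default PDF coordinate system (origin at the bottom-left), so the y axis is flipped for SVG.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical stand-in for one path-construction operator read from a PDF
// content stream: the operator name plus its numeric operands.
public sealed class PathOp
{
    public string Operator { get; }
    public double[] Operands { get; }

    public PathOp(string op, params double[] operands)
    {
        Operator = op;
        Operands = operands;
    }
}

public static class SvgPathSketch
{
    // Builds the value of an SVG "d" attribute from PDF path operators.
    // pageHeight is used to flip the y axis (PDF is bottom-up, SVG is top-down).
    public static string ToSvgPathData(IEnumerable<PathOp> ops, double pageHeight)
    {
        var sb = new StringBuilder();
        foreach (var op in ops)
        {
            string command;
            switch (op.Operator)
            {
                case "m": command = "M"; break; // move to
                case "l": command = "L"; break; // line to
                case "c": command = "C"; break; // cubic Bezier curve
                case "h": command = "Z"; break; // close subpath
                default:
                    throw new NotSupportedException("Unsupported operator: " + op.Operator);
            }

            if (sb.Length > 0) sb.Append(' ');
            sb.Append(command);

            // Operands come in (x, y) pairs; flip each y coordinate.
            for (int i = 0; i + 1 < op.Operands.Length; i += 2)
            {
                double x = op.Operands[i];
                double y = pageHeight - op.Operands[i + 1];
                // Invariant culture keeps '.' as the decimal separator.
                sb.Append(' ').Append(x.ToString(CultureInfo.InvariantCulture));
                sb.Append(' ').Append(y.ToString(CultureInfo.InvariantCulture));
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        var ops = new[]
        {
            new PathOp("m", 10, 20),
            new PathOp("l", 30, 40),
            new PathOp("h"),
        };
        // Prints: M 10 80 L 30 60 Z
        Console.WriteLine(ToSvgPathData(ops, 100));
    }
}
```

The resulting string can be placed in the `d` attribute of a `<path>` element. Stroke, fill, and any transformation matrix in effect are left out of this sketch.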
Thanks for the tips @InusualZ.
Digging in the code, I found page.ExperimentalAccess.Paths and the SvgTextExporter.
I'm testing them out to get what I need.
I successfully used page.ExperimentalAccess.Paths and the SvgTextExporter to extract all the data I needed.
I encountered an unexpected behavior caused by the StringBuilder used for the SVG path creation.
It is subject to the culture of the system, and I had to set Thread.CurrentThread.CurrentCulture = System.Globalization.CultureInfo.InvariantCulture; otherwise the SVG data is written incorrectly (in my case, with a comma as the decimal separator). The first path below is the wrong output, and the second is the correct one:
<path d='M 68,93027416592 57,73636789759996 L 67,85095909544 58,36281686479998' fill='none' stroke='rgb(0,0,0)' stroke-width='0,902778' stroke-linecap='round' stroke-linejoin='round'></path>
<path d='M 68.93027416592 57.73636789759996 L 67.85095909544 58.36281686479998' fill='none' stroke='rgb(0,0,0)' stroke-width='0.902778' stroke-linecap='round' stroke-linejoin='round'></path>
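The effect described above can be reproduced outside PdfPig. The following is a minimal sketch, not PdfPig code: `CultureSketch`, `FormatWithCurrentCulture`, and `WithInvariantCulture` are hypothetical names, and `it-IT` is only an assumed example of a culture that uses a comma as the decimal separator (it needs culture data to be available on the machine). It shows that `StringBuilder.Append(double)` follows the current culture, and one way to switch to the invariant culture only for the duration of the export and then restore the previous one.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Threading;

public static class CultureSketch
{
    // Formats a number the way a culture-unaware exporter would:
    // StringBuilder.Append(double) uses the current thread culture.
    public static string FormatWithCurrentCulture(double value)
    {
        return new StringBuilder().Append(value).ToString();
    }

    // Runs a function under the invariant culture, then restores the old one.
    public static T WithInvariantCulture<T>(Func<T> action)
    {
        var previous = Thread.CurrentThread.CurrentCulture;
        Thread.CurrentThread.CurrentCulture = CultureInfo.InvariantCulture;
        try
        {
            return action();
        }
        finally
        {
            Thread.CurrentThread.CurrentCulture = previous;
        }
    }

    public static void Main()
    {
        // Assumed example culture with a comma decimal separator.
        Thread.CurrentThread.CurrentCulture = new CultureInfo("it-IT");

        // Expected to use a comma as the decimal separator.
        Console.WriteLine(FormatWithCurrentCulture(0.902778));

        // Expected to use a dot as the decimal separator.
        Console.WriteLine(WithInvariantCulture(() => FormatWithCurrentCulture(0.902778)));
    }
}
```

Scoping the change with `try`/`finally` avoids leaving the whole thread on the invariant culture, which may matter if the same thread later formats text for users.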
I think we can close this issue.
Yeah, I don't think that class was meant to be used in the general case, so it could contain errors. Happy that I could help.
Hi all.
I have a PDF with multiple pages like the sample image here:
PdfPig lets me extract the text on the page without issue, but when I try to call GetImages() the result is empty. The page Dictionary ToString() is as follows:
The only thing out of the ordinary, I think, is the <XObject, <Xf5, 182 0>>, but the Xf5 NameToken doesn't exist and I cannot extract the data from there. Can someone give me some pointers?