empira / PDFsharp-1.5

A .NET library for processing PDF
MIT License

How to decode Hex Strings? #18

Closed by r3db 7 years ago

r3db commented 7 years ago

I was taking a look at your software and I must say it's very good. So I decided to give it a try...

I have a PDF that contains a vector image. I managed to extract the content stream for the vector content, and this is what I got:

q
1 0 0 1 340.9799957 298.8000031 cm
1 g
0 0 m
20.04 0 l
20.04 -11.46 l
0 -11.46 l
0 0 l
h
f*
Q
BT
/C2_0 10.121 Tf
-0.175 Tc 342.06 289.56 Td
<0004000500060004>Tj
ET

(...)

I'm able to render the paths, but I'm having difficulty rendering the text.

For example:

BT
/C2_0 10.121 Tf
-0.175 Tc 342.06 289.56 Td
<0004000500060004>Tj
ET

The hex string does not seem to be a valid string. My guess is that it's an index into the font's code page, in this case the font referred to by /C2_0.

0004000500060004 => 0004 0005 0006 0004, assuming 2 bytes per code. The width depends on the representation, and I don't know where to check that information (I know simple fonts only take one byte per code).
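The split described above can be sketched in a few lines of Python. This is only an illustration, not something PDFsharp provides: `split_codes` is a hypothetical helper, and the default of 2 bytes per code is an assumption that holds for composite (Type0) fonts using an encoding such as Identity-H. The actual width has to be confirmed from the /Encoding entry of the font that /C2_0 resolves to.

```python
def split_codes(hex_string: str, bytes_per_code: int = 2) -> list[int]:
    """Split a PDF hex string (without the < > delimiters) into integer codes.

    bytes_per_code=2 is an assumption (typical for Type0 fonts with
    Identity-H); simple fonts use 1 byte per code.
    """
    # White space inside a PDF hex string is ignored.
    digits = "".join(hex_string.split())
    # If the number of digits is odd, the final digit is taken to be 0.
    if len(digits) % 2:
        digits += "0"
    raw = bytes.fromhex(digits)
    if len(raw) % bytes_per_code:
        raise ValueError("string length is not a multiple of the code width")
    return [
        int.from_bytes(raw[i:i + bytes_per_code], "big")
        for i in range(0, len(raw), bytes_per_code)
    ]


codes = split_codes("0004000500060004")
print(codes)  # [4, 5, 6, 4]
```

These numbers are character codes for the font, not text. Turning them into characters still needs a mapping supplied by the font itself.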

The question is: how can I access the font and its code page information to extract the text?

Or better yet, is there a simpler way to get all of this without having to parse the vector data myself, by getting the objects directly? For example PdfLine, PdfText, PdfCircle, etc.

Thanks.

TH-Soft commented 7 years ago

PDFsharp does not parse those items.

PDF files may or may not contain lookup tables for the font indexes.
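
When a font does carry such a lookup table, it is usually the /ToUnicode stream in the font dictionary, a CMap whose `bfchar` and `bfrange` sections map character codes to UTF-16BE text. The Python sketch below shows roughly how such a table could be applied once the stream has been extracted and decompressed (that step is not shown). `parse_to_unicode` and `decode` are hypothetical helpers, and `SAMPLE_CMAP` is invented for illustration. The real mapping for /C2_0 is not known from this thread.

```python
import re

# Invented example data: the real /ToUnicode stream of /C2_0 is unknown.
SAMPLE_CMAP = """
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
<0004> <0041>
<0005> <0042>
<0006> <0043>
endbfchar
endcmap
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_to_unicode(cmap_text: str) -> dict[int, str]:
    """Build a code -> text table from bfchar and simple bfrange sections."""
    mapping: dict[int, str] = {}
    # bfchar: pairs of <code> <UTF-16BE text>
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, section):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange: triples of <lo> <hi> <text for lo>; the text is incremented
    # for each following code. The array form "<lo> <hi> [ ... ]" is
    # stripped and ignored here to keep the sketch short.
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        section = re.sub(HEX + r"\s*" + HEX + r"\s*\[[^\]]*\]", "", section)
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, section):
            base = bytes.fromhex(dst)
            start = int.from_bytes(base, "big")
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                target = (start + offset).to_bytes(len(base), "big")
                mapping[code] = target.decode("utf-16-be", errors="replace")
    return mapping


def decode(codes: list[int], mapping: dict[int, str]) -> str:
    """Map character codes to text; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


table = parse_to_unicode(SAMPLE_CMAP)
print(decode([4, 5, 6, 4], table))  # "ABCA" with the invented table above
```

If the font has no /ToUnicode entry, this approach has nothing to work with, and the text would have to be recovered some other way, for example from the embedded font program's own glyph tables.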