BhaaLseN closed this issue 1 year ago
@BhaaLseN thanks for the issue and for sharing the test document, I'll have a look shortly
In the PDF, this renders like a regular dash, but I'm not really sure how to address this (as in: what to replace the character with to get as close to the rendered output as possible)
I think the [ACK] char is rendered as a dash because of the font used, so there's no real way to substitute it with something else. I see it rendered as a dot on dotnetfiddle: https://dotnetfiddle.net/RmLE4M
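If you want to see the underlying code point rather than whatever the font happens to draw, something along these lines works (just a sketch; the file name is a placeholder for the attached test document):

```csharp
using System;
using System.Linq;
using System.Xml;
using UglyToad.PdfPig;

class InspectLetters
{
    static void Main()
    {
        // "test.pdf" is a placeholder for the document attached to this issue.
        using (var document = PdfDocument.Open("test.pdf"))
        {
            foreach (var page in document.GetPages())
            {
                foreach (var letter in page.Letters)
                {
                    // Letter.Value holds the extracted text for the glyph; the font may
                    // draw it as a dash or a dot, but the code point is what the XML
                    // serializer sees.
                    if (letter.Value.Any(c => !XmlConvert.IsXmlChar(c)))
                    {
                        var codePoints = string.Join(" ", letter.Value.Select(c => $"U+{(int)c:X4}"));
                        Console.WriteLine($"Page {page.Number}: {codePoints}");
                    }
                }
            }
        }
    }
}
```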
I'm working on a fix, should be made available shortly
Interesting. To be fair, I didn't even consider looking at the font there.
Your approach looks like something I could work with. 👍
I briefly considered asking if you could include a way to customize the invalidCharacterHandler (so I could hook in and write PDF-specific code to return replacements); but then I realized I'd need more than just the string for context (like the Word, Letter etc.) to do that. And it's probably not worth including that as part of PdfPig when I can just do it afterwards on the already converted XML using a strategy like ConvertToHexadecimal.
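For reference, the kind of after-the-fact cleanup I have in mind is something like this (a sketch only; the class and method names are mine, not PdfPig's, and it mirrors the idea of a hex-conversion strategy rather than any actual API):

```csharp
using System.Text;
using System.Xml;

static class XmlSanitizer
{
    // Replaces characters that are invalid in XML 1.0 with a visible hexadecimal
    // marker. Invalid control characters can only occur in text or attribute
    // content, never in the markup itself, so a plain character-by-character pass
    // over the already-exported XML string is safe.
    public static string ConvertInvalidCharsToHex(string xml)
    {
        var builder = new StringBuilder(xml.Length);
        foreach (var c in xml)
        {
            if (XmlConvert.IsXmlChar(c))
            {
                builder.Append(c);
            }
            else
            {
                // e.g. the ACK character (U+0006) becomes the literal text "0x06".
                // Note: this sketch does not special-case surrogate pairs.
                builder.Append("0x").Append(((int)c).ToString("X2"));
            }
        }
        return builder.ToString();
    }
}
```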
I'm currently looking into ways to get something machine-readable and processable from PDF files, and PdfPig conveniently has a bunch of text exporters that produce either XML (PAGE and Alto) or XHTML (hOCR), which seem like a good fit.
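The export itself is essentially the snippet from the Document Layout Analysis wiki, roughly like this (the exporter class and constructor arguments here are my best recollection of the API, so treat them as assumptions; the file name is a placeholder):

```csharp
using System.IO;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.Export;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

class ExportToPageXml
{
    static void Main()
    {
        using (var document = PdfDocument.Open("test.pdf"))
        {
            // Word extractor and page segmenter choices are assumptions on my part.
            var exporter = new PageXmlTextExporter(
                NearestNeighbourWordExtractor.Instance,
                DocstrumBoundingBoxes.Instance);

            foreach (var page in document.GetPages())
            {
                // Get(page) serializes the page layout to PAGE XML; this is the call
                // that throws once a letter contains a character XML cannot represent.
                var xml = exporter.Get(page);
                File.WriteAllText($"page-{page.Number}.xml", xml);
            }
        }
    }
}
```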
However, one of the first PDF files I tried (which is a pretty old one, and might not fully conform to the spec) contains glyphs that appear to be unprintable characters. With PAGE (through PageXmlTextExtractor), I get an XML serialization error, while the others simply produce a file that is technically invalid (but they do produce output that I can process before continuing). This test file causes the following XML exception (using basically the code from the Document Layout Analysis wiki entry):
In the PDF, this renders like a regular dash, but I'm not really sure how to address this (as in: what to replace the character with to get as close to the rendered output as possible) and where (since this could be done during reading in ContentStreamProcessor.ShowText, inside the Letter ctor, or also during serialization in PageXmlTextExtractor while letters and the text-equivalent are written; which would come with the drawback of having to implement it once for every text extractor). If someone can pick a suitable way forward, I might be able to implement it and open a Pull Request (if time permits).
Just keep in mind that this affects every extractor, but only the PAGE extractor throws an exception. The others simply write the character as-is (hOCR), or try to encode it but fail at runtime (SVG), while Alto simply stops processing the block at this point (the text output stops at "TM", and for the original PDF continues with the next detected block).