UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
Apache License 2.0

Text exporters do not escape/replace invalid XML/HTML characters #655

Closed BhaaLseN closed 1 year ago

BhaaLseN commented 1 year ago

I'm currently looking into ways to get something machine-readable and processable from PDF files, and PdfPig conveniently has a bunch of text exporters that produce either XML (PAGE and Alto) or XHTML (hOCR), which seem like a good fit.

However, one of the first PDF files I tried (which is a pretty old one, and might not fully conform to the spec) contains glyphs that appear to be unprintable characters. With PAGE (through PageXmlTextExporter), I get an XML serialization error, while the other exporters simply produce a file that is technically invalid (but they do produce output that I can process before continuing).

This test file causes the following XML Exception (using basically the code from the Document Layout Analysis wiki entry):

  Message=There was an error generating the XML document.
   at System.Xml.Serialization.XmlSerializer.Serialize(XmlWriter xmlWriter, Object o, XmlSerializerNamespaces namespaces, String encodingStyle, String id)
   at System.Xml.Serialization.XmlSerializer.Serialize(XmlWriter xmlWriter, Object o)
   at UglyToad.PdfPig.DocumentLayoutAnalysis.Export.PageXmlTextExporter.Serialize(PageXmlDocument pageXmlDocument) in D:\_dev\PdfPig\src\UglyToad.PdfPig.DocumentLayoutAnalysis\Export\PageXmlTextExporter.cs:line 330
   at UglyToad.PdfPig.DocumentLayoutAnalysis.Export.PageXmlTextExporter.Get(Page page, Boolean includePaths) in D:\_dev\PdfPig\src\UglyToad.PdfPig.DocumentLayoutAnalysis\Export\PageXmlTextExporter.cs:line 104
   at UglyToad.PdfPig.DocumentLayoutAnalysis.Export.PageXmlTextExporter.Get(Page page) in D:\_dev\PdfPig\src\UglyToad.PdfPig.DocumentLayoutAnalysis\Export\PageXmlTextExporter.cs:line 73
   at PdfPigTest.Program.Main(String[] args) in D:\_dev\PdfPigTest\Program.cs:line 43

  This exception was originally thrown at this call stack:
    System.Xml.XmlUtf8RawTextWriter.InvalidXmlChar(int, byte*, bool)
    System.Xml.XmlUtf8RawTextWriter.WriteElementTextBlock(char*, char*)
    System.Xml.Serialization.XmlSerializationWriter.WriteElementString(string, string, string, System.Xml.XmlQualifiedName)
    System.Xml.Serialization.XmlSerializationWriter.WriteElementString(string, string, string)
    Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializationWriterPageXmlDocument.Write48_PageXmlTextEquiv(string, string, UglyToad.PdfPig.DocumentLayoutAnalysis.Export.PAGE.PageXmlDocument.PageXmlTextEquiv, bool, bool)
    Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializationWriterPageXmlDocument.Write56_PageXmlGlyph(string, string, UglyToad.PdfPig.DocumentLayoutAnalysis.Export.PAGE.PageXmlDocument.PageXmlGlyph, bool, bool)
    Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializationWriterPageXmlDocument.Write59_PageXmlWord(string, string, UglyToad.PdfPig.DocumentLayoutAnalysis.Export.PAGE.PageXmlDocument.PageXmlWord, bool, bool)
    Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializationWriterPageXmlDocument.Write60_PageXmlTextLine(string, string, UglyToad.PdfPig.DocumentLayoutAnalysis.Export.PAGE.PageXmlDocument.PageXmlTextLine, bool, bool)
    [Call Stack Truncated]

Inner Exception 1:
ArgumentException: '[ACK]', hexadecimal value 0x06, is an invalid character.
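
For context, XML 1.0 only allows tab, line feed and carriage return below U+0020, so U+0006 (ACK) cannot appear in element text at all. The following is a minimal sketch, using only `System.Xml.XmlConvert.IsXmlChar` from the framework, of how extracted text could be checked before it reaches the serializer. The class and method names are hypothetical and are not part of PdfPig.

```csharp
using System;
using System.Xml;

public static class XmlCharCheck
{
    // Hypothetical helper (not a PdfPig API): returns the index of the first
    // UTF-16 code unit that XML 1.0 does not allow, or -1 if the text is clean.
    public static int FirstInvalidXmlCharIndex(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            // A well-formed surrogate pair encodes a code point above U+FFFF,
            // which XML allows; skip both halves.
            if (i + 1 < text.Length && char.IsSurrogatePair(text[i], text[i + 1]))
            {
                i++;
                continue;
            }

            // IsXmlChar rejects control characters such as U+0006
            // as well as lone surrogates.
            if (!XmlConvert.IsXmlChar(text[i]))
            {
                return i;
            }
        }

        return -1;
    }
}
```

For example, `XmlCharCheck.FirstInvalidXmlCharIndex("TM\u0006")` points at the ACK character at index 2, while a string containing only a tab and printable characters yields -1.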

In the PDF, this renders like a regular dash, but I'm not really sure how to address this (as in: what to replace the character with to get as close to the rendered output as possible) or where to do it. It could be done during reading in ContentStreamProcessor.ShowText, inside the Letter constructor, or during serialization in PageXmlTextExporter while the letters and the text equivalent are written. The last option comes with the drawback of having to implement it once for every text exporter.

If someone can pick a suitable way forward, I might be able to implement it and open a Pull Request (if time permits).

Just keep in mind that this affects every exporter, but it causes exceptions for the PAGE exporter. Of the others, hOCR simply writes the character as-is, SVG tries to encode it but fails at runtime, and Alto stops processing the block at this point (the text output stops at "TM" and, for the original PDF, continues with the next detected block).

BobLd commented 1 year ago

@BhaaLseN thanks for the issue and for sharing the test document. I'll have a look shortly.

BobLd commented 1 year ago

> In the PDF, this renders like a regular dash, but I'm not really sure how to address this (as in: what to replace the character with to get as close to the rendered output as possible)

I think the [ACK] char is rendered as a dash because of the font used, so there's no real way to substitute it with something else. I see it rendered as a dot on dotnetfiddle.

I'm working on a fix; it should be available shortly.

BhaaLseN commented 1 year ago

Interesting. To be fair, I didn't even consider looking at the font there.

Your approach looks like something I could work with. 👍 I briefly considered asking if you could include a way to customize the invalidCharacterHandler (so I could hook in and write PDF-specific code to return replacements), but then I realized I'd need more than just the string for context (like the Word, Letter etc.) to do that. And it's probably not worth including that as part of PdfPig when I can just do it afterwards on the already converted XML using a strategy like ConvertToHexadecimal.
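
To illustrate the handler idea discussed above, here is a rough sketch of a sanitizer that passes every invalid XML character to a caller-supplied callback. It is not PdfPig's implementation: `XmlTextSanitizer`, `Sanitize` and `ToHex` are hypothetical names, and the `0x06`-style output is only an assumption about what a hexadecimal strategy might produce.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlTextSanitizer
{
    // Hypothetical strategy: represent an invalid character by its hex code,
    // e.g. U+0006 becomes the literal text "0x06" (assumed format).
    public static string ToHex(char c) => "0x" + ((int)c).ToString("X2");

    // Copies valid XML characters unchanged and replaces every other
    // character with whatever the handler returns for it.
    public static string Sanitize(string text, Func<char, string> invalidCharacterHandler)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (i + 1 < text.Length && char.IsSurrogatePair(c, text[i + 1]))
            {
                // Keep well-formed surrogate pairs intact.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
            else
            {
                sb.Append(invalidCharacterHandler(c));
            }
        }

        return sb.ToString();
    }
}
```

With this sketch, `XmlTextSanitizer.Sanitize("TM\u0006", XmlTextSanitizer.ToHex)` produces text that serializes cleanly, and a document-specific handler such as `c => c == '\u0006' ? "-" : ""` could map the ACK glyph back to the dash it renders as in this particular PDF. As noted in the comment above, a handler that only sees the character has no access to the surrounding Word or Letter, which limits how context-aware the replacement can be.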