Open aleksandrskrivickis opened 3 months ago
Feel free to give this a shot: https://github.com/TJC-LP/tika-ocr/tree/TJC-LP/enable-xml-output
I'm going to test it in our Databricks workspace in the next few days, but locally it seems to work as expected.
Thank you very much. I'm going to test the proposed changes now.
The current handler returns plain text. Tika allows more structured output in the form of XML using ToXMLContentHandler. I propose introducing an optional parameter that would enable XML output when more structured data is needed.
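To illustrate the idea, here is a rough JDK-only sketch of how an optional parameter could switch between text and XML output. Tika parsers emit SAX events to a content handler, so the choice of handler decides the output shape. Everything named below (`OutputFormatSketch`, `OutputFormat`, `TextHandler`, `XmlHandler`, `extract`) is hypothetical and not taken from tika-ocr or Tika; in a real change the two stand-in handlers would presumably be replaced by a plain-text handler such as Tika's `BodyContentHandler` and by `ToXMLContentHandler`.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: none of these names come from tika-ocr or Tika.
class OutputFormatSketch {

    // Hypothetical option mirroring the proposed optional parameter.
    enum OutputFormat { TEXT, XML }

    // Stand-in for a plain-text handler: keeps character data only.
    static class TextHandler extends DefaultHandler {
        protected final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    // Stand-in for an XML handler: also re-emits element tags.
    // Simplification: attributes are dropped and text is not escaped.
    static class XmlHandler extends TextHandler {
        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            out.append('<').append(qName).append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }
    }

    // Parses a well-formed XHTML string and returns text or XML,
    // depending on the requested format.
    static String extract(String xhtml, OutputFormat format) {
        DefaultHandler handler =
            (format == OutputFormat.XML) ? new XmlHandler() : new TextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler.toString();
    }
}
```

With this sketch, `extract("<p>Hello</p>", OutputFormat.TEXT)` returns only the character data, while `OutputFormat.XML` keeps the element structure. Defaulting the parameter to `TEXT` would leave existing callers unaffected.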