databrickslabs / tika-ocr


No support for ToXMLContentHandler #45

Open aleksandrskrivickis opened 3 months ago

aleksandrskrivickis commented 3 months ago

The current handler returns plain text. Tika can also produce more structured output in the form of XML by using ToXMLContentHandler.

I propose introducing an optional parameter that enables XML output when more structured data is needed.
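
To illustrate the idea, here is a minimal sketch in Python using only the standard library. It is an analogy, not Tika's or tika-ocr's API: the same stream of SAX events is sent either to a handler that keeps only character data (plain text) or to one that re-serializes the markup (XML). The function name `extract` and the flag `as_xml` are hypothetical and stand in for the proposed optional parameter.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class PlainTextHandler(xml.sax.ContentHandler):
    """Collects character data only and drops all markup."""

    def __init__(self):
        super().__init__()
        self._parts = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate them.
        self._parts.append(content)

    def result(self):
        return "".join(self._parts)


def extract(xhtml: str, as_xml: bool = False) -> str:
    """Hypothetical entry point; `as_xml` mimics the proposed optional parameter."""
    data = xhtml.encode("utf-8")
    if as_xml:
        # XMLGenerator writes every SAX event back out as XML markup.
        out = io.StringIO()
        xml.sax.parseString(data, XMLGenerator(out, encoding="utf-8"))
        return out.getvalue()
    handler = PlainTextHandler()
    xml.sax.parseString(data, handler)
    return handler.result()


if __name__ == "__main__":
    doc = "<html><body><h1>Title</h1><p>Body text</p></body></html>"
    print(extract(doc))               # text only, structure lost
    print(extract(doc, as_xml=True))  # headings and paragraphs preserved as tags
```

With the default, the heading and paragraph run together as undifferentiated text; with `as_xml=True`, the element boundaries survive, which is the kind of structure the XML output would expose.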

arcaputo3 commented 2 months ago

Feel free to give this a shot: https://github.com/TJC-LP/tika-ocr/tree/TJC-LP/enable-xml-output

I'm going to test it in our Databricks workspace in the next few days, but locally it seems to work as expected.

aleksandrskrivickis commented 2 months ago

Thank you very much. I'm going to test the proposed changes now.