ICIJ / node-tika

Apache Tika bridge for Node.js. Text and metadata extraction, language detection and more.
MIT License

Return XHTML content as well (the Tika default) #2

Closed by alxlo 9 years ago

alxlo commented 9 years ago

It would be awesome if the bridge could deliver not only plain text, but also the XHTML that Tika generates in its default configuration :-)

mattcg commented 9 years ago

Hi @alxlo, out of interest, what is the use-case for XHTML output?

alxlo commented 9 years ago

Hi Matthew, the XHTML contains a separate div section for each page of a PDF file. I am currently experimenting with lunr.js to generate a search index on the server side for use by a (mobile) client application, and I hope to deliver more precise search hits by indicating not only the PDF, but also the page within the PDF. Best regards, Alexander
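
To illustrate the per-page idea, here is a minimal sketch of splitting such XHTML into one record per page. It assumes each page is wrapped in a non-nested `<div class="page">` element, which is how Tika's PDF output is usually structured; the function name `splitPages` and the record shape are hypothetical, and the tag stripping is a crude regex rather than a real XHTML parser (entities are not decoded).

```javascript
// Sketch only: split Tika-style XHTML into one plain-text record per page.
// Assumption: every page is wrapped in <div class="page"> and pages do not nest.
function splitPages(xhtml) {
  const pageStart = /<div\s+class="page"\s*>/g;

  // Everything before the first page marker (head, opening body tag) is dropped.
  const chunks = xhtml.split(pageStart).slice(1);

  return chunks.map((chunk, index) => ({
    page: index + 1, // 1-based page number
    text: chunk
      .replace(/<[^>]*>/g, ' ') // strip remaining tags (crude, no entity decoding)
      .replace(/\s+/g, ' ')     // collapse whitespace
      .trim(),
  }));
}

// Hypothetical sample input standing in for real extractor output.
const sample =
  '<html><body>' +
  '<div class="page"><p>Hello world</p></div>\n' +
  '<div class="page"><p>Second page</p><p>more</p></div>' +
  '</body></html>';

const pages = splitPages(sample);
console.log(pages);
```

Each record could then be indexed as its own document (for example with an id combining the file name and the page number), so that a search hit points to a page rather than to the whole PDF.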

mattcg commented 9 years ago

Try the tika.xhtml method, available on the master branch. Be warned that there are breaking changes in the way options are specified.

alxlo commented 9 years ago

Thank you so much, works like a charm!

mattcg commented 9 years ago

Fantastic :smile_cat: