ICIJ / node-tika

Apache Tika bridge for Node.js. Text and metadata extraction, language detection and more.
MIT License

set output encoding to UTF-8 #1

Closed matthiasg closed 10 years ago

matthiasg commented 10 years ago

When parsing some PDF files, the text output was not in UTF-8 (German umlauts were wrong, for example). I added an explicit UTF-8 encoding to the OutputStreamWriter that is used, and I get UTF-8 output now.
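For readers who have not hit this before, here is a minimal sketch of the idea, not the actual change from this pull request. An `OutputStreamWriter` built without a charset argument uses the JVM's default charset, which varies by platform, so passing `StandardCharsets.UTF_8` pins the encoding. The class name `Utf8WriterSketch` and the method `encodeUtf8` are hypothetical and exist only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not code from node-tika.
public class Utf8WriterSketch {

    // Encodes the text through an OutputStreamWriter that is explicitly
    // configured for UTF-8 instead of the platform default charset.
    static byte[] encodeUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // "Übergröße" written with Unicode escapes so the source file
        // encoding does not affect the example.
        byte[] bytes = encodeUtf8("\u00dcbergr\u00f6\u00dfe");
        // Decoding with the same charset restores the original text.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

Whatever reads the bytes on the other side has to decode them as UTF-8 as well, otherwise the umlauts will still come out wrong.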

mattcg commented 10 years ago

Thanks for this :) Just one question in the line comments.

mattcg commented 10 years ago

Ok, pulled in commit 8b00b4c4e2c59c156b8a3f35d166221771c09a33. I've added your name to the contributors file.

matthiasg commented 10 years ago

Thanks. Even better that you updated to the next Tika. I wanted to do that myself, but got stuck on node-java not working on SmartOS.

mattcg commented 10 years ago

Yeah, Tika 1.6 fixes a lot of issues I had with parsing PDFs :smile:

chrismattmann commented 10 years ago

BOOM this is awesome!