chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0
1.51k stars 234 forks

UnicodeEncodeError: 'charmap' codec can't encode character #319

Closed Tushar-Mehndiratta closed 4 years ago

Tushar-Mehndiratta commented 4 years ago

I am facing an error while trying to convert a .docx file to XHTML output. I hit a similar issue with several other files (doc, docx, and pdf).

UnicodeEncodeError Description: image

Program code that I used: image

```python
from tika import parser

file_name = input("Enter the file name :")


def extract_html_text(file_name):
    parsed_html = parser.from_file(file_name, xmlContent=True)
    parsed_html_text = parsed_html['content']
    return parsed_html_text


html_text = extract_html_text(file_name)

print(html_text)
```

Also, can I get help with this: image

chrismattmann commented 4 years ago

Looks like your Python code that calls Tika is taking in an input file name that includes Unicode. You may want to change your code to:

file_name = codecs.decode(input('Enter the file name:'), "utf-8")

Tushar-Mehndiratta commented 4 years ago

It raises another error: image

chrismattmann commented 4 years ago

See this article. You have to read the filename string in UTF-8. I didn't mean for you to paste the code I gave literally; sorry, it was pseudocode and I didn't try to run it. Good luck!
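For anyone landing here with the same error, here is a minimal sketch of two common workarounds. It assumes the `'charmap'` error is raised when the extracted text is printed to a console whose encoding (for example cp1252 on Windows) cannot represent some characters; the screenshots in this thread are not available, so that is an assumption. The helper names `make_printable` and `save_utf8` are hypothetical and are not part of tika-python.

```python
import sys


def make_printable(text, encoding):
    """Return text with characters that `encoding` cannot represent replaced by '?'."""
    # Hypothetical helper: round-trip through the target encoding so that a later
    # print() to a stream using that encoding cannot raise UnicodeEncodeError.
    return text.encode(encoding, errors="replace").decode(encoding)


def save_utf8(text, path):
    """Write text to a file as UTF-8 so that no characters are lost."""
    # Hypothetical helper: an explicit encoding avoids the platform default codec.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


if __name__ == "__main__":
    sample = "Extracted text with an arrow \u2192 and a check mark \u2713"
    # sys.stdout.encoding can be None when output is redirected, so fall back to UTF-8.
    console_encoding = sys.stdout.encoding or "utf-8"
    print(make_printable(sample, console_encoding))
```

In the original program, this would mean calling `print(make_printable(html_text, sys.stdout.encoding or "utf-8"))` for display, or `save_utf8(html_text, "output.html")` to keep the full XHTML (the file name is only an example).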

Tushar-Mehndiratta commented 4 years ago

OK, I found another similar way to do that without using the codecs module.

Also, can I get help with this: image

This occurs the first time I run the code each day, and it delays the output.

chrismattmann commented 4 years ago

Thanks @Tushar-Mehndiratta. I don't have a fix for the "failed to see the startup log message" warning; however, you can safely ignore it, as Tika still works fine even if it doesn't see the log output on first run.

Tushar-Mehndiratta commented 4 years ago

But it causes a huge delay (15 to 50 seconds) in executing the whole code.

chrismattmann commented 4 years ago

I suppose we could consider adding an option to skip the check, but overall you need to check that the log has been written in order to run the client (that is, the server needs to be running after it starts). It's never 15 seconds; it's a few seconds. If you have a better idea and/or a sample PR that illustrates it, I'm all ears.

Tushar-Mehndiratta commented 4 years ago

Maybe we could put a time bound on it: if the log file is not found within, say, 5 seconds, it could create a new log file. What do you think about this?

chrismattmann commented 4 years ago

Well, the danger is that you don't want the log file to be created via Python. The thing that creates the log file is the Tika server starting up, so if that log file isn't created, you want it to error out.
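To make the trade-off discussed above concrete, here is a rough sketch of a time-bounded wait that errors out instead of creating the log file itself. This is not tika-python's actual startup code; `wait_for_log`, its parameters, and the 5-second default are hypothetical and chosen only to mirror the suggestion in this thread.

```python
import os
import time


def wait_for_log(path, timeout=5.0, interval=0.1):
    """Poll until `path` exists and is non-empty, or raise TimeoutError.

    Hypothetical helper: the file is never created here, because only the
    server process writing it proves that the server actually started.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "log file %r was not written within %.1f seconds" % (path, timeout)
            )
        time.sleep(interval)
```

A caller would start the server process first and then call `wait_for_log(log_path)`. A missing log then surfaces as an error after a bounded wait rather than being masked by a file the client created itself.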