Closed Tushar-Mehndiratta closed 4 years ago
Looks like your Python code that calls Tika is taking in an input file name that includes Unicode. You may want to change your code to:
file_name = codecs.decode(input('Enter the file name:'), "utf-8")
It raises another error:
See this article. You have to read the filename string in UTF-8. I didn't literally mean take the code I gave and paste it, sorry it was pseudo code and I didn't try to run it. Good luck!
OK, I found another similar way to do that without using the codecs module.
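The alternative itself isn't shown in the thread. As a minimal sketch of one possibility, assuming the file name arrives as UTF-8 encoded bytes (in Python 3, `input()` already returns a `str`, so nothing needs decoding there), the built-in `bytes.decode` does the same job as `codecs.decode`. The helper name `to_text` is hypothetical and not part of Tika.

```python
def to_text(name):
    """Return name as str, decoding UTF-8 bytes if needed (hypothetical helper)."""
    if isinstance(name, bytes):
        # bytes.decode is the built-in equivalent of codecs.decode(name, "utf-8")
        return name.decode("utf-8")
    # Already a str (e.g. the result of input() in Python 3): return unchanged
    return name


# UTF-8 bytes for a file name containing an accented "e"
file_name = to_text(b"r\xc3\xa9sum\xc3\xa9.docx")
```

Passing a `str` through unchanged means the same helper works whether the name came from `input()` or from a bytes source.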
Also, can I get help with this:
This occurs the first time (on each new day) I run the code. Because of this, there is a delay in getting the output.
Thanks @Tushar-Mehndiratta. I don't have a fix for the "failed to see the startup log message" warning; however, you can safely ignore it, as Tika still works fine even if it doesn't see the log output on first run.
But it causes a huge delay (15 to 50 seconds) in executing the whole code.
I suppose we could consider adding an option to skip the check, but overall you need to check that the log has been written in order to run the client (that is, the server needs to be running after starting). It's never 15 seconds; it's a few seconds. If you have a better idea and/or a sample PR that illustrates it, I'm all ears.
Maybe we could put a time bound on it: if the log file is not found within (say) 5 seconds, it could make a new log file. What do you think about this?
Well, the danger is that you don't want the log file to be created via Python. The thing that creates the log file is the starting of the Tika server, so if that log file isn't created you want it to error out.
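To make the two positions concrete, here is a rough sketch of a time-bounded wait that errors out rather than creating the file itself. It is not how the Tika Python client is implemented; the function name `wait_for_log`, its parameters, and the 5-second default are all assumptions for illustration.

```python
import os
import time


def wait_for_log(log_path, timeout=5.0, poll_interval=0.1):
    """Wait until log_path exists; raise TimeoutError after `timeout` seconds.

    Hypothetical helper: the file is never created here, because only the
    server start-up is supposed to write it.
    """
    deadline = time.monotonic() + timeout
    while not os.path.exists(log_path):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{log_path} not created within {timeout} s")
        time.sleep(poll_interval)
    return log_path
```

A caller would catch `TimeoutError` and report that the server did not start, instead of continuing against a server that may not be running.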
I am facing an error while trying to convert a .docx file to XHTML output. A similar issue occurred with several other files (doc/docx/pdf).
UnicodeEncodeError Description:
Program Code that I used: `from tika import parser
file_name = input("Enter the file name :")
def extract_html_text(file_name):
    parsed_html = parser.from_file(file_name, xmlContent=True)
    parsed_html_text = parsed_html['content']
    return parsed_html_text
html_text = extract_html_text(file_name)
print(html_text) `
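The traceback is not reproduced here, so the cause is an assumption: if the `UnicodeEncodeError` is raised by `print` on a console whose encoding cannot represent some characters in the output, writing the text to a file with an explicit UTF-8 encoding would sidestep it. The helper `save_text` and the output file name are hypothetical.

```python
import os
import tempfile


def save_text(text, path):
    """Write text to path as UTF-8, independent of the console encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Stand-in for the parsed content; in the program above this would be html_text.
sample = "<p>caf\u00e9 \u2013 na\u00efve</p>"
out_path = os.path.join(tempfile.gettempdir(), "tika_output.html")
save_text(sample, out_path)
```

Opening the resulting file in a browser or editor then shows the XHTML without going through the console at all.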