KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform
http://kevm.github.io/tikaondotnet/
Apache License 2.0

Internal exception opening a MsWord document. #120

Closed: cvalde closed this issue 6 years ago

cvalde commented 6 years ago

Hello, with this simple code:

    var fn = <some .doc file>;
    var textExtractor = new TextExtractor();
    var wordDocContents = textExtractor.Extract(fn);

I get the following error (I'm translating from Spanish):

Extraction failed. Exception in initializer of type 'org.apache.tika.metadata.Metadata'. Cannot convert object type 'java.util.PropertyResourceBundle' to type 'sun.util.resources.OpenListResourceBundle'.

I can send you an example project but since it's extremely simple (the same basic example shown at https://github.com/KevM/tikaondotnet) I think it would be better if I send you the file. I would rather avoid posting such a file in a public place because it's from my job.

Thanks. Claudio.

cvalde commented 6 years ago

I should correct myself. I was able to reproduce the error with a simple .docx document that I can send on request, but the issue seems to be of another kind. The code

    var textExtractor = new TextExtractor();
    var webPageContents = textExtractor.Extract(new Uri("https://google.com"));

produces exactly the same exception:

Cannot convert object type 'java.util.PropertyResourceBundle' to type 'sun.util.resources.OpenListResourceBundle'.

Could the operating system's regional settings be the culprit? My Windows is set to Chile, Spanish. Thanks.
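One way to probe the regional-settings hypothesis is to run the same extraction under a different .NET culture. The sketch below is only an illustration: `CultureProbe` and `RunWithCulture` are hypothetical names, and it assumes IKVM derives the default Java locale from the .NET thread culture, which nothing in this thread confirms.

```csharp
using System;
using System.Globalization;
using System.Threading;

public static class CultureProbe
{
    // Runs the given action with the current thread's culture temporarily
    // switched to cultureName, then restores the previous culture.
    public static void RunWithCulture(string cultureName, Action action)
    {
        Thread thread = Thread.CurrentThread;
        CultureInfo savedCulture = thread.CurrentCulture;
        CultureInfo savedUiCulture = thread.CurrentUICulture;
        try
        {
            CultureInfo culture = CultureInfo.GetCultureInfo(cultureName);
            thread.CurrentCulture = culture;
            thread.CurrentUICulture = culture;
            action();
        }
        finally
        {
            // Restore the original settings even if the action throws.
            thread.CurrentCulture = savedCulture;
            thread.CurrentUICulture = savedUiCulture;
        }
    }
}
```

It could be called as `CultureProbe.RunWithCulture("en-US", () => new TextExtractor().Extract(fn));`. Because the exception is raised in a type initializer, which runs at most once per process, each culture is best tried in a fresh process and before any other Tika call.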

KevM commented 6 years ago

Hmm, we have a unit test which basically does the same thing. Sounds like an IKVM problem. This post sounds similar. Maybe ensure you have all the required IKVM assemblies present in your application's bin directory.

cvalde commented 6 years ago

Thanks for your answer. I used NuGet to install TikaOnDotnet.TextExtractor and it did the whole job. Is there any listing that I can read to verify that IKVM has all the files? I'm using VS 2015 with .NET 4.5 and I'm not sure how to overcome this error.

KevM commented 6 years ago

We've had problems in the past with some required assembly not being included with people's deployments. There are a lot of IKVM assemblies and you'll need to include most of them when you deploy your application.
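A quick way to see which IKVM assemblies actually ended up next to the executable is to list them at startup. This is a minimal sketch, not part of TikaOnDotnet: `IkvmAssemblyCheck` and `ListIkvmAssemblies` are hypothetical names, and the only assumption is that the IKVM files follow the `IKVM*.dll` naming pattern. It does not know which assemblies are required; it only shows what is present so the output can be compared with a working build.

```csharp
using System;
using System.IO;
using System.Linq;

public static class IkvmAssemblyCheck
{
    // Returns the file names (without paths) of every IKVM*.dll found
    // directly in the given directory, sorted alphabetically.
    public static string[] ListIkvmAssemblies(string directory)
    {
        return Directory.GetFiles(directory, "IKVM*.dll")
                        .Select(path => Path.GetFileName(path))
                        .OrderBy(name => name, StringComparer.OrdinalIgnoreCase)
                        .ToArray();
    }
}
```

Calling `IkvmAssemblyCheck.ListIkvmAssemblies(AppDomain.CurrentDomain.BaseDirectory)` in both the failing application and a working one (for example the test project) and comparing the two lists should reveal any assembly that is missing from the deployment.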

Based on your other issue, #121, you are on the right track. You can build and run the tests to see the text extractor work. It is easy to add a test for the particular document you are having problems with. Let me know if I can help.

cvalde commented 6 years ago

I was able to make TikaDN work with a simple WinForms project. I'm not sure what happened with the original WPF application. I will try again in a few days.