KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform
http://kevm.github.io/tikaondotnet/
Apache License 2.0

Cast exception when using non-english language #118

Open chrisoverton91 opened 6 years ago

chrisoverton91 commented 6 years ago

I have found an issue that occurs when the Windows language/region settings are switched away from English (confirmed with French and German on Windows 10). When I try to extract text from a document, I get the error below. If I switch my language/region settings back to English (UK), it works fine again. I have made sure I am using the latest version, but still no luck. Do you have any ideas?

TikaOnDotNet.TextExtraction.TextExtractionException: Extraction of text from the file 'C:\test\Sample Documents\Test Area\afile-11730 - Copy.txt' failed.
 ---> TikaOnDotNet.TextExtraction.TextExtractionException: Extraction failed.
 ---> System.TypeInitializationException: The type initializer for 'org.apache.tika.metadata.Metadata' threw an exception.
 ---> System.InvalidCastException: Unable to cast object of type 'java.util.PropertyResourceBundle' to type 'sun.util.resources.OpenListResourceBundle'.
   at sun.util.resources.LocaleData.getCurrencyNames(Locale locale)
   at sun.util.locale.provider.LocaleResources.getCurrencyName(String key)
   at sun.util.locale.provider.CurrencyNameProviderImpl.getString(String , Locale )
   at sun.util.locale.provider.CurrencyNameProviderImpl.getSymbol(String currencyCode, Locale locale)
   at java.util.Currency.CurrencyNameGetter.getObject(CurrencyNameProvider , Locale , String , Object[] )
   at java.util.Currency.CurrencyNameGetter.getObject(LocaleServiceProvider , Locale , String , Object[] )
   at sun.util.locale.provider.LocaleServiceProviderPool.getLocalizedObjectImpl(LocalizedObjectGetter , Locale , Boolean , String , Object[] )
   at sun.util.locale.provider.LocaleServiceProviderPool.getLocalizedObject(LocalizedObjectGetter getter, Locale locale, String key, Object[] params)
   at java.util.Currency.getSymbol(Locale locale)
   at java.text.DecimalFormatSymbols.initialize(Locale )
   at java.text.DecimalFormatSymbols..ctor(Locale locale)
   at sun.util.locale.provider.DecimalFormatSymbolsProviderImpl.getInstance(Locale locale)
   at java.text.DecimalFormatSymbols.getInstance(Locale locale)
   at sun.util.locale.provider.NumberFormatProviderImpl.getInstance(Locale , Int32 )
   at sun.util.locale.provider.NumberFormatProviderImpl.getIntegerInstance(Locale locale)
   at java.text.NumberFormat.getInstance(LocaleProviderAdapter , Locale , Int32 )
   at java.text.NumberFormat.getInstance(Locale , Int32 )
   at java.text.NumberFormat.getIntegerInstance(Locale inLocale)
   at java.text.SimpleDateFormat.initialize(Locale )
   at java.text.SimpleDateFormat..ctor(String pattern, DateFormatSymbols formatSymbols)
   at org.apache.tika.metadata.Metadata.createDateFormat(String , TimeZone )
   at org.apache.tika.metadata.Metadata..cctor()
   --- End of inner exception stack trace ---
   at org.apache.tika.metadata.Metadata..ctor()
   at TikaOnDotNet.TextExtraction.Stream.StreamTextExtractor.Extract(Func`2 streamFactory, Stream outputStream)
   --- End of inner exception stack trace ---
   at TikaOnDotNet.TextExtraction.Stream.StreamTextExtractor.Extract(Func`2 streamFactory, Stream outputStream)
   at TikaOnDotNet.TextExtraction.TextExtractor.Extract(Func`2 streamFactory)
   at TikaOnDotNet.TextExtraction.TextExtractor.Extract(String filePath)
   --- End of inner exception stack trace ---
   at TikaOnDotNet.TextExtraction.TextExtractor.Extract(String filePath)

KevM commented 6 years ago

Not sure. This seems like an IKVM related problem. Maybe someone here can help better than I?

KevM commented 6 years ago

Could you create a PR with a test for this?

murilobom commented 6 years ago

I have the same problem.

Hyldahl commented 6 years ago

It sounds like the problem is that not all IKVM DLLs are copied to the output folder during compilation. See: https://sourceforge.net/p/ikvm/mailman/message/35051603/

dolivanu commented 4 years ago

As pointed out by @Hyldahl, it's an issue with VS not copying all IKVM DLLs to the output folder. IMHO VS resolves only the IKVM assemblies it sees as needed by the project, so DLLs that are used only at runtime go missing, which causes the error. In this case, the problem seems related to IKVM.OpenJDK.Cldrdata.dll, which contains locale data; if you switch to a non-English language/locale, this DLL is required at runtime, hence the error. The error seems to happen when the project references the IKVM package indirectly, through another project that requires it. There are mainly two solutions: create a post-build action that copies the missing IKVM DLLs to the output folder; or reference the IKVM package directly in the target project (the one that is "executed", e.g. creates the EXE or hosts the web app). The second could be the simplest, since the set of IKVM DLLs can change between versions (it changed from 7.x to 8.x).
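
For anyone who wants to try the first option (the post-build copy), here is a rough sketch of what it could look like as an MSBuild target inside the `<Project>` element of the executable project's .csproj. Everything named here is an assumption for illustration: `IkvmLibDir`, `CopyIkvmRuntimeDlls` and `IkvmRuntimeDll` are made-up names, and the `IKVM.x.y.z` folder is a placeholder that has to be replaced with wherever the IKVM package actually unpacks its DLLs on your machine.

```xml
<!-- Sketch only. IkvmLibDir is a hypothetical property; "IKVM.x.y.z" is a
     placeholder for the real package folder and version in your solution. -->
<PropertyGroup>
  <IkvmLibDir>$(SolutionDir)packages\IKVM.x.y.z\lib\</IkvmLibDir>
</PropertyGroup>

<!-- Runs after the normal build and copies every IKVM DLL to the output
     folder, including ones the compiler did not resolve as direct
     references (such as IKVM.OpenJDK.Cldrdata.dll). -->
<Target Name="CopyIkvmRuntimeDlls" AfterTargets="Build">
  <ItemGroup>
    <!-- Hypothetical item name; the wildcard is evaluated when the target runs. -->
    <IkvmRuntimeDll Include="$(IkvmLibDir)IKVM.*.dll" />
  </ItemGroup>
  <Copy SourceFiles="@(IkvmRuntimeDll)"
        DestinationFolder="$(OutDir)"
        SkipUnchangedFiles="true" />
</Target>
```

The wildcard is there so the target does not need a hard-coded DLL list, which matters because the set of DLLs differs between IKVM versions. After a build, checking that IKVM.OpenJDK.Cldrdata.dll is present in the output folder is a quick way to see whether the copy worked.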