KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform
http://kevm.github.io/tikaondotnet/
Apache License 2.0

Exception on a simple pdf file extraction #158

Open wis-niowy opened 1 year ago

wis-niowy commented 1 year ago

I started playing with TikaOnDotnet today and created a simple test case that extracts text from a PDF file. Unfortunately, calling TextExtractor.Extract() fails for both overloads (the one taking a byte[] and the one taking a string path). The exception chain is:

TextExtractionException: Extraction failed.
TypeInitializationException: The type initializer for 'java.nio.charset.StandardCharsets' threw an exception.
TypeLoadException: Could not load type 'System.Reflection.Emit.MethodToken' from assembly 'mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089'.

The code to reproduce is very simple:

var tikaResult_path = new TextExtractor().Extract(pathToPdf);
// (..)
// Read the file as raw bytes (requires using System.IO).
// StreamReader.ReadToEndAsync() returns a string, so it cannot feed the byte[] overload.
var bytes = await File.ReadAllBytesAsync(pathToPdf);
var tikaResult_bytes = new TextExtractor().Extract(bytes);

Both calls fail with the same exception chain.

Installed version of TikaOnDotNet.TextExtraction: 1.17.1 (published Tuesday, April 3, 2018).

I saw this comment in another issue: https://github.com/KevM/tikaondotnet/issues/118#issuecomment-551432052 I checked whether the DLLs mentioned there (e.g. IKVM.OpenJDK.Cldrdata.dll) get copied to the output folder, and they do.