KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform
http://kevm.github.io/tikaondotnet/
Apache License 2.0

IKVM version of Tika hangs during Word file extracting #28

Closed vorou closed 8 years ago

vorou commented 9 years ago

I'd converted something like 10k .doc/.docx files successfully and then ran into this one.

The Java version of Tika converts it in just a few seconds, but the IKVM version hangs forever with high resource usage. I've tried both the current master and the IKVM8/Tika1.9 build.

UPD: If I re-save the file in Word 2013, both Java and IKVM versions are able to extract the text.

UPD2: It eventually blows up w/ OutOfMemory:

Unhandled Exception: TikaOnDotNet.TextExtractionException: Extraction of text from the file 'C:\Users\vorou\Desktop\ftw\input\hang-doc.docx' failed. ---> TikaOnDotNet.TextExtractionException: Extraction failed. ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(SchemaTypeLoader stl, InputStream is, SchemaType type, XmlOptions options)
   at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(InputStream jiois, SchemaType type, XmlOptions options)
   at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument.Factory.parse(InputStream is)
   at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead()
   at org.apache.poi.POIXMLDocument.load(POIXMLFactory factory)
   at org.apache.poi.xwpf.usermodel.XWPFDocument..ctor(OPCPackage pkg)
   at org.apache.poi.xwpf.extractor.XWPFWordExtractor..ctor(OPCPackage container)
   at org.apache.poi.extractor.ExtractorFactory.createExtractor(OPCPackage pkg)
   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(InputStream stream, ContentHandler baseHandler, Metadata metadata, ParseContext context)
   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
   at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
   at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
   at org.apache.tika.parser.AutoDetectParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
   at TikaOnDotNet.TextExtractor.Extract(Func`2 streamFactory) in c:\Users\vorou\code\tikaondotnet\src\TikaOnDotnet\TextExtractor.cs:line 92
   --- End of inner exception stack trace ---
   at TikaOnDotNet.TextExtractor.Extract(Func`2 streamFactory) in c:\Users\vorou\code\tikaondotnet\src\TikaOnDotnet\TextExtractor.cs:line 108
   at TikaOnDotNet.TextExtractor.Extract(String filePath) in c:\Users\vorou\code\tikaondotnet\src\TikaOnDotnet\TextExtractor.cs:line 51
   --- End of inner exception stack trace ---
   at TikaOnDotNet.TextExtractor.Extract(String filePath) in c:\Users\vorou\code\tikaondotnet\src\TikaOnDotnet\TextExtractor.cs:line 60
   at Word2Txt.Program.Main() in C:\Users\vorou\code\Word2Txt\Word2Txt\Program.cs:line 19

UPD3: full procmon output, in case you know what to look for.

KevM commented 9 years ago

Yes, I have seen weird problems with Tika and large .doc files. It does use a lot of memory when processing. Is this a memory-constrained system, say < 4GB? I've never had a problem when running with 8GB or more.

Sadly, this is likely not a tikaondotnet issue but a Tika one.

vorou commented 9 years ago

My laptop has only 4GB, but I could reproduce it on a prod server with 8GB.

Also, the non-IKVM version of Tika works just fine, so why do you think it's not an IKVM-related problem?

KevM commented 9 years ago

Sorry, I didn't see that you had tested a non-IKVM version. In that case it is likely related to how the .doc parser allocates objects, with IKVM not handling that scenario well. This project is really just a repackaging of Tika on top of IKVM, so if you run into out-of-memory issues, one of the two is likely to blame.
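
Until the root cause is pinned down, one way to keep a single hanging file from stalling a whole batch is to run each extraction under a time limit. Below is a minimal Java sketch using only `java.util.concurrent`. The names `ExtractionTimeout` and `runWithTimeout` are hypothetical, and the lambdas stand in for whatever extraction call you actually use; none of this is part of Tika or tikaondotnet. Note that cancelling only interrupts the worker thread, so a parser that ignores interrupts will keep burning CPU and memory in the background. Running the extraction in a separate process would be a sturdier form of isolation.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ExtractionTimeout {

    // Runs `task` on a worker thread and returns its result, or null if the
    // task does not finish within `timeoutMillis`.
    public static String runWithTimeout(Callable<String> task, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "extraction-worker");
            t.setDaemon(true); // do not keep the JVM alive if the task never returns
            return t;
        });
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: this only interrupts the worker
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for a real extraction call (assumption, not a Tika API).
        String fast = runWithTimeout(() -> "extracted text", 1000);
        String slow = runWithTimeout(() -> { Thread.sleep(5000); return "never"; }, 100);
        System.out.println(fast);
        System.out.println(slow == null ? "timed out" : slow);
    }
}
```

A timeout like this does not fix the hang or the OutOfMemory failure; it only lets the caller log the offending file and move on to the next one.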

KevM commented 8 years ago

Closing this pending more details.