vorou closed this 8 years ago
Yes, I have seen weird problems with Tika and large .doc files. It does use a lot of memory when processing. Is this a memory-constrained system, say < 4GB? I've never had a problem when running with 8GB or more.
Sadly, this is likely not a tikaondotnet issue but a Tika one.
My laptop has only 4GB, but I could reproduce it on the prod server with 8GB.
Also, the non-IKVM version of Tika works just fine, so why do you think it's not an IKVM-related problem?
Sorry, I didn't see you talking about a non-IKVM version test. In that case it is likely related to the .doc parser, how it allocates objects, and IKVM not enjoying that scenario. This project is really just a repackaging of Tika on top of IKVM, so if you run into out-of-memory issues, one of the two is likely to blame.
Closing this pending more details.
I'd converted something like 10k .doc/.docx files successfully and then met this one.
Tika converts it in just a few seconds, but the IKVM version hangs forever with high resource usage. I've tried both current master and the IKVM8/Tika1.9 build.
UPD: If I re-save the file in Word 2013, both Java and IKVM versions are able to extract the text.
UPD2: It eventually blows up w/ OutOfMemory:
UPD3: full procmon output, in case you know what to look for.
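
For anyone hitting a similar hang during a large batch, one possible stopgap is to run each extraction on a worker thread with a time limit, so a single bad file gets skipped instead of stalling the whole run. The sketch below is plain Java using only `java.util.concurrent`; `ExtractionGuard` and `runWithTimeout` are made-up names for illustration, not part of Tika, IKVM, or tikaondotnet, and the `Callable` stands in for whatever extraction call you actually use.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not an API of Tika or tikaondotnet.
public class ExtractionGuard {

    /**
     * Runs the given task on a daemon worker thread.
     * Returns the task's result, or null if it does not finish within timeoutMillis.
     * Failures inside the task surface as ExecutionException.
     */
    public static String runWithTimeout(Callable<String> task, long timeoutMillis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "extraction-worker");
            t.setDaemon(true); // a stuck worker should not keep the JVM alive
            return t;
        });
        Future<String> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // only requests an interrupt; the task may ignore it
            return null;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note that this only stops the caller from waiting. A parser stuck in a tight loop that never checks for interruption will keep burning CPU and memory on the worker thread, so for files like this one, running the extraction in a separate process that can be killed is likely the more robust option.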