huntext17 closed 2 years ago
Sounds like a library mismatch or a missing recompile/rebuild of the shadow library.
Do you pull in Apache POI sources from a different version from somewhere else via your Gradle dependencies?
Thank you so much for the quick response, I really appreciate it. No, I don't have any other POI sources, but I was able to resolve the issue by using ExtractorFactory directly. It works for docx and other recent Office formats such as xlsx and pptx, but not for older formats such as doc. It could be the same issue mentioned in the other thread, but no error was thrown in my case. I'll have to dig deeper on that. Since those formats are old and not used much nowadays, this should be OK for us for now.
Thanks for this framework. If I make any enhancements or add features, I'll create a PR.
Interesting.
OOXMLParser is your code, right? If you can extract this into a standalone snippet that reproduces the problem, I can check whether it is reproducible for me.
Hi there, thanks for looking. OOXMLParser is from the Tika library. Here is the code snippet.
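import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.ContentHandlerDecorator;

Parser officeParser = new OOXMLParser();
File file = new File(ANY_PATH);
// metadata and context were not declared in the posted snippet; plain defaults are assumed here
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
final List<String> chunks = new ArrayList<String>();
// Seed one empty chunk so that chunks.get(chunks.size() - 1) cannot fail on the first call
chunks.add("");
final int MAXIMUM_TEXT_CHUNK_SIZE = 1000;
ContentHandlerDecorator handler2 = new ContentHandlerDecorator() {
    @Override
    public void characters(char[] ch, int start, int length) {
        String lastChunk = chunks.get(chunks.size() - 1);
        String thisStr = new String(ch, start, length);
        if (lastChunk.length() + length > MAXIMUM_TEXT_CHUNK_SIZE) {
            // The current chunk would grow past the limit, so start a new one
            chunks.add(thisStr);
        } else {
            chunks.set(chunks.size() - 1, lastChunk + thisStr);
        }
    }
};
// Close the stream once parsing is done
try (InputStream stream = new FileInputStream(file)) {
    officeParser.parse(stream, handler2, metadata, context);
}
// Final text
StringBuilder stringBuilder = new StringBuilder();
for (String chunk : chunks) {
    stringBuilder.append(chunk);
}
String text = stringBuilder.toString();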
That is then likely the problem: Tika probably depends on a different version of Apache POI than the one provided via "poishadow.jar". You could choose a version of Tika that depends on exactly the same version of Apache POI as the one poishadow.jar uses.
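One way to confirm such a mismatch is to ask the runtime which createExtractor overloads the loaded ExtractorFactory actually has, and compare them with the signature in the error message. The following is only a rough sketch using plain Java reflection; the class name SignatureCheck and the helper staticSignatures are made up for illustration and are not part of POI, Tika or this project.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of POI or Tika
class SignatureCheck {

    // Returns the signatures of all public static methods named methodName
    // on the class that the current class loader resolves for className.
    static List<String> staticSignatures(String className, String methodName) {
        List<String> found = new ArrayList<>();
        try {
            // Load without running static initializers
            Class<?> cls = Class.forName(className, false, SignatureCheck.class.getClassLoader());
            for (Method m : cls.getMethods()) {
                if (m.getName().equals(methodName) && Modifier.isStatic(m.getModifiers())) {
                    // Includes the return type and the parameter types
                    found.add(m.toString());
                }
            }
        } catch (ClassNotFoundException | LinkageError e) {
            found.add("class not loadable: " + e);
        }
        return found;
    }

    public static void main(String[] args) {
        // Class and method name are taken from the error message in this thread
        for (String s : staticSignatures("org.apache.poi.extractor.ExtractorFactory", "createExtractor")) {
            System.out.println(s);
        }
    }
}
```

If none of the printed signatures takes a java.io.InputStream and returns org.apache.poi.POITextExtractor, the calling library was most likely compiled against a different POI version than the one on the classpath.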
That makes sense. Thanks much! Appreciate it!
Hi, this might be a simple issue to resolve, so I am just posting it here.
I am trying out poishadow-all.jar to extract the text from Office documents.
This is the code snippet.
And this is the error I am getting:
No static method createExtractor(Ljava/io/InputStream;)Lorg/apache/poi/POITextExtractor; in class Lorg/apache/poi/extractor/ExtractorFactory
Any help would be greatly appreciated. Thanks!