NoSuchMethodError createExtractor(Ljava/io/InputStream;)Lorg/apache/poi/POITextExtractor

huntext17 commented 2 years ago

Hi, this might be a simple issue to resolve, so I am just posting it here.

I am trying out poishadow-all.jar to extract the text from Office documents.

This is the code snippet.

    Parser officeParser = new OOXMLParser();
    stream = new FileInputStream(file);

    // Collected text, split into chunks of roughly MAXIMUM_TEXT_CHUNK_SIZE characters.
    final List<String> chunks = new ArrayList<String>();
    chunks.add("");
    final int MAXIMUM_TEXT_CHUNK_SIZE = 1000;
    ContentHandlerDecorator handler2 = new ContentHandlerDecorator() {
        @Override
        public void characters(char[] ch, int start, int length) {
            String lastChunk = chunks.get(chunks.size() - 1);
            String thisStr = new String(ch, start, length);

            if (lastChunk.length() + length > MAXIMUM_TEXT_CHUNK_SIZE) {
                // Appending would exceed the limit, so start a new chunk.
                chunks.add(thisStr);
            } else {
                chunks.set(chunks.size() - 1, lastChunk + thisStr);
            }
        }
    };

    officeParser.parse(stream, handler2, metadata, context);

And this is the error I am getting:

No static method createExtractor(Ljava/io/InputStream;)Lorg/apache/poi/POITextExtractor; in class Lorg/apache/poi/extractor/ExtractorFactory

Any help would be greatly appreciated. Thanks!

centic9 commented 2 years ago

Sounds like a library mismatch or a missing recompile/rebuild of the shadow library.

Do you pull in a different version of Apache POI from somewhere else via your Gradle dependencies?
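
One way to check for such a mismatch at runtime is to print where each class was actually loaded from. The following is only a minimal sketch: `ClassOrigin` and `describe` are illustrative names, not part of POI or Tika, and some runtimes report no code source at all, in which case the Gradle dependency tree is the better place to look.

```java
import java.security.CodeSource;

/** Illustrative helper (hypothetical name): reports which JAR or directory a class came from. */
final class ClassOrigin {

    /** Returns "className -> location", or a note if the class cannot be loaded or has no code source. */
    static String describe(String className) {
        try {
            // initialize=false: look the class up without running its static initializers
            Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // Runtime classes (and, on some platforms, all classes) report no code source
            if (source == null || source.getLocation() == null) {
                return className + " -> (no code source reported)";
            }
            return className + " -> " + source.getLocation();
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " -> not loadable: " + e;
        }
    }

    public static void main(String[] args) {
        // Class names taken from the error message and from the Tika parser used above
        System.out.println(describe("org.apache.poi.extractor.ExtractorFactory"));
        System.out.println(describe("org.apache.tika.parser.microsoft.ooxml.OOXMLParser"));
    }
}
```

If the two classes resolve to JARs built against different Apache POI versions, that would be consistent with the error above.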

huntext17 commented 2 years ago

Thank you so much for the quick response, I really appreciate it. No, I don't have any other POI sources, but I was able to resolve the issue by using ExtractorFactory directly. It works for docx and the other current Office formats such as xlsx and pptx, but not for older formats such as doc. It could be the same issue mentioned in the other thread, but no error was thrown in my case. I'll have to dig deeper into that. Since those formats are pretty old and not used much nowadays, this should be OK for us for now.

Thanks for this framework. If I make any enhancements or add features, I'll create a PR.

centic9 commented 2 years ago

Interesting.

OOXMLParser is your code, right? If you can extract this into a standalone snippet that reproduces the problem, I can check whether I can reproduce it as well.

huntext17 commented 2 years ago

Hi there, thanks for looking. OOXMLParser is from the Tika library. Here is the code snippet:

    Parser officeParser = new OOXMLParser();
    File file = new File(ANY_PATH);
    InputStream stream = new FileInputStream(file);
    // The post does not show these two; default Tika instances are assumed here.
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();

    final List<String> chunks = new ArrayList<String>();
    // Seed chunk so that chunks.get(chunks.size() - 1) is valid on the first call.
    chunks.add("");
    final int MAXIMUM_TEXT_CHUNK_SIZE = 1000;
    ContentHandlerDecorator handler2 = new ContentHandlerDecorator() {
        @Override
        public void characters(char[] ch, int start, int length) {
            String lastChunk = chunks.get(chunks.size() - 1);
            String thisStr = new String(ch, start, length);

            if (lastChunk.length() + length > MAXIMUM_TEXT_CHUNK_SIZE) {
                // Appending would exceed the limit, so start a new chunk.
                chunks.add(thisStr);
            } else {
                chunks.set(chunks.size() - 1, lastChunk + thisStr);
            }
        }
    };

    officeParser.parse(stream, handler2, metadata, context);

    // Final text
    StringBuilder stringBuilder = new StringBuilder();
    for (String chunk : chunks) {
        stringBuilder.append(chunk);
    }
    String text = stringBuilder.toString();
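
For anyone who wants to try the chunking logic without Tika or POI on the classpath, here is a rough standalone sketch of the same idea. `TextChunker` and its methods are made-up names for illustration only. Like the handler above, it needs an initial empty chunk to append to, and a single piece longer than the limit still ends up as one oversized chunk.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in (hypothetical name) for the chunking done in the handler above. */
final class TextChunker {
    private final int maxChunkSize;
    private final List<String> chunks = new ArrayList<>();

    TextChunker(int maxChunkSize) {
        this.maxChunkSize = maxChunkSize;
        // Seed with an empty chunk so there is always a "last chunk" to extend
        chunks.add("");
    }

    /** Same contract as ContentHandler.characters: a slice of a character buffer. */
    void append(char[] ch, int start, int length) {
        int last = chunks.size() - 1;
        String lastChunk = chunks.get(last);
        String piece = new String(ch, start, length);
        if (lastChunk.length() + length > maxChunkSize) {
            // Extending the last chunk would exceed the limit: start a new one
            chunks.add(piece);
        } else {
            chunks.set(last, lastChunk + piece);
        }
    }

    /** Returns a copy of the chunks collected so far. */
    List<String> chunks() {
        return new ArrayList<>(chunks);
    }

    /** Joins all chunks back into the full text. */
    String text() {
        return String.join("", chunks);
    }
}
```

In the Tika case, `append` would be called from `characters(...)` and `text()` would replace the StringBuilder loop.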

centic9 commented 2 years ago

That is likely the problem then: Tika probably depends on a different version of Apache POI than the one provided via "poishadow.jar". You could choose a version of Tika which depends on exactly the same version of Apache POI as the one that poishadow.jar uses.
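
The descriptor in the error, `createExtractor(Ljava/io/InputStream;)Lorg/apache/poi/POITextExtractor;`, describes a static method that takes a `java.io.InputStream` and returns `org.apache.poi.POITextExtractor`. To see which overloads the class on your classpath really offers, something like the sketch below might help. `OverloadLister` and `staticOverloads` are hypothetical names, and the code only uses plain Java reflection.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative helper (hypothetical name): lists the public static overloads of a method. */
final class OverloadLister {

    /** Returns lines such as "returnType name(paramType, ...)" for every public static overload. */
    static List<String> staticOverloads(String className, String methodName)
            throws ClassNotFoundException {
        // initialize=false: inspect the class without running its static initializers
        Class<?> cls = Class.forName(className, false, OverloadLister.class.getClassLoader());
        List<String> result = new ArrayList<>();
        for (Method m : cls.getMethods()) {
            if (m.getName().equals(methodName) && Modifier.isStatic(m.getModifiers())) {
                String params = Arrays.stream(m.getParameterTypes())
                        .map(Class::getTypeName)
                        .collect(Collectors.joining(", "));
                result.add(m.getReturnType().getTypeName() + " " + methodName + "(" + params + ")");
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Class and method name taken from the error message in this issue
        for (String line : staticOverloads("org.apache.poi.extractor.ExtractorFactory", "createExtractor")) {
            System.out.println(line);
        }
    }
}
```

If no printed line reads `org.apache.poi.POITextExtractor createExtractor(java.io.InputStream)`, the caller was compiled against a different POI version than the one in the JAR.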

huntext17 commented 2 years ago

That makes sense. Thanks much! Appreciate it!

centic9 / poi-on-android

NoSuchMethodError createExtractor(Ljava/io/InputStream;)Lorg/apache/poi/POITextExtractor #99