Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

[DOMSplitter] JSoup issue with norconex-importer 2.5.2 #23

Closed sylvainroussy closed 8 years ago

sylvainroussy commented 8 years ago

Hi!

I get the following exception when I use the DOMSplitter:

java.lang.NoSuchMethodError: org.jsoup.nodes.Element.cssSelector()Ljava/lang/String;
    at com.norconex.importer.handler.splitter.impl.DOMSplitter.splitApplicableDocument(DOMSplitter.java:151)

With configuration:

<importer>
  <preParseHandlers>
    <splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
        selector=".caption" sourceCharset="UTF-8"/>
    [...]
</importer>

essiembre commented 8 years ago

I just tried this config snippet with 2.5.2 and could not reproduce the error. Not finding the method suggests an invalid library version or a similar issue.

Did you unzip a fresh copy of version 2.5.2 before trying this? I wonder if you have mixed jsoup library versions in your classpath. I have "jsoup-1.8.3.jar" in the test I just did, and I can confirm the class/method reported in your error exists in that JAR. Another thing to consider is that the file may be corrupted for some reason. I doubt this is the cause, but maybe try downloading it again just in case.
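One way to check for a stale or mixed jsoup version at runtime is to ask the JVM where it loaded the class from and whether the method named in the error exists. The following is only a minimal sketch, not part of Norconex Importer: the class name JsoupClasspathCheck and its helper methods are hypothetical, and it uses nothing beyond standard Java reflection. Run it with the same classpath as the importer.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper (not part of Norconex Importer or jsoup).
public class JsoupClasspathCheck {

    /** Returns the location a class was loaded from, or a marker for JDK classes. */
    public static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? "(bootstrap/JDK)" : String.valueOf(source.getLocation());
    }

    /** True if the class can be loaded and has a public no-argument method with this name. */
    public static boolean hasNoArgMethod(String className, String methodName) {
        try {
            Class.forName(className).getMethod(methodName);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class and method taken from the NoSuchMethodError reported above.
        String className = "org.jsoup.nodes.Element";
        try {
            Class<?> element = Class.forName(className);
            System.out.println("Loaded from: " + locationOf(element));
            System.out.println("cssSelector() present: "
                    + hasNoArgMethod(className, "cssSelector"));
        } catch (ClassNotFoundException e) {
            System.out.println("jsoup is not on the classpath");
        }
    }
}
```

If the reported location points at an older jsoup JAR than expected, or the method is reported as absent, the classpath is the likely culprit rather than the importer configuration.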

sylvainroussy commented 8 years ago

Hello! My jsoup version is 1.7.2:

norconex-collector-http (2.5.0)
  > norconex-collector-core (1.5.0)
    > norconex-importer (2.5.2)
      > tika-parsers (1.12)
        > grib (4.5.5)
          > jsoup (1.7.2)

I am not mixing jsoup versions in my pom.xml or classpath. Adding a more recent version of jsoup works.

essiembre commented 8 years ago

Good catch. Norconex Importer had JSoup 1.8.3 as a managed dependency to bypass the version that comes with Tika, but that was not carried through to the HTTP Collector. This is now fixed in the latest snapshot release of HTTP Collector (it should now have JSoup 1.8.3).

I am closing this since we have a working fix.

sveba commented 7 years ago

The problem still persists. When using the HTTP Collector as a Maven dependency, jsoup is still downloaded in version 1.7.2, and this breaks the DOMSplitter at line 171: String childEmbedRef = elm.cssSelector();

essiembre commented 7 years ago

When using Maven, the issue is different. Look here for a solution: https://github.com/Norconex/collector-http/issues/356

sveba commented 7 years ago

That's exactly what I did :)