Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

[DOMSplitter] JSoup issue : NoSuchMethodError: org.jsoup.nodes.Element.cssSelector()Ljava/lang/String #356

Closed sylvainroussy closed 7 years ago

sylvainroussy commented 7 years ago

Hi, jsoup version 1.7.2 (pulled in transitively through the tika-parsers dependency) causes a NoSuchMethodError in DOMSplitter. The same code works fine with jsoup 1.9.1:

ERROR - AbstractCrawler            - 3_3: Could not process document: http://www.bfmtv.com/rss/international/ (org.jsoup.nodes.Element.cssSelector()Ljava/lang/String;)
java.lang.NoSuchMethodError: org.jsoup.nodes.Element.cssSelector()Ljava/lang/String;
    at com.norconex.importer.handler.splitter.impl.DOMSplitter.splitApplicableDocument(DOMSplitter.java:171)
    at com.norconex.importer.handler.splitter.AbstractDocumentSplitter.splitDocument(AbstractDocumentSplitter.java:57)
    at com.norconex.importer.Importer.splitDocument(Importer.java:585)
    at com.norconex.importer.Importer.executeHandlers(Importer.java:355)
    at com.norconex.importer.Importer.importDocument(Importer.java:309)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:271)
    at com.norconex.importer.Importer.importDocument(Importer.java:195)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:358)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:521)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:407)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:789)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

essiembre commented 7 years ago

This is why jsoup 1.10.2 is shipped with Norconex HTTP Collector, not 1.7.2. Are you getting a different version? You can see it declared as a managed dependency in the HTTP Collector project's pom.xml.

sylvainroussy commented 7 years ago

Weird, the jsoup 1.10.2 dependency is indeed in the dependencyManagement section of the HTTP Collector pom. My dependency hierarchy is:

norconex-collector-http:2.7.1
  norconex-collector-core:1.8.2
    norconex-importer:2.7.2
      tika-parsers:1.14
        grib:4.5.5
          jsoup:1.7.2

essiembre commented 7 years ago

Yeah, dependencyManagement is used to manage the versions of transitive dependencies here, and managed dependencies are not carried through when another Maven project references this one as a dependency. So you would have to copy the dependencyManagement section into your own project.
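
For illustration, the jsoup part of that copy could look roughly like the fragment below, placed in the consuming project's own pom.xml. This is a minimal sketch, not the actual section from the HTTP Collector pom: it assumes jsoup's usual Maven coordinates (org.jsoup:jsoup) and takes the 1.10.2 version from the discussion above. The real section may manage other dependencies as well.

```xml
<!-- Sketch only: goes directly under <project> in your own pom.xml. -->
<!-- Pins the jsoup version that Maven resolves for transitive dependencies, -->
<!-- so the 1.7.2 version pulled in via tika-parsers/grib is overridden. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <!-- Version taken from this thread; adjust to match your Collector release. -->
      <version>1.10.2</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Running `mvn dependency:tree -Dincludes=org.jsoup` afterwards should show which jsoup version is actually resolved.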

Because there is little value in handling transitive dependencies that way over simply declaring them as regular dependencies, I will consider changing this in a future release to make referencing the project with Maven a bit easier.

sylvainroussy commented 7 years ago

Ok, thanks!