Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Java exception: NoSuchMethodError when running minimum test #286

Closed pcolmer closed 8 years ago

pcolmer commented 8 years ago

I've downloaded the software onto an Ubuntu 14.04 system with this java:

java version "1.7.0_111"
OpenJDK Runtime Environment (IcedTea 2.6.7) (7u111-2.6.7-0ubuntu0.14.04.3)
OpenJDK 64-Bit Server VM (build 24.111-b01, mixed mode)

However, running the minimum test gives me an error:

./collector-http.sh -a start -c examples/minimum/minimum-config.xml
INFO [AbstractCollectorConfig] Configuration loaded: id=Minimum Config HTTP Collector; logsDir=./examples-output/minimum/logs; progressDir=./examples-output/minimum/progress
INFO [JobSuite] JEF work directory is: ./examples-output/minimum/progress
INFO [JobSuite] JEF log manager is : FileLogManager
INFO [JobSuite] JEF job status store is : FileJobStatusStore
INFO [AbstractCollector] Suite of 1 crawler jobs created.
INFO [JobSuite] Initialization...
INFO [JobSuite] No previous execution detected.
INFO [JobSuite] Starting execution.
INFO [AbstractCollector] Version: Norconex HTTP Collector 2.5.1 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Collector Core 1.5.0 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Importer 2.5.2 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex JEF 4.0.7 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Committer Core 2.0.3 (Norconex Inc.)
INFO [JobSuite] Running Norconex Minimum Test Page: BEGIN (Wed Aug 17 08:50:53 UTC 2016)
INFO [HttpCrawler] Norconex Minimum Test Page: RobotsTxt support: true
INFO [HttpCrawler] Norconex Minimum Test Page: RobotsMeta support: true
INFO [HttpCrawler] Norconex Minimum Test Page: Sitemap support: false
INFO [HttpCrawler] Norconex Minimum Test Page: Canonical links support: true
INFO [HttpCrawler] Norconex Minimum Test Page: User-Agent:
INFO [AbstractCrawler] Norconex Minimum Test Page: Crawler executed in 0 second.
FATAL [JobSuite] Fatal error occured in job: Norconex Minimum Test Page
INFO [JobSuite] Running Norconex Minimum Test Page: END (Wed Aug 17 08:50:53 UTC 2016)
FATAL [JobSuite] Job suite execution failed: Norconex Minimum Test Page
java.lang.NoSuchMethodError: org.apache.http.impl.client.HttpClientBuilder.setSSLContext(Ljavax/net/ssl/SSLContext;)Lorg/apache/http/impl/client/HttpClientBuilder;
    at com.norconex.collector.http.client.impl.GenericHttpClientFactory.createHTTPClient(GenericHttpClientFactory.java:288)
    at com.norconex.collector.http.crawler.HttpCrawler.initializeHTTPClient(HttpCrawler.java:352)
    at com.norconex.collector.http.crawler.HttpCrawler.prepareExecution(HttpCrawler.java:114)
    at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:199)
    at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:174)
    at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
    at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:350)
    at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:300)
    at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:172)
    at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:120)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:80)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)

Am I missing something else that I need to install on Ubuntu?

Thanks.

pcolmer commented 8 years ago

I've made some progress ... this seems to be caused by copying files from the lib directory of the CloudSearch committer into the collector's lib directory. If I wipe the collector's lib directory and restore it from the original collector download, the code works.

I'll see if I can pin it down to a specific jar file.

pcolmer commented 8 years ago

So it turns out that the CloudSearch download ships an unfortunate mix of jars: of the ones that also exist in the HTTP Collector, some are older and some are newer!

Making matters worse, because the version number is part of each filename, simply copying the contents of lib isn't safe: you end up with multiple copies of the same jar under different versions. I do understand why the version number is in the filename, though.

In this particular instance, I suspect it was httpclient and httpcore that were causing the problem. HTTP Collector ships with httpclient 4.5.2 and httpcore 4.4.4, while CloudSearch Committer ships with httpclient 4.3.6 and httpcore 4.3.3.
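For anyone hitting the same thing, one quick way to spot such clashes is to list artifacts that appear in lib under more than one version. A minimal sketch in shell, assuming the usual name-1.2.3.jar naming convention (the function name is mine, not part of the collector):

```shell
# list_duplicate_jars DIR
# Prints artifact names that appear with more than one version in DIR,
# e.g. both httpclient-4.3.6.jar and httpclient-4.5.2.jar -> "httpclient".
# A sketch; assumes the usual "name-1.2.3.jar" naming convention.
list_duplicate_jars() {
  for f in "$1"/*.jar; do
    basename "$f" .jar          # strip path and .jar suffix
  done | sed -E 's/-[0-9]+(\.[0-9]+)*$//' \
       | sort | uniq -d         # keep only names seen more than once
}
```

Running it against the collector's lib directory after copying the committer jars over would have flagged httpclient and httpcore immediately.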

If I may suggest this, I think it would be really helpful if the common jar files were kept to consistent versions across the downloads and/or split them out into a separate download.

essiembre commented 8 years ago

Definitely, you need to make sure you only have one version (ideally the latest) of each jar. When copying committer jars over, it is tricky for us to keep the jar versions in sync, since people may be working with different versions of the collectors, so the problem can always occur no matter what we ship. We thought of not including jars already present in collectors when we package committers, but again, different collectors (e.g. FileSystem vs. HTTP) ship with different jars, so it varies. Maybe we should introduce some quick jar-version checking on startup. We can make this a feature request if you like.
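Such a startup check could be as simple as a pre-flight step in the launch script. A sketch under the same name-version.jar assumption; nothing like this exists in the shipped collector-http.sh, and the function name is hypothetical:

```shell
# check_lib_versions DIR
# Hypothetical pre-flight check a launch script could run before starting
# the collector: refuse to launch if any artifact in DIR is present in
# more than one version. Assumes "name-1.2.3.jar" style file names.
check_lib_versions() {
  dups=$(for f in "$1"/*.jar; do
           basename "$f" .jar
         done | sed -E 's/-[0-9]+(\.[0-9]+)*$//' | sort | uniq -d)
  if [ -n "$dups" ]; then
    echo "ERROR: multiple versions found in $1 for: $dups" >&2
    return 1
  fi
}
```

Called near the top of the launch script (e.g. `check_lib_versions ./lib || exit 1`), it would have turned the opaque NoSuchMethodError into an explicit message about the duplicated httpclient jars.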

For now though, are you OK after cleaning up the jar versions? Can this be closed?

pcolmer commented 8 years ago

Hi

Yes, I'm OK with this being closed.

Thanks.