Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

PhantomJs exit value 137 causes NPE #408

Closed by ebbesson 6 years ago

ebbesson commented 6 years ago

I'm experiencing issues while using the PhantomJSFetcher. Every other run or so, PhantomJS exits with value 137, and this seems to cause an NPE when checking the content type.

```
ERROR SystemCommand:304 - Command returned with exit value 137 (command properly escaped?).
  Command: ./phantomjs-2.1.1-linux-x86_64/bin/phantomjs --ssl-protocol=any --ignore-ssl-errors=true --web-security=false --cookies-file=/tmp/cookies.txt --load-images=false /app/tron/crawler/scripts/phantom.js http://example.com/path/more/page/ /tmp/1507889738145000000 1000 -1 http sepu 1.0 Error: ""
INFO  REJECTED_ERROR:67 - REJECTED_ERROR: http://example.com/path/more/page/
ERROR AbstractCrawler:549 - SHB crawler attachment: Could not process document: http://example.com/path/more/page/ (null)
java.lang.NullPointerException
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.isHTMLByContentType(PhantomJSDocumentFetcher.java:640)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.fetchDocument(PhantomJSDocumentFetcher.java:507)
    at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.executeStage(DocumentFetcherStage.java:42)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:358)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:521)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:407)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:789)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
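For context: an exit value of 137 is 128 + 9, meaning the phantomjs process was terminated with SIGKILL. On Linux this is commonly the kernel OOM killer reaping a process that used too much memory, which would explain why no content type was ever captured. The 128+signal convention is easy to verify:

```shell
# Exit status 137 = 128 + signal number 9 (SIGKILL).
sh -c 'kill -9 $$'   # child process kills itself with SIGKILL
echo $?              # prints 137
```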

I'm running the following versions

ebbesson commented 6 years ago

I've currently mitigated this by adding a `StringUtils.isBlank()` check on `contentType` in `com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.isHTMLByContentType`, and this seems to work. I suspect the root cause is further up in the code, though.
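As a rough sketch (not the actual Norconex patch), the blank-check mitigation described above would look something like this; the class name and the exact content-type matching are illustrative, and a plain null/blank test is used here instead of Commons Lang's `StringUtils.isBlank()` to keep the example dependency-free:

```java
// Hypothetical sketch of guarding isHTMLByContentType() against the
// null content type left behind when PhantomJS dies with exit 137.
public class ContentTypeGuard {

    static boolean isHTMLByContentType(String contentType) {
        // No content type was captured (e.g. fetch process was killed):
        // treat the document as non-HTML instead of throwing an NPE.
        if (contentType == null || contentType.trim().isEmpty()) {
            return false;
        }
        String ct = contentType.trim().toLowerCase();
        return ct.startsWith("text/html")
                || ct.startsWith("application/xhtml+xml");
    }

    public static void main(String[] args) {
        System.out.println(isHTMLByContentType(null));                       // false
        System.out.println(isHTMLByContentType("text/html; charset=UTF-8")); // true
    }
}
```

Returning `false` on a blank content type only suppresses the NPE; as noted above, the document is still rejected, so the underlying 137 exit needs its own fix (e.g. more memory for PhantomJS).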

essiembre commented 6 years ago

Thanks for reporting and finding the cause! The fix will be in the next snapshot release. I will let you know when it is available.

essiembre commented 6 years ago

This is now fixed in the latest snapshot release. Please confirm.

essiembre commented 6 years ago

Closing for lack of feedback. Please re-open if you witness any issues with the fix.