Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

PhantomJs exit value 137 causes NPE #408

Closed by ebbesson 6 years ago

ebbesson commented 6 years ago

I'm experiencing issues while using the PhantomJSFetcher. Every other run or so, PhantomJS exits with value 137, and this seems to cause an NPE when checking the content type.

```
ERROR SystemCommand:304 - Command returned with exit value 137 (command properly escaped?).
  Command: ./phantomjs-2.1.1-linux-x86_64/bin/phantomjs --ssl-protocol=any --ignore-ssl-errors=true --web-security=false --cookies-file=/tmp/cookies.txt --load-images=false /app/tron/crawler/scripts/phantom.js http://example.com/path/more/page/ /tmp/1507889738145000000 1000 -1 http sepu 1.0 Error: ""
INFO  REJECTED_ERROR:67 - REJECTED_ERROR: http://example.com/path/more/page/
ERROR AbstractCrawler:549 - SHB crawler attachment: Could not process document: http://example.com/path/more/page/ (null)
java.lang.NullPointerException
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.isHTMLByContentType(PhantomJSDocumentFetcher.java:640)
    at com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.fetchDocument(PhantomJSDocumentFetcher.java:507)
    at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.executeStage(DocumentFetcherStage.java:42)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:358)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:521)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:407)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:789)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
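For context: an exit value of 137 is 128 + 9, meaning the phantomjs process was terminated with SIGKILL. On Linux this is commonly the kernel OOM killer reaping a process that used too much memory, which would explain why no content type was ever captured. The 128+signal convention is easy to verify:

```shell
# Exit status 137 = 128 + signal number 9 (SIGKILL).
sh -c 'kill -9 $$'   # child process kills itself with SIGKILL
echo $?              # prints 137
```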

I'm running the following versions

ebbesson commented 6 years ago

I've currently mitigated this by adding a `StringUtils.isBlank()` check on `contentType` in `com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher.isHTMLByContentType`, and this seems to work. I suspect the root cause is further up in the code, though.
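As a rough sketch (not the actual Norconex patch), the blank-check mitigation described above would look something like this; the class name and the exact content-type matching are illustrative, and a plain null/blank test is used here instead of Commons Lang's `StringUtils.isBlank()` to keep the example dependency-free:

```java
// Hypothetical sketch of guarding isHTMLByContentType() against the
// null content type left behind when PhantomJS dies with exit 137.
public class ContentTypeGuard {

    static boolean isHTMLByContentType(String contentType) {
        // No content type was captured (e.g. fetch process was killed):
        // treat the document as non-HTML instead of throwing an NPE.
        if (contentType == null || contentType.trim().isEmpty()) {
            return false;
        }
        String ct = contentType.trim().toLowerCase();
        return ct.startsWith("text/html")
                || ct.startsWith("application/xhtml+xml");
    }

    public static void main(String[] args) {
        System.out.println(isHTMLByContentType(null));                       // false
        System.out.println(isHTMLByContentType("text/html; charset=UTF-8")); // true
    }
}
```

Returning `false` on a blank content type only suppresses the NPE; as noted above, the document is still rejected, so the underlying 137 exit needs its own fix (e.g. more memory for PhantomJS).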

essiembre commented 6 years ago

Thanks for reporting and finding the cause! The fix will be in the next snapshot release. I will let you know when it is available.

essiembre commented 6 years ago

This is now fixed in the latest snapshot release. Please confirm.

essiembre commented 6 years ago

Closing for lack of feedback. Please re-open if you witness any issues with the fix.