Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

PhantomJs for Fetching Dynamic Data #545

Status: Closed (HappyCustomers closed this issue 5 years ago)

HappyCustomers commented 5 years ago

Hi Pascal,

Ref: Norconex/importer issue #87, "Import only certain text from HTML file" (https://github.com/Norconex/importer/issues/87)

Based on your advice to use PhantomJS for fetching dynamic data, I tried implementing it.

However, in version 2.8.0 I am not getting any error, but it is not fetching the dynamic data.

In version 2.8.1 I am getting the following error:

REJECTED_BAD_STATUS: https://xyz.com (HttpFetchResponse [crawlState=BAD_STATUS, statusCode=-1, reasonPhrase=null])

I have sent the config file by email.

Thank you.

essiembre commented 5 years ago

That error is a generic one, raised when PhantomJS fails to execute properly. Have you tried both with and without the proxy solution described in the documentation? Also, make sure the log level is set to DEBUG and see if you get more insight.
Finally, check the logs for the command that is executed on the filesystem and try to run it manually to confirm whether you get content (you may have to modify the arguments). Let me know the outcome of trying the above.
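
For the manual run, something like the following may help. It is only a sketch in Python under stated assumptions: it is not the command HTTP Collector itself executes (take that one from the logs), the script name `render.js` and the helper names are made up for illustration, and `phantomjs` is assumed to be on the PATH. It writes a minimal PhantomJS script that prints the rendered page, then runs it with `--debug=true` and, unless told otherwise, `--proxy-type=none` to rule out a proxy.

```python
import subprocess
import tempfile
from pathlib import Path

# Minimal PhantomJS script: open the URL passed as the first argument and
# print the rendered DOM; exit with status 1 when the page fails to load.
PHANTOM_SCRIPT = """
var system = require('system');
var page = require('webpage').create();
page.open(system.args[1], function (status) {
    if (status !== 'success') {
        console.log('FAILED to load: ' + system.args[1]);
        phantom.exit(1);
    } else {
        console.log(page.content);
        phantom.exit(0);
    }
});
"""


def build_phantomjs_command(url, script_path, phantomjs_bin="phantomjs", use_proxy=False):
    """Return the argument list for a manual PhantomJS run (hypothetical helper)."""
    cmd = [phantomjs_bin, "--debug=true"]
    if not use_proxy:
        # Explicitly disable any proxy so it can be ruled out as the cause.
        cmd.append("--proxy-type=none")
    cmd += [str(script_path), url]
    return cmd


def fetch_rendered_html(url, timeout=60):
    """Run PhantomJS on the URL and return (exit code, stdout, stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "render.js"
        script.write_text(PHANTOM_SCRIPT)
        result = subprocess.run(
            build_phantomjs_command(url, script),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    return result.returncode, result.stdout, result.stderr
```

If `fetch_rendered_html("https://xyz.com")` returns a non-zero exit code outside of HTTP Collector as well, the problem most likely lies with PhantomJS or the site rather than with the collector configuration.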

HappyCustomers commented 5 years ago

Thanks, Pascal, for the quick response. I will set the log level to DEBUG and check. One question, however: there is no error in version 2.8.0, but in version 2.8.1 I get the REJECTED_BAD_STATUS error. There is no proxy setting as such.

essiembre commented 5 years ago

There were a few fixes to PhantomJSDocumentFetcher in 2.8.1. That is probably why you see a difference. Before, it may have been failing silently, whereas now it shows an error.

HappyCustomers commented 5 years ago

OK. In DEBUG mode I am getting the following error:

[DEBUG] Network - Resource request error: QNetworkReply::NetworkError(OperationCanceledError) ( "Operation canceled" ) URL:

essiembre commented 5 years ago

Those can be difficult to troubleshoot. See if you can successfully download your page using PhantomJS from the command line. It may be easier to troubleshoot PhantomJS issues that way.

You can see other people have the same problem with PhantomJS:

https://github.com/ariya/phantomjs/issues/13806
https://github.com/ariya/phantomjs/issues/12750#issuecomment-281364082

It seems to be a PhantomJS bug. The second link points to a hacky solution. Not sure it would work for you.

Unfortunately, PhantomJS is no longer supported by its author. So if you did find a bug with it, there might not be a good solution. Version 3 of HTTP Collector will support working transparently with every major browser for dynamic content (Chrome, Firefox, Edge, etc.). We are hoping it will make things easier. Before you ask though... there is no release date for it yet. ;-)

HappyCustomers commented 5 years ago

Sorry, one more question: how do I disable the proxy in the PhantomJS settings? The log says:

[DEBUG] Set "http" proxy to: "" : 1080
[DEBUG] 9 proxyType : "http"
[DEBUG] 10 proxy : ":1080"
[DEBUG] 11 proxyAuth : ":"

This is the error:

DEBUG - Unsupported HTTP Response: null
INFO - REJECTED_BAD_STATUS:

I have not set any proxy in the configuration.
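
To check what those proxy lines actually contain, a small log scan along these lines could be used. It is a sketch in Python with hypothetical helper names, and it assumes the DEBUG line format shown above. Reading an empty host (as in ":1080") as "no proxy host configured" is an interpretation of the log, not documented PhantomJS behaviour.

```python
import re

# Matches settings lines such as: [DEBUG] 10 proxy : ":1080"
SETTING_RE = re.compile(r'\[DEBUG\]\s+\d+\s+(proxy\w*)\s*:\s*"([^"]*)"')


def proxy_settings(log_text):
    """Collect the proxy-related settings echoed in the DEBUG log."""
    return {name: value for name, value in SETTING_RE.findall(log_text)}


def has_proxy_host(settings):
    """Return True when the logged proxy value has a non-empty host part."""
    host, _, _port = settings.get("proxy", "").partition(":")
    return bool(host)


log = (
    '[DEBUG] Set "http" proxy to: "" : 1080 '
    '[DEBUG] 9 proxyType : "http" '
    '[DEBUG] 10 proxy : ":1080" '
    '[DEBUG] 11 proxyAuth : ":"'
)
settings = proxy_settings(log)
print(settings)
print("proxy host present:", has_proxy_host(settings))
```

For the log quoted above, the host part is empty, which would be consistent with no proxy having been configured even though PhantomJS still prints the settings.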

essiembre commented 5 years ago

You mean the proxy set by HTTP Collector? It is optional, and by default no proxy is applied. The same applies when using PhantomJS on the command line, I believe: a proxy has to be set explicitly.

HappyCustomers commented 5 years ago

Pascal,

I have not set up any proxy in HTTP Collector, but PhantomJS is still showing the proxy settings in the log as above.

essiembre commented 5 years ago

Not sure where those come from. Can you share your full log file, in case more context may help?

HappyCustomers commented 5 years ago

Sorry for the delay in responding. I have sent you an email with the log and config file.

essiembre commented 5 years ago

I was able to reproduce the problem but could not find a solution. The "Operation canceled" error is pretty common among PhantomJS users, but very few are able to fix it. Maybe you can find a suggestion online that works for you. If not, given that PhantomJS development has stalled, I am afraid you will have to wait for version 3 of HTTP Collector. Or, if it is feasible for you, look at implementing your own IHttpDocumentFetcher that wraps a headless browser (or something else).
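
As a rough idea of what wrapping a headless browser could involve, here is a sketch of the shell-out only, written in Python for brevity. An actual IHttpDocumentFetcher would be a Java class, and its interface is not shown here. The sketch assumes headless Chrome with its `--dump-dom` option; the binary name `google-chrome` and the helper names are assumptions that will differ per system.

```python
import subprocess


def build_headless_chrome_command(url, chrome_bin="google-chrome"):
    """Return the argument list that prints the rendered DOM of a page."""
    return [chrome_bin, "--headless", "--disable-gpu", "--dump-dom", url]


def fetch_dom(url, chrome_bin="google-chrome", timeout=60):
    """Run headless Chrome and return the rendered DOM as a string."""
    result = subprocess.run(
        build_headless_chrome_command(url, chrome_bin),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # Raise CalledProcessError when the browser exits with a non-zero status.
    result.check_returncode()
    return result.stdout
```

A custom fetcher would run an equivalent command, or drive the browser through a library, and hand the returned HTML to the collector in place of the PhantomJS output.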