Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Norconex: WebDriverHttpFetcher HtmlUnit Driver Support #1036

Closed: shiveendrasiingh closed 3 weeks ago

shiveendrasiingh commented 3 months ago

Do you support the HtmlUnit driver? I was trying to find the configuration for it but could not. Is there anything I am missing?

essiembre commented 2 months ago

Hello @shiveendrasiingh, not sure if this reaches you too late, but out of the box, Chrome, Firefox, Edge, Safari, and Opera are supported; HtmlUnit is not.

You may be able to trick it by adding any required dependencies to the crawler classpath (e.g., the lib folder) and configuring it by first choosing a supported browser name, then overwriting all its properties to match HtmlUnit instead.

Something like this:

<fetcher class="WebDriverHttpFetcher">
   <!-- "browser" is required, but overwrite it by being explicit about other settings -->
   <browser>edge</browser>

   <browserPath>(HtmlUnit executable)</browserPath>
   <driverPath>(HtmlUnit WebDriver location)</driverPath>
   <capabilities>
     <capability name="..."><!-- anything required by HtmlUnit --></capability>
   </capabilities>
</fetcher>

You may be just fine with the above. If not, let us know and we can mark it as a feature request to support HtmlUnit.

stale[bot] commented 4 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.