Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Version 3.x -- Support for Remote Web Driver (Selenium Hub) #717

Closed ahaw022 closed 4 years ago

ahaw022 commented 4 years ago

Hi Pascal

Not sure how doable this is, but it would be great if we could have the ability to use remote web drivers (Selenium Hub) as well as local drivers.

We tend to run the Selenium drivers as clusters to make sure that we don't open too many web browsers on one VM.

https://www.selenium.dev/documentation/en/remote_webdriver/remote_webdriver_client/

Andrei

essiembre commented 4 years ago

Hello, I deployed a new snapshot that adds a <remoteURL> configuration option to the WebDriverHttpFetcher. When set, it will use a RemoteWebDriver instead of a local one. I do not have a remote cluster to test against, so it would be nice if you could confirm the new option works.
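Conceptually, the switch looks roughly like this on the Selenium side. This is a minimal sketch using Selenium's standard Java client, not the actual WebDriverHttpFetcher code; the DriverFactory/createDriver names are hypothetical, and remoteURL stands in for the new configuration value:

import java.net.URL;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;

public class DriverFactory {
    // "remoteURL" mirrors the new <remoteURL> option: when present, drive a
    // browser on a remote Selenium node instead of one on the local machine.
    static WebDriver createDriver(String remoteURL) throws Exception {
        ChromeOptions options = new ChromeOptions();
        if (remoteURL != null) {
            // e.g. http://my-selenium-hub:4444/wd/hub
            return new RemoteWebDriver(new URL(remoteURL), options);
        }
        return new ChromeDriver(options);
    }
}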

ahaw022 commented 4 years ago

Hi Pascal

Looking Good!!!

A few things I picked up.

A) Using the config test tool, I was getting errors about where to put the fetcher. I finally figured out it needs to go between the <httpFetchers> tags. So that tool is working well.

<httpFetchers>
  <fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
    <browser>chrome</browser>
    <remoteURL>{{your selenium cluster url and port}}/wd/hub</remoteURL>
    <!-- If running in Docker, refer to how to add drivers/hub here:
         https://github.com/SeleniumHQ/docker-selenium -->
  </fetcher>
</httpFetchers>

B) I believe that for remote configs you will need to add the port and the /wd/hub path, otherwise it won't work, e.g. url:4444/wd/hub. Typically Selenium hubs run on port 4444.

C) The easiest way for people to test locally is to use a Docker setup. The Selenium project has very good documentation on this here: https://github.com/SeleniumHQ/docker-selenium
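As a quick sanity check that the hub is reachable before pointing the crawler at it, something like the following works. This is a hedged sketch assuming Selenium's standard Java client and the default localhost:4444 endpoint of the docker-selenium images; the class name HubSmokeTest is just for illustration, and the test page is the one from this thread:

import java.net.URL;

import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;

public class HubSmokeTest {
    public static void main(String[] args) throws Exception {
        // Default hub/standalone endpoint: port 4444 plus the /wd/hub path.
        RemoteWebDriver driver = new RemoteWebDriver(
                new URL("http://localhost:4444/wd/hub"), new ChromeOptions());
        try {
            driver.get("https://opensource.norconex.com/collectors/http/test/complex1");
            System.out.println("Fetched: " + driver.getTitle());
        } finally {
            driver.quit(); // always release the session back to the grid
        }
    }
}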

Below is a screenshot of our Selenium cluster showing two Chrome sessions being used 👍

[screenshot]

Logs of Chrome Session Creation:

[screenshot]

Confirmation of content being brought back using your test site:

[screenshot]

I have included the baseline config below (minus the URL of the Selenium driver) for others to reference if needed. It is based on your simple example but adds the Selenium remote fetcher.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<!-- 
   Copyright 2010-2020 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- This configuration shows the minimum required and basic recommendations
     to run a crawler.  
     -->
<httpcollector id="Minimum Config HTTP Collector">

  <!-- Decide where to store generated files. -->
  <workDir>./examples-output/complex</workDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://opensource.norconex.com/collectors/http/test/complex1</url>
      </startURLs>
      <httpFetchers>
        <fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
          <browser>chrome</browser>
          <remoteURL>{{your selenium cluster url and port}}/wd/hub</remoteURL>
          <!-- If running in Docker, refer to how to add drivers/hub here:
               https://github.com/SeleniumHQ/docker-selenium -->
        </fetcher>
      </httpFetchers>

      <!-- === Recommendations: ============================================ -->

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>10</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolver ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="1 seconds" />

      <!-- Document importing -->
      <importer>

        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <handler class="KeepOnlyTagger">
            <fieldMatcher method="csv">title,keywords,description,document.reference</fieldMatcher>      
          </handler>
        </postParseHandlers>
      </importer> 

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committers>
        <committer class="XMLFileCommitter">
          <indent>4</indent>
        </committer>
      </committers>

    </crawler>
  </crawlers>

</httpcollector>