Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

include out of domain links only when referrer is domain #786

Closed. caesetia closed this issue 2 years ago.

caesetia commented 2 years ago

Hi, I'm trying to crawl my site's domain at infinite depth, plus any direct out-of-domain links found on its pages, but go no further than that.

I found this snippet in a previous question that is supposed to include an out-of-domain link only when the referrer is on the domain, but it's not working for me:

<documentFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
          onMatch="include">
    https://domain.domain.com/.*
  </filter>
  <filter class="com.norconex.collector.core.filter.impl.RegexMetadataFilter"
          field="collector.referrer-reference" onMatch="include">
    https://domain.domain.com/.*
  </filter>
</documentFilters>

I'm on Norconex version 2.8.1; here is my configuration. It's an internal site, unfortunately. Thanks for any help you can provide.

<crawlerListeners>
  <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
    <statusCodes></statusCodes>
    <outputDir>./report</outputDir>
    <fileNamePrefix>brokenLinks</fileNamePrefix>
  </listener>
</crawlerListeners>

<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
    .*/online-registration/.*
  </filter>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
    .*/seminar_videos/.*
  </filter>
</referenceFilters>

<documentFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
          onMatch="include">
    https://domain.domain.com/.*
  </filter>
  <filter class="com.norconex.collector.core.filter.impl.RegexMetadataFilter"
          field="collector.referrer-reference" onMatch="include">
    https://domain.domain.com/.*
  </filter>
</documentFilters>

<!-- Document importing -->
<importer>
  <preParseHandlers>
    <filter class="com.norconex.collector.core.filter.impl.RegexMetadataFilter"
            field="document.reference" onMatch="include">
      https://domain.domain.com/.*
    </filter>
  </preParseHandlers>
  <postParseHandlers>
    <!-- If your target repository does not support arbitrary fields,
         make sure you only keep the fields you need. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
      <fields>title,text,description,document.reference,id,extension,Content-Type,content_type,filetype,collector.referrer-reference</fields>
    </tagger>
  </postParseHandlers>
</importer>
essiembre commented 2 years ago

Can you please share your full config? What does your <startURLs> tag look like? If you have any of the stayOnXXX flags set to true, external sites won't be crawled regardless of your filters.
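
For reference, those flags live on the startURLs element, something like this (a sketch only; the hostname is a placeholder):

<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
  <url>https://domain.domain.com/</url>
</startURLs>

With any of those set to "true", links leading off the start URL's domain, port, or protocol are rejected no matter what your other filters say.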

caesetia commented 2 years ago

Thanks for getting back to me. I switched all the stayOnXXX flags to false, and that did produce the index I wanted, but the crawler still goes through the external sites and rejects each of their individual pages. I'd like it to stop crawling once it has included that initial external page.

<httpcollector id="Minimum Config HTTP Collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/file-404/progress</progressDir>
  <logsDir>./examples-output/file-404/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs stayOnDomain="false" stayOnPort="false" stayOnProtocol="false">
        <url>https://domain.domain.com/</url>
      </startURLs>

      <httpClientFactory>
        <authMethod>basic</authMethod>
        <authUsername>user</authUsername>
        <authPassword>password</authPassword>
      </httpClientFactory>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./examples-output/file-404</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>5</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="2000" />

      <crawlerListeners>
        <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
          <statusCodes></statusCodes>
          <outputDir>./report</outputDir>
          <fileNamePrefix>brokenLinks</fileNamePrefix>
        </listener>
      </crawlerListeners>

      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
          .*/online-registration/.*
        </filter>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
          .*/seminar_videos/.*
        </filter>
      </referenceFilters>

      <documentFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
                onMatch="include">
          https://domain.domain.com/.*
        </filter>
        <filter class="com.norconex.collector.core.filter.impl.RegexMetadataFilter"
                field="collector.referrer-reference" onMatch="include">
          https://domain.domain.com/.*
        </filter>
      </documentFilters>
      <!-- Document importing -->
      <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,text,description,document.reference,id,extension,Content-Type,content_type,filetype,collector.referrer-reference</fields>
          </tagger>
        </postParseHandlers>
      </importer> 

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.solr.SolrCommitter">
        <solrURL>http://url/solr/file_404</solrURL>
        <sourceReferenceField keep="false">document.reference</sourceReferenceField>
        <targetReferenceField>id</targetReferenceField>
        <targetContentField>text</targetContentField>
        <commitBatchSize>100</commitBatchSize>
        <queueSize>100</queueSize>
        <maxRetries>2</maxRetries>
        <maxRetryWait>5000</maxRetryWait>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>
essiembre commented 2 years ago

I tested with your config and version 2.9.1 and the following worked for me:

  <metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher"/>
  <metadataFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
      https://domain.domain.com/.*
    </filter>
    <filter class="com.norconex.collector.core.filter.impl.RegexMetadataFilter" field="collector.referrer-reference" onMatch="include">
      https://domain.domain.com/.*
    </filter>      
  </metadataFilters>

I replaced your document filters with metadata filters so that rejected documents are not downloaded and no further URLs are extracted from them (URL extraction happens before document filters are applied).

To get the speed benefit, you also need to add the metadata fetcher entry; otherwise the metadata filters are only processed after the document has been downloaded. With the metadata fetcher in place, the crawler issues a HEAD request first and discards rejected documents before they are downloaded.
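
For clarity, here is a rough sketch of how this could fit into the crawler section of the config above (unchanged parts are elided in comments, the old documentFilters block goes away, and the exact placement shown is indicative only):

      <crawler id="Norconex Minimum Test Page">
        <!-- ... startURLs, httpClientFactory, workDir, maxDepth, sitemap setting,
             delay, crawlerListeners, and referenceFilters as before ... -->

        <!-- HEAD request first, so rejected pages are never downloaded. -->
        <metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher"/>

        <!-- Keep pages on the domain, plus pages whose referrer is on the domain. -->
        <metadataFilters>
          <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
            https://domain.domain.com/.*
          </filter>
          <filter class="com.norconex.collector.core.filter.impl.RegexMetadataFilter" field="collector.referrer-reference" onMatch="include">
            https://domain.domain.com/.*
          </filter>
        </metadataFilters>

        <!-- documentFilters block removed; importer and committer as before. -->
      </crawler>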

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.