Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

New additions to robots.txt file are not getting deleted by Norconex crawler #643

Closed: alok-gupta-sada closed this issue 2 years ago

alok-gupta-sada commented 4 years ago

@essiembre

Hi Pascal

Norconex version: 2.9.0-SNAPSHOT

We have encountered an issue where new additions to the robots.txt file are not honored by the Norconex crawler: the newly disallowed documents are not being removed from the index. I am using the Norconex 2.9.0-SNAPSHOT version. The initial run did honor the robots.txt file and rejected the documents specified in it.

 #initial robots.txt
  User-agent: *
  Disallow: /pdfs/holidays2015.pdf
  Disallow: /pdfs/holidays2016.pdf
  Disallow: /pdfs/holidays2017.pdf
  Disallow: /pdfs/holidays2018.pdf

Later, we added 2 more disallows to the robots.txt file:

 #robots.txt
  User-agent: *
  Disallow: /pdfs/holidays2015.pdf
  Disallow: /pdfs/holidays2016.pdf
  Disallow: /pdfs/holidays2017.pdf
  Disallow: /pdfs/holidays2018.pdf
  Disallow: /pdfs/test_annual_report_2011.pdf
  Disallow: /pdfs/test_annual_report_2012.pdf

The crawler logs indicated these 2 documents were not modified and the crawler ignored them.

    INFO  (CrawlerEventManager.java:67) -          DOCUMENT_FETCHED: http://edev.test.com/pdfs/test_annual_report_2011.pdf
    INFO  (CrawlerEventManager.java:67) -       CREATED_ROBOTS_META: http://edev.test.com/pdfs/test_annual_report_2011.pdf
    INFO  (CrawlerEventManager.java:67) -       REJECTED_UNMODIFIED: http://edev.test.com/pdfs/test_annual_report_2011.pdf
    INFO  (CrawlerEventManager.java:67) -          DOCUMENT_FETCHED: http://edev.test.com/pdfs/test_annual_report_2012.pdf
    INFO  (CrawlerEventManager.java:67) -       CREATED_ROBOTS_META: http://edev.test.com/pdfs/test_annual_report_2012.pdf
    INFO  (CrawlerEventManager.java:67) -       REJECTED_UNMODIFIED: http://edev.test.com/pdfs/test_annual_report_2012.pdf

Norconex configuration:

 <crawlers>
    <crawler id="Test Intranet Page">
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>http://edev.test.com/</url>
      </startURLs>
      <maxDepth>-1</maxDepth>
      <maxDocuments>-1</maxDocuments>
      <numThreads>2</numThreads>
      <sitemapResolverFactory ignore="true" />
      <robotsTxt ignore="false" />
      <orphansStrategy>DELETE</orphansStrategy>
      <delay default="10" />
      <userAgent>gcs-norconex-crawler</userAgent>

      <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
        <normalizations>
            removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence, decodeUnreservedCharacters, removeDefaultPort, encodeNonURICharacters, removeDotSegments, addDomainTrailingSlash, removeTrailingHash
        </normalizations>
      </urlNormalizer>

      <importer>
        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
                <singleValue field="rawContent" action="keepFirst"/>
          </tagger>
          <tagger class="com.norconex.importer.handler.tagger.impl.CopyTagger">
            <copy fromField="collector.referenced-urls" toField="referencedurls" overwrite="false" />
          </tagger>
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
              <fields>title,keywords,description,document.contentType,document.reference,referencedurls,rawContent</fields>
          </tagger>
        </postParseHandlers>    
      </importer> 

      <committer class="com.sada.norconex.SADACommitter">
        ...
      </committer>

    </crawler>
  </crawlers>
essiembre commented 4 years ago

Rejected documents are not sent for deletion; only documents that no longer exist, or "orphan" ones, are. You might think your <orphansStrategy>DELETE</orphansStrategy> would do it, but orphans are URLs that are no longer being referenced. In your case, those URLs are still being referenced, simply rejected (i.e., they will not be sent to your search engine).
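
For illustration, the orphan strategies are set in the crawler configuration as shown below (the value names are from memory of the 2.x configuration reference, so treat them as assumptions):

 <!-- Orphan handling (assumed values): PROCESS (default) re-crawls orphans,
      IGNORE skips them, DELETE sends deletion requests for them. None of
      these apply to URLs that are still referenced but merely rejected. -->
 <orphansStrategy>DELETE</orphansStrategy>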

This is related to feature request #211. If you think it is the same request, I will mark this one as a duplicate.

alok-gupta-sada commented 4 years ago

Thanks for looking into this issue.

My understanding is that the HTTP collector will honor the robots.txt entries.

The crawler should always check a document against the updated robots.txt entries, irrespective of its modification state, and send a delete call for documents that are already indexed. The crawler should log a REJECTED_ROBOTS_TXT event instead of REJECTED_UNMODIFIED.

I have a couple of questions:

  1. Is the robots.txt check performed before fetching the document for content modification?
  2. Is the robots.txt match done against the original URL or the normalized URL?

/ Alok

essiembre commented 4 years ago

Right now, URLs rejected by robots.txt are simply not processed at all (they are ignored/skipped). This greatly improves performance for many crawls. Comparing every URL against URLs previously crawled would mean querying the crawl store for every URL encountered, just in case there was a change to the robots.txt. That is currently not offered out of the box. I will mark this as a feature request to provide the option to do so.

To answer your questions:

  1. Is the robots.txt check performed before fetching the document for content modification?

Yes, before. A document will not be downloaded if rejected by robots.txt (default behavior).

  2. Is the robots.txt match done against the original URL or the normalized URL?

I suggest you refer to the following flow diagram to get a better understanding of the task execution order: https://www.norconex.com/collectors/collector-http/flow

alok-gupta-sada commented 4 years ago

Thanks for your response and for marking it as a feature request.

/ Alok

essiembre commented 2 years ago

With version 3.0.0 it is now possible to send deletion requests to your Committer(s) by listening to rejection events with DeleteRejectedEventListener. Have a look at https://github.com/Norconex/collector-http/issues/211#issuecomment-927245122 for an example.
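
For convenience, a minimal sketch of what that could look like in a v3 crawler configuration is shown below. The element placement, the short class name, and the REJECTED_ROBOTS_TXT event name are assumptions on my part; the linked comment and the v3 documentation have the authoritative example.

 <crawler id="Test Intranet Page">
   ...
   <!-- Sketch only: turn selected rejection events into deletion requests
        that are sent to the configured Committer(s). Event names are assumed. -->
   <eventListeners>
     <listener class="DeleteRejectedEventListener">
       <eventMatcher method="csv">REJECTED_ROBOTS_TXT,REJECTED_NOTFOUND,REJECTED_FILTER</eventMatcher>
     </listener>
   </eventListeners>
 </crawler>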