Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Why `REJECTED_REDIRECTED` information messages when redirections are allowed? #657

Closed LeMoussel closed 4 years ago

LeMoussel commented 4 years ago

I tested redirection with https://httpbin.davecheney.com/redirect/3. I added the following to the <crawler> config:

      <httpClientFactory>
        <maxRedirects>10</maxRedirects>
      </httpClientFactory>
      <documentFetcher>
        <validStatusCodes>200,302</validStatusCodes>
      </documentFetcher> 

I get:

      INFO [AbstractCrawlerConfig] Crawler event listener loaded: LinkCheckerCrawlerEventListener@54a67a45
      ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) GenericDocumentFetcher: cvc-complex-type.4: Attribute 'class' must appear on element 'documentFetcher'.
      INFO [AbstractCollectorConfig] Configuration loaded: id=Minimum Config HTTP Collector; logsDir=./examples-output/minimum/logs; progressDir=./examples-output/minimum/progress
      INFO [JobSuite] JEF work directory is: .\examples-output\minimum\progress
      INFO [JobSuite] JEF log manager is : FileLogManager
      INFO [JobSuite] JEF job status store is : FileJobStatusStore
      INFO [AbstractCollector] Suite of 1 crawler jobs created.
      INFO [JobSuite] Initialization...
      INFO [JobSuite] Previous execution detected.
      INFO [JobSuite] Backing up previous execution status and log files.
      INFO [JobSuite] Starting execution.
      INFO [AbstractCollector] Version: Norconex HTTP Collector 2.9.0 (Norconex Inc.)
      INFO [AbstractCollector] Version: Norconex Collector Core 1.10.0 (Norconex Inc.)
      INFO [AbstractCollector] Version: Norconex Importer 2.10.0 (Norconex Inc.)
      INFO [AbstractCollector] Version: Norconex JEF 4.1.2 (Norconex Inc.)
      INFO [AbstractCollector] Version: Norconex Committer Core 2.1.3 (Norconex Inc.)
      INFO [JobSuite] Running Norconex Test: BEGIN (Fri Jan 24 13:37:03 CET 2020)
      INFO [HttpCrawler] Norconex Test: RobotsTxt support: true
      INFO [HttpCrawler] Norconex Test: RobotsMeta support: true
      INFO [HttpCrawler] Norconex Test: Sitemap support: false
      INFO [HttpCrawler] Norconex Test: Canonical links support: true
      INFO [HttpCrawler] Norconex Test: User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)
      INFO [SitemapStore] Norconex Test: Initializing sitemap store...
      INFO [SitemapStore] Norconex Test: Done initializing sitemap store.
      INFO [HttpCrawler] 1 start URLs identified.
      INFO [CrawlerEventManager] CRAWLER_STARTED
      INFO [AbstractCrawler] Norconex Test: Crawling references...
      INFO [CrawlerEventManager] REJECTED_REDIRECTED: https://httpbin.davecheney.com/redirect/3 (HttpFetchResponse [crawlState=REDIRECT, statusCode=302, reasonPhrase=Found (https://httpbin.davecheney.com/relative-redirect/2)])
      INFO [CrawlerEventManager] REJECTED_REDIRECTED: https://httpbin.davecheney.com/relative-redirect/2 (HttpFetchResponse [crawlState=REDIRECT, statusCode=302, reasonPhrase=Found (https://httpbin.davecheney.com/relative-redirect/1)])
      INFO [CrawlerEventManager] REJECTED_REDIRECTED: https://httpbin.davecheney.com/relative-redirect/1 (HttpFetchResponse [crawlState=REDIRECT, statusCode=302, reasonPhrase=Found (https://httpbin.davecheney.com/get)])
      INFO [CrawlerEventManager] DOCUMENT_FETCHED: https://httpbin.davecheney.com/get
      INFO [CrawlerEventManager] CREATED_ROBOTS_META: https://httpbin.davecheney.com/get
      INFO [DebugTagger] date=Fri, 24 Jan 2020 12:37:06 GMT
      INFO [DebugTagger] collector.content-type=application/json
      INFO [DebugTagger] server=envoy
      INFO [DebugTagger] x-envoy-upstream-service-time=1
      INFO [DebugTagger] document.contentFamily=sourcecode
      INFO [DebugTagger] X-Parsed-By=org.apache.tika.parser.DefaultParser, org.apache.tika.parser.txt.TXTParser
      INFO [DebugTagger] transfer-encoding=chunked
      INFO [DebugTagger] collector.redirect-trail=https://httpbin.davecheney.com/redirect/3, https://httpbin.davecheney.com/relative-redirect/2, https://httpbin.davecheney.com/relative-redirect/1
      INFO [DebugTagger] vary=Accept-Encoding
      INFO [DebugTagger] document.reference=https://httpbin.davecheney.com/get
      INFO [DebugTagger] access-control-allow-origin=*
      INFO [DebugTagger] access-control-allow-credentials=true
      INFO [DebugTagger] collector.is-crawl-new=false
      INFO [DebugTagger] document.contentType=application/json
      INFO [DebugTagger] Content-Encoding=ISO-8859-1
      INFO [DebugTagger] content-type=application/json, application/json; charset=ISO-8859-1
      INFO [DebugTagger] collector.depth=0
      INFO [DebugTagger] Content-Length=450
      INFO [CrawlerEventManager] DOCUMENT_IMPORTED: https://httpbin.davecheney.com/get
      INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: https://httpbin.davecheney.com/get
      INFO [AbstractCrawler] Norconex Test: Reprocessing any cached/orphan references...
      INFO [AbstractCrawler] Norconex Test: Crawler finishing: committing documents.
      INFO [AbstractCrawler] Norconex Test: 4 reference(s) processed.
      INFO [CrawlerEventManager] CRAWLER_FINISHED
      INFO [AbstractCrawler] Norconex Test: Crawler completed.
      INFO [AbstractCrawler] Norconex Test: Crawler executed in 4 secondes.
      INFO [SitemapStore] Norconex Test: Closing sitemap store...
      INFO [JobSuite] Running Norconex Test: END (Fri Jan 24 13:37:03 CET 2020)

That part works: I get 3 entries in collector.redirect-trail. But why do we get the 3 REJECTED_REDIRECTED information messages?

essiembre commented 4 years ago

Upon encountering a redirect, the crawler "rejects" the original URL and stores the target URL for processing. In the end (unless you are filtering it out), only the final URL should be committed.
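
For illustration, the behavior can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not the collector's actual code: the `crawl` function and the `REDIRECTS` table are hypothetical stand-ins for real HTTP fetches, and only the URLs and event names are taken from the log in the question.

```python
from collections import deque

BASE = "https://httpbin.davecheney.com"

# Hypothetical redirect table standing in for real 302 responses.
REDIRECTS = {
    BASE + "/redirect/3": BASE + "/relative-redirect/2",
    BASE + "/relative-redirect/2": BASE + "/relative-redirect/1",
    BASE + "/relative-redirect/1": BASE + "/get",
}


def crawl(start_url, redirects):
    """Simulate a crawl: reject redirected URLs, commit only the final one."""
    queue = deque([(start_url, [])])  # (url, redirect trail so far)
    seen = {start_url}                # guards against redirect loops
    events = []                       # (event name, url) in processing order
    committed = {}                    # committed url -> its redirect trail

    while queue:
        url, trail = queue.popleft()
        target = redirects.get(url)
        if target is not None:
            # The source URL is "rejected"; its target is queued instead.
            events.append(("REJECTED_REDIRECTED", url))
            if target not in seen:
                seen.add(target)
                queue.append((target, trail + [url]))
        else:
            # No redirect: this is the document that ends up committed.
            events.append(("DOCUMENT_FETCHED", url))
            events.append(("DOCUMENT_COMMITTED_ADD", url))
            committed[url] = trail

    return events, committed


if __name__ == "__main__":
    events, committed = crawl(BASE + "/redirect/3", REDIRECTS)
    for name, url in events:
        print(name, url)
    print(committed)
```

With the table above, each of the three redirecting URLs produces one REJECTED_REDIRECTED event, and only the final `/get` URL is committed, carrying the three rejected URLs as its trail. The rejections are bookkeeping for URLs that were replaced by their targets, not failures.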

LeMoussel commented 4 years ago

OK. Thank you for your support.