Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

No crawling occurs if the URL redirects to another page #281

Closed: doaa-khaled closed this issue 8 years ago

doaa-khaled commented 8 years ago

I insert URLs without specifying the home page. For example:

https://en.wikipedia.org/

instead of:

https://en.wikipedia.org/wiki/Main_Page

In the first case nothing gets crawled, but the second works fine. The problem is that I have a huge number of URLs written the first way. Is there a way to configure the crawler to accept the first form?

essiembre commented 8 years ago

It should work. In many cases where it does not, it is because you have rules that exclude those URLs. While that is not shown in the example you provided, one frequent case is a home page that redirects to a URL with a different domain or protocol while you have stayOnDomain and/or stayOnProtocol set to true. You can relax those settings and use your own filters instead, as sketched below.
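
For example, a minimal sketch of that approach: the stayOn* flags are turned off and a RegexReferenceFilter scopes the crawl instead (the pattern below is illustrative; adjust it to your own domains):

<startURLs stayOnDomain="false" stayOnPort="false" stayOnProtocol="false">
  <url>https://en.wikipedia.org/</url>
</startURLs>
<referenceFilters>
  <!-- Illustrative: keep only wikipedia.org URLs, regardless of protocol or subdomain. -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
      onMatch="include">https?://([^/]+\.)?wikipedia\.org(/.*)?</filter>
</referenceFilters>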

In the case of the Wikipedia URL above, I suspect you have other settings excluding it. It works when I reproduce it with just the following XML:

<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
  <url>https://en.wikipedia.org/</url>
</startURLs> 
doaa-khaled commented 8 years ago

This is my configuration, can you check it please?

<?xml version="1.0" encoding="UTF-8"?><httpcollector id="kcn">#set($workdir = "D:\crawler-ouput")<progressDir>$workdir\progress</progressDir><logsDir>$workdir\logs</logsDir><crawlers><crawler id="Kycon7"><startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true"><url>https://en.wikipedia.org</url></startURLs><workDir>$workdir</workDir><delay default="30000" /><numThreads>15</numThreads><maxDepth>-1</maxDepth><maxDocuments>-1</maxDocuments><keepDownloads>false</keepDownloads><listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener"><outputDir>$workdir</outputDir></listener><crawlDataStoreFactory class="com.norconex.collector.core.data.store.impl.mvstore.MVStoreCrawlDataStoreFactory" /><referenceFilters><filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude" caseSensitive="false" >jpg,gif,png,ico,css,js</filter></referenceFilters><robotsTxt ignore="true" /><robotsMeta ignore="true" /><sitemapResolverFactory ignore="true" /><linkExtractors><extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor" keepReferrerData="true"/></linkExtractors><documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher"><validStatusCodes>200</validStatusCodes><notFoundStatusCodes>404</notFoundStatusCodes></documentFetcher><metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" validStatusCodes="200" /><importer><preParseHandlers><filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter" onMatch="include" field="document.reference">.*(pdf|xls|xlsx|doc|docx|ppt|pptx)$</filter></preParseHandlers><postParseHandlers><tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"><fields>title,keywords,description,document.reference,document.contentType,collector.referrer-reference</fields></tagger></postParseHandlers></importer><committer class="com.norconex.committer.core.impl.FileSystemCommitter"><directory>$workdir\crawledFiles\KCN7</directory></committer></crawler>
essiembre commented 8 years ago

You have two options to make this work:

  1. Remove or comment out your <metadataFetcher ...> entry.
  2. Add 301 (redirect) to the valid status codes of your metadata fetcher, like this:
  <metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher">
      <validStatusCodes>200,301</validStatusCodes>
  </metadataFetcher>

Redirects are currently not followed when relying on the metadata fetcher to find out whether a document is valid or not (the document gets rejected and does not proceed further). Either of the above solutions fixes this.
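
If you prefer to keep the attribute form used in your configuration above, the equivalent change would presumably be (assuming the validStatusCodes attribute accepts the same comma-separated list as the nested element):

<metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" validStatusCodes="200,301" />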

I am marking this as a feature request to support redirects (301) when using a metadata fetcher so that you do not have to declare 301 as a valid status code.

doaa-khaled commented 8 years ago

Yes, I tried this configuration and now it works. Many thanks :)

essiembre commented 8 years ago

The latest snapshot resolves this, so you no longer have to specify 301 as a valid status code or remove the metadataFetcher entry.