Closed doaa-khaled closed 8 years ago
It should work. In several cases where it does not work, it is because you have rules that exclude them. While that is not the example you provided, one frequent case is a home page that redirects to a URL with a different domain or protocol while you have `stayOnDomain` and/or `stayOnProtocol` set to true. You can relax those settings and use your own filters instead.
In the case of Wikipedia above, I suspect you have settings excluding them. It works when I reproduce it with just the XML you provided:
<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
  <url>https://en.wikipedia.org/</url>
</startURLs>
This is my configuration, can you check it please?
<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="kcn">
  #set($workdir = "D:\crawler-ouput")
  <progressDir>$workdir\progress</progressDir>
  <logsDir>$workdir\logs</logsDir>
  <crawlers>
    <crawler id="Kycon7">
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://en.wikipedia.org</url>
      </startURLs>
      <workDir>$workdir</workDir>
      <delay default="30000" />
      <numThreads>15</numThreads>
      <maxDepth>-1</maxDepth>
      <maxDocuments>-1</maxDocuments>
      <keepDownloads>false</keepDownloads>
      <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
        <outputDir>$workdir</outputDir>
      </listener>
      <crawlDataStoreFactory class="com.norconex.collector.core.data.store.impl.mvstore.MVStoreCrawlDataStoreFactory" />
      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
                onMatch="exclude" caseSensitive="false">jpg,gif,png,ico,css,js</filter>
      </referenceFilters>
      <robotsTxt ignore="true" />
      <robotsMeta ignore="true" />
      <sitemapResolverFactory ignore="true" />
      <linkExtractors>
        <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor" keepReferrerData="true" />
      </linkExtractors>
      <documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher">
        <validStatusCodes>200</validStatusCodes>
        <notFoundStatusCodes>404</notFoundStatusCodes>
      </documentFetcher>
      <metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" validStatusCodes="200" />
      <importer>
        <preParseHandlers>
          <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
                  onMatch="include" field="document.reference">.*(pdf|xls|xlsx|doc|docx|ppt|pptx)$</filter>
        </preParseHandlers>
        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference,document.contentType,collector.referrer-reference</fields>
          </tagger>
        </postParseHandlers>
      </importer>
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>$workdir\crawledFiles\KCN7</directory>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>
You have two options to make this work: remove the <metadataFetcher ..> entry, or add 301 as a valid status code:

<metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher">
  <validStatusCodes>200,301</validStatusCodes>
</metadataFetcher>
Redirects are currently not followed when relying on a metadata fetcher to find out whether a document is valid (the document gets rejected and does not proceed further). Either of the above solutions fixes this.
I am marking this as a feature request to support redirects (301) when using a metadata fetcher so that you do not have to declare 301 as a valid status code.
Yes, I tried this configuration and now it works. Many thanks :)
The latest snapshot resolves this, so you no longer have to specify 301 as a valid status code or remove the metadataFetcher entry.
I used to insert URLs without writing the home page, for example: ` ` instead of doing ` `. In the first case it doesn't crawl, but in the second it works fine. The problem is that I have a huge number of URLs written the first way. Is there a way in the configuration to accept the first form?
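Since the example URLs did not survive in this thread, the exact difference between the two forms is unclear; but if it comes down to URL formatting (for instance a missing trailing slash on the home page, which is only a guess here), the collector's `GenericURLNormalizer` can rewrite references into a canonical form before they are crawled. A sketch, where the chosen normalizations are assumptions to adapt to the actual difference:

```xml
<!-- Sketch only: the normalizations listed are example assumptions;
     pick the ones matching the real difference between your URL forms. -->
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>
    lowerCaseSchemeHost, removeDefaultPort, addTrailingSlash
  </normalizations>
</urlNormalizer>
```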