Thanks for the good words!
About the URL normalization ClientProtocolException, it should work as you expect. I tried with both http://example.com and http://www.etools.ch/sitemap_index.xml and they both worked. Can you share the exact URL causing the issue? That would help me reproduce it. A copy of your config would also be nice.
About http vs https, you can use the "secureScheme" normalization rule and it will convert all http into https.
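For example, you could add it to the list you already have (a trimmed-down sketch; keep whatever other rules you need):

<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>secureScheme, lowerCaseSchemeHost, removeDefaultPort, removeFragment, removeSessionIds</normalizations>
</urlNormalizer>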
About the URLs extracted, you are right that only those matching your stayOnDomain="true" setting will be recorded. The stayOn[...] attributes exist mostly for convenience. If you need more flexibility, I recommend using reference filters instead to stay on your site, such as the RegexReferenceFilter. That way all the external URLs will be stored the way you want, since reference filters are triggered after the URL extraction step, which is what stores the extracted URLs.
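For example, something along these lines (just a sketch using your domain; adjust the regex to your needs):

<startURLs stayOnDomain="false" stayOnPort="false" stayOnProtocol="false">
  <url>http://www.etools.ch</url>
</startURLs>
...
<referenceFilters>
  <!-- Keep the crawl on your own domain. External URLs are still
       extracted first (and visible to a URLS_EXTRACTED listener) and
       only rejected by this filter afterwards. -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
      onMatch="include" caseSensitive="false">https?://([^/]+\.)?etools\.ch(/.*)?</filter>
</referenceFilters>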
Thank you for your support!
A part of the configuration is done in code (mainly the directories and the seed URLs), but I added the relevant start URL to the crawler for illustration purposes; please see below:
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright 2010-2014 Norconex Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- This self-documented configuration file is meant to be used as a reference
or starting point for a new configuration.
It contains all core features offered in this release. Sometimes
multiple implementations are available for a given feature. Refer
to site documentation for more options and complete description of
each feature.
-->
<httpcollector id="test collector">
<!-- Variables: Optionally define variables in this configuration file
using the "set" directive, or by using a file of the same name
but with the extension ".variables" or ".properties". Refer
to site documentation to find out what each extension does.
Finally, one can pass an optional properties file when starting the
crawler. The following is good practice to reference frequently
used classes in a shorter way.
-->
#set($core = "com.norconex.collector.core")
#set($http = "com.norconex.collector.http")
#set($committer = "com.norconex.committer")
#set($importer = "com.norconex.importer")
#set($httpClientFactory = "${http}.client.impl.GenericHttpClientFactory")
#set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
#set($filterRegexRef = "${core}.filter.impl.RegexReferenceFilter")
#set($filterRegexMeta = "${core}.filter.impl.RegexMetadataFilter")
#set($urlFilter = "${http}.filter.impl.RegexURLFilter")
#set($robotsTxt = "${http}.robot.impl.StandardRobotsTxtProvider")
#set($robotsMeta = "${http}.robot.impl.StandardRobotsMetaProvider")
#set($metaFetcher = "${http}.fetch.impl.GenericMetadataFetcher")
#set($docFetcher = "${http}.fetch.impl.GenericDocumentFetcher")
#set($linkExtractor = "${http}.url.impl.GenericLinkExtractor")
#set($canonLinkDetector = "${http}.url.impl.GenericCanonicalLinkDetector")
#set($urlNormalizer = "${http}.url.impl.GenericURLNormalizer")
#set($sitemapFactory = "${http}.sitemap.impl.StandardSitemapResolverFactory")
#set($metaChecksummer = "${http}.checksum.impl.LastModifiedMetadataChecksummer")
#set($docChecksummer = "${core}.checksum.impl.MD5DocumentChecksummer")
#set($dataStoreFactory = "${core}.data.store.impl.mapdb.MapDBCrawlDataStoreFactory")
#set($spoiledStrategy = "${core}.spoil.impl.GenericSpoiledReferenceStrategizer")
<!-- Location where internal progress files are stored. -->
<!-- <progressDir>defined/in/code</progressDir>-->
<!-- Location where logs are stored. -->
<!-- <logsDir>defined/in/code</logsDir>-->
<!-- All crawler configuration options can be specified as default
(including start URLs). Settings defined here will be inherited by
all individual crawlers defined further down, unless overwritten.
If you replace a top level crawler tag from the crawler defaults,
all the default tag configuration settings will be replaced; no
attempt will be made to merge or append.
-->
<crawlerDefaults>
<!-- Mandatory starting URL(s) where crawling begins. If you put more
than one URL, they will be processed together. You can also
point to one or more URLs files (i.e., seed lists). -->
<startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
<!-- <url>defined/in/code</url> -->
<!-- <urlsFile>defined/in/code</urlsFile> -->
</startURLs>
<!-- Identify yourself to sites you crawl. It sets the "User-Agent" HTTP
request header value. This is how browsers identify themselves for
instance. Sometimes required to be certain values for robots.txt
files.
-->
<userAgent>Mozilla/5.0 (compatible; TestCrawler/0.1; +http://www.comcepta.com)</userAgent>
<!-- Optional URL normalization feature. The class must implement
com.norconex.collector.http.url.IURLNormalizer,
like the following class does.
-->
<urlNormalizer class="$urlNormalizer">
<normalizations>lowerCaseSchemeHost, upperCaseEscapeSequence, decodeUnreservedCharacters, removeDefaultPort, removeFragment, removeDotSegments, addTrailingSlash, removeDuplicateSlashes, removeSessionIds, upperCaseEscapeSequence</normalizations>
<replacements>
<replace>
<match>&amp;view=print</match>
<replacement>&amp;view=html</replacement>
</replace>
</replacements>
</urlNormalizer>
<!-- Optional delay resolver defining how polite or aggressive you want
your crawling to be. The class must implement
com.norconex.collector.http.delay.IDelayResolver.
The following is the default implementation:
scope="[crawler|site|thread]
-->
<delay default="1000" ignoreRobotsCrawlDelay="false" scope="site" class="${http}.delay.impl.GenericDelayResolver">
<!-- <schedule dayOfWeek="from Monday to Friday" time="from 8:00 to 6:30">5000</schedule> -->
</delay>
<!-- How many threads you want a crawler to use. Regardless of how many
threads you have running, the frequency of each URL being invoked
will remain dictated by the <delay/> option above. Using more
than one thread is a good idea to ensure the delay is respected
in case you run into single downloads taking more time than the
delay specified. Default is 2 threads.
-->
<numThreads>2</numThreads>
<!-- How many levels deep the crawler can go, i.e., within how many clicks
away from the main page (start URL) a page can be and still be considered.
Beyond the depth specified, pages are rejected.
The starting URLs all have a zero-depth. Default is -1 (unlimited)
-->
<maxDepth>5</maxDepth>
<!-- Stop crawling after how many successfully processed documents.
A successful document is one that is either new or modified, was
not rejected or deleted, and did not generate any error. As an
example, this is a document that will end up in your search engine.
Default is -1 (unlimited)
-->
<maxDocuments>-1</maxDocuments>
<!-- Crawler "work" directory. This is where files downloaded or created as
part of crawling activities (besides logs and progress) get stored.
It should be unique to each crawler.
-->
<!-- <workDir>defined/in/code</workDir> -->
<!-- Keep downloaded files. Default is false.
-->
<keepDownloads>false</keepDownloads>
<!-- What to do with orphan documents. Orphans are valid
documents, which on subsequent crawls can no longer be reached when
running the crawler (e.g. there are no links pointing to that page
anymore). Available options are:
IGNORE (default), DELETE, and PROCESS.
-->
<orphansStrategy>IGNORE</orphansStrategy>
<!-- One or more optional listeners to be notified on various crawling
events (e.g. document rejected, document imported, etc).
Class must implement
com.norconex.collector.core.event.ICrawlerEventListener
-->
<crawlerListeners>
<!-- <listener class="defined.in.code"/> -->
</crawlerListeners>
<!-- Factory class creating a database for storing crawl status and
other information. Classes must implement
com.norconex.collector.core.data.store.ICrawlURLDatabaseFactory.
Default implementation is the following.
-->
<crawlDataStoreFactory class="$dataStoreFactory" />
<!-- Initialize the HTTP client used to make connections. Classes
must implement com.norconex.collector.http.client.IHttpClientFactory.
Default implementation offers many options. The following shows
a sample use of the default with credentials.
-->
<httpClientFactory class="$httpClientFactory">
<ignoreExternalRedirects>true</ignoreExternalRedirects>
<!-- These apply to any authentication mechanism -->
<!--
<authUsername>myusername</authUsername>
<authPassword>mypassword</authPassword>
-->
<!-- These apply to FORM authentication -->
<!--
<authUsernameField>field_username</authUsernameField>
<authPasswordField>field_password</authPasswordField>
<authURL>https://www.example.com/login.php</authURL>
-->
<!-- These apply to both BASIC and DIGEST authentication -->
<!--
<authHostname>www.example.com</authHostname>
<authPort>80</authPort>
<authRealm>PRIVATE</authRealm>
-->
</httpClientFactory>
<!-- Optionally filter URLs BEFORE any download. Classes must implement
com.norconex.collector.core.filter.IReferenceFilter,
like the following examples.
-->
<referenceFilters>
<!-- Exclude images, CSS and JS files -->
<filter class="$filterExtension" onMatch="exclude" caseSensitive="false">jpg,jpeg,gif,png,ico,css,js</filter>
<!-- Exclude dynamic pages containing a query (?) -->
<filter class="$filterRegexRef" onMatch="exclude" caseSensitive="true">.+\?.*</filter>
</referenceFilters>
<!-- Filter BEFORE download with RobotsTxt rules. Classes must
implement *.robot.IRobotsTxtProvider. Default implementation
is the following.
-->
<robotsTxt ignore="false" class="$robotsTxt" />
<!-- Loads sitemap.xml URLs and adds them to URLs to process -->
<sitemap ignore="false" lenient="true" class="$sitemapFactory" />
<!-- Fetches a URL's HTTP headers. Classes must implement
com.norconex.collector.http.fetch.IHttpMetadataFetcher.
The following is a simple implementation.
-->
<metadataFetcher class="$metaFetcher">
<validStatusCodes>200</validStatusCodes>
</metadataFetcher>
<!-- Optionally filter AFTER download of HTTP headers. Classes must
implement com.norconex.collector.core.filter.IMetadataFilter.
-->
<metadataFilters>
<!-- Do not index content-type of CSS or JavaScript -->
<filter class="$filterRegexMeta" onMatch="exclude" caseSensitive="false" field="Content-Type">.*css.*|.*javascript.*</filter>
</metadataFilters>
<!-- Generates a checksum value from document headers to find out if
a document has changed. Class must implement
com.norconex.collector.core.checksum.IMetadataChecksummer.
Default implementation is the following.
-->
<metadataChecksummer disabled="false" keep="false" targetField="collector.checksum-metadata" class="$metaChecksummer" />
<!-- Detect canonical links. Classes must implement
com.norconex.collector.http.url.ICanonicalLinkDetector.
Default implementation is the following.
-->
<canonicalLinkDetector ignore="false" class="${canonLinkDetector}">
<contentTypes>
text/plain, text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp
</contentTypes>
</canonicalLinkDetector>
<!-- Fetches document. Class must implement
com.norconex.collector.http.fetch.IHttpDocumentFetcher.
Default implementation is the following.
-->
<documentFetcher class="$docFetcher">
<validStatusCodes>200</validStatusCodes>
<notFoundStatusCodes>404</notFoundStatusCodes>
</documentFetcher>
<!-- Establish whether to follow a page's URLs or to index a given page
based on in-page meta tag robot information. Classes must implement
com.norconex.collector.http.robot.IRobotsMetaProvider.
Default implementation is the following.
-->
<robotsMeta ignore="false" class="$robotsMeta" />
<!-- Extract links from a document. Classes must implement
com.norconex.collector.http.url.ILinkExtractor.
Default implementation is the following.
-->
<linkExtractors>
<extractor class="${linkExtractor}" ignoreExternalLinks="false" maxURLLength="2048" ignoreNofollow="false" keepReferrerData="true" keepFragment="false">
<contentTypes>text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp</contentTypes>
<tags>
<tag name="a" attribute="href" />
<tag name="frame" attribute="src" />
<tag name="iframe" attribute="src" />
<tag name="img" attribute="src" />
<tag name="meta" attribute="http-equiv" />
<tag name="base" attribute="href" />
<tag name="form" attribute="action" />
<tag name="link" attribute="href" />
<tag name="input" attribute="src" />
</tags>
</extractor>
</linkExtractors>
<!-- Optionally filters a document. Classes must implement
com.norconex.collector.core.filter.IDocumentFilter-->
<documentFilters>
<!-- <filter class="YourClass" /> -->
</documentFilters>
<!-- Optionally process a document BEFORE importing it. Classes must
implement com.norconex.collector.http.doc.IHttpDocumentProcessor.
-->
<preImportProcessors>
<!-- <processor class="YourClass"></processor> -->
</preImportProcessors>
<!-- Import a document. This step calls the Importer module. The
importer is a different module with its own set of XML configuration
options. Please refer to Importer for complete documentation.
Below gives you an overview of the main importer tags.
-->
<importer>
<!--
<tempDir>defined/in/code</tempDir>
<maxFileCacheSize></maxFileCacheSize>
<maxFilePoolCacheSize></maxFilePoolCacheSize>
<parseErrorsSaveDir>defined/in/code</parseErrorsSaveDir>
-->
<preParseHandlers>
<tagger class="${importer}.handler.tagger.impl.DocumentLengthTagger" field="document.size.preparse" />
<!-- These tags can be mixed, in the desired order of execution. -->
<!--
<tagger class="..." />
<transformer class="..." />
<filter class="..." />
<splitter class="..." /> -->
</preParseHandlers>
<!-- <documentParserFactory class="..." /> -->
<postParseHandlers>
<!-- These tags can be mixed, in the desired order of execution. -->
<!-- follow HTML meta-equiv redirects without indexing original page -->
<filter class="${importer}.handler.filter.impl.RegexMetadataFilter" onMatch="exclude" property="refresh">.*</filter>
<!-- Collapse spaces and line feeds -->
<transformer class="${importer}.handler.transformer.impl.ReduceConsecutivesTransformer" caseSensitive="true">
<reduce>\s</reduce>
<reduce>\n</reduce>
<reduce>\r</reduce>
<reduce>\t</reduce>
<reduce>\n\r</reduce>
<reduce>\r\n</reduce>
<reduce>\s\n</reduce>
<reduce>\s\r</reduce>
<reduce>\s\r\n</reduce>
<reduce>\s\n\r</reduce>
</transformer>
<!-- Remove CSS -->
<transformer class="${importer}.handler.transformer.impl.ReplaceTransformer" caseSensitive="false">
<replace>
<fromValue>class=".*?"</fromValue>
<toValue></toValue>
</replace>
</transformer>
<transformer class="${importer}.handler.transformer.impl.StripBetweenTransformer" inclusive="true" >
<stripBetween>
<start>&lt;style.*?&gt;</start>
<end>&lt;/style&gt;</end>
</stripBetween>
<stripBetween>
<start>&lt;script.*?&gt;</start>
<end>&lt;/script&gt;</end>
</stripBetween>
</transformer>
<tagger class="${importer}.handler.tagger.impl.DocumentLengthTagger" field="document.size.postparse" />
<!-- Reject small documents (<100 Bytes)-->
<filter class="${importer}.handler.filter.impl.NumericMetadataFilter" onMatch="exclude" field="document.size.postparse" >
<condition operator="lt" number="100" />
</filter>
<!-- Rename fields -->
<tagger class="${importer}.handler.tagger.impl.RenameTagger">
<rename fromField="Keywords" toField="keywords" overwrite="true" />
</tagger>
<tagger class="${importer}.handler.tagger.impl.DeleteTagger">
<fields>Keywords</fields>
</tagger>
<!-- Unless you configured Solr to accept ANY fields, it will fail
when you try to add documents. Keep only the metadata fields provided, delete all other ones. -->
<tagger class="${importer}.handler.tagger.impl.LanguageTagger" shortText="false" keepProbabilities="false" fallbackLanguage="" />
<tagger class="${importer}.handler.tagger.impl.KeepOnlyTagger">
<fields>content, title, keywords, description, tags, collector.referrer-reference, collector.depth, document.reference, document.language, document.size.preparse, document.size.postparse</fields>
</tagger>
</postParseHandlers>
<!--
<responseProcessors>
<responseProcessor class="..." />
</responseProcessors>
-->
</importer>
<!-- Create a checksum out of a document to figure out if a document
has changed, AFTER it has been imported. Class must implement
com.norconex.collector.core.checksum.IDocumentChecksummer.
Default implementation is the following.
-->
<documentChecksummer class="$docChecksummer" disabled="false" keep="false" targetField="collector.checksum-doc">
<sourceFields>content, title, keywords, description</sourceFields>
</documentChecksummer>
<!-- Optionally process a document AFTER importing it. Classes must
implement com.norconex.collector.http.doc.IHttpDocumentProcessor.
-->
<postImportProcessors>
<!-- <processor class="YourClass"></processor> -->
</postImportProcessors>
<!-- Decide what to do with references that have turned bad.
Class must implement
com.norconex.collector.core.spoil.ISpoiledReferenceStrategizer.
Default implementation is the following.
-->
<spoiledReferenceStrategizer class="$spoiledStrategy" fallbackStrategy="DELETE">
<mapping state="NOT_FOUND" strategy="DELETE" />
<mapping state="BAD_STATUS" strategy="GRACE_ONCE" />
<mapping state="ERROR" strategy="GRACE_ONCE" />
</spoiledReferenceStrategizer>
<!-- Commits a document to a data source of your choice.
This step calls the Committer module. The
committer is a different module with its own set of XML configuration
options. Please refer to committer for complete documentation.
Below is an example using the FileSystemCommitter.
-->
<!--
<committer class="${committer}.core.impl.FileSystemCommitter">
<directory>${workDir}/crawledFiles</directory>
</committer>
-->
<committer class="com.norconex.committer.solr.SolrCommitter">
<solrURL>http://localhost:8983/solr/test1</solrURL>
<solrUpdateURLParams>
<!-- multiple param tags allowed -->
<!--<param name="(parameter name)">(parameter value)</param> -->
</solrUpdateURLParams>
<commitDisabled>false</commitDisabled>
<sourceReferenceField keep="false">
<!-- Optional name of field that contains the document reference, when
the default document reference is not used. The reference value
will be mapped to Solr "id" field, or the "targetReferenceField"
specified.
Once re-mapped, this metadata source field is
deleted, unless "keep" is set to true. -->
document.reference
</sourceReferenceField>
<targetReferenceField>
<!-- Name of the Solr target field where to store the document's unique
identifier (idSourceField). If not specified, default is "id". -->
id
</targetReferenceField>
<!--
<sourceContentField keep="[false|true]">
(If you wish to use a metadata field to act as the document
"content", you can specify that field here. Default
does not take a metadata field but rather the document content.
Once re-mapped, the metadata source field is deleted,
unless "keep" is set to true.)
</sourceContentField>
-->
<targetContentField>
<!-- Solr target field name for a document content/body.
Default is: content -->
content
</targetContentField>
<!-- <queueDir>defined/in/code</queueDir> -->
<queueSize>10</queueSize>
<commitBatchSize>5</commitBatchSize>
<maxRetries>1</maxRetries>
<maxRetryWait>5000</maxRetryWait>
</committer>
</crawlerDefaults>
<!-- Individual crawlers can be defined here. All crawler default
configuration settings will apply to all crawlers created unless
explicitly overwritten in crawler configuration.
For configuration options where multiple items can be present
(e.g. filters), the whole list in the crawler defaults will be
overwritten.
Since the options are the same as the defaults above, the documentation
is not repeated here.
The only difference from "crawlerDefaults" is the addition of the "id"
attribute on the crawler tag. The "id" attribute uniquely identifies
each of your crawlers.
-->
<crawlers>
<crawler id="test crawler">
<startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
<url>http://www.etools.ch</url>
</startURLs>
</crawler>
</crawlers>
</httpcollector>
Since I posted the config.xml, another question concerning taggers comes to mind: as you can see, I use a Solr committer and limit the fields to the ones defined with the KeepOnlyTagger. Unfortunately, some test documents contain a field named 'Keywords' instead of 'keywords', but the Solr schema fields are case-sensitive and therefore Solr rejects such documents. The (ugly) workaround is to define a RenameTagger and a DeleteTagger, but did I miss a better way to simply lowercase all fields configured in the KeepOnlyTagger?
About http vs https, you can use the "secureScheme" normalization rule and it will convert all http into https.
Does this also work with sites supporting only http?
My idea was more like: if we fetched the same page with both the http and https schemes, only the https version should be indexed.
Thanks for the tip about using the RegexReferenceFilter instead of enabling the stayOnDomain attribute. In general it works, but what I now see in the log are sporadic errors or warnings about external references, e.g. erroneous attempts to download robots.txt or cookie warnings. It seems that with only the RegexReferenceFilter in place, external references don't get imported (they get rejected), but the robots.txt of the external site is still fetched. Can I avoid that?
Forward-slashes being messed up: I was able to reproduce with your config. I will investigate.
Field names of mixed character cases: the CharacterCaseTagger takes care of changing the character case of field values to your liking, but not field names. It should be easy to allow the same thing for field names. I created this feature request: #166.
Favor one scheme when both exist for the same page: There is no way to do this right now, since it would require a second pass through all crawled URLs once crawling is complete and every URL has been found. But by then, unless you create a committer that sends its documents only at the very end of a crawl, your documents will already have been sent to Solr (we could send deletion requests, but that would be an ugly hack). Your best bet is to review the site you crawl, check whether all http pages can also be accessed via https, and if so simply force the normalization to https. Otherwise, I suspect implementing this feature could take some time. One idea would be that for each http page, the crawler could optionally check whether an https version exists, but that would mean an extra call every time, which is not ideal. You can create a new issue for this one and make it a feature request if you really see the benefit, but I am not guaranteeing anything on that one yet.
Thanks for your investigation!
The CharacterCaseTagger looks like the right place to do that.
Concerning the preference of https vs. http: there is really no need to create a feature request for this specific feature, I just asked in case I had overlooked something. I think I will first check whether the seed URL is available over https; if so, chances are good that the rest of the pages are also available over https.
One last question: I realized that the Solr committer creates a lot of commit queue subdirectories ordered by date and time. After a successful run, all directories are empty but stay there. Is there a reason for that, or can I simply delete these directories afterwards? For a huge run, it would probably be better to delete each directory after a successful commit. Maybe this feature could be added to the AbstractBatchCommitter or AbstractMappedCommitter?
There is a new snapshot with the fix to the URLNormalizer as well as the character case tagger (see ticket #166). Please test and report.
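Roughly, the field-name casing configuration should look like this once you are on that snapshot (a sketch only; double-check the CharacterCaseTagger documentation for the exact attribute names):

<tagger class="com.norconex.importer.handler.tagger.impl.CharacterCaseTagger">
  <!-- lower-case the field NAME so "Keywords" becomes "keywords" -->
  <characterCase fieldName="Keywords" type="lower" applyTo="field" />
</tagger>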
For the Solr Committer, it is supposed to delete empty directories that are old enough (so they do not risk being written into again). I suppose the last ones written before the crawler ends are never deleted. You can certainly delete them when crawling is over. I suggest you open a separate issue for this, either in the Solr Committer project or in Committer Core.
Now the URLNormalizer no longer produces invalid URLs, which was sometimes the problem when specifying addTrailingSlash.
Concerning the duplicate pages, I can still see both versions, e.g. http://www.etools.ch vs. http://www.etools.ch/, but only when using http://www.etools.ch as the start URL. I can easily change the start URL myself, which fixes this corner case.
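That is, seeding with the trailing-slash form instead:

<startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
  <url>http://www.etools.ch/</url>
</startURLs>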
After investigating the deletion process of old commit queue directories, I have to admit that I didn't refresh or wait long enough: the empty directories get deleted after about 10 seconds.
Maybe the name of the rule is not self-explanatory enough. It will not add a forward slash to EVERY URL, only to those where the last segment "looks like" a directory. It is mentioned in the doc here.
So think of it as addTrailingSlashToURLsThatLooksLikeDirectories. :-)
Since it is otherwise working for you, I will close this issue. Re-open it or create a new one if you run into another problem.
First let me thank you for this wonderful piece of software!
I am using 2.3.0-SNAPSHOT and would like to avoid duplicate pages like http://example.com and http://example.com/.
So I tried to configure <urlNormalizer> with the following normalizations: lowerCaseSchemeHost, upperCaseEscapeSequence, decodeUnreservedCharacters, removeDefaultPort, removeFragment, removeDotSegments, addTrailingSlash, removeDuplicateSlashes, removeSessionIds, upperCaseEscapeSequence. I hoped that the combination of addTrailingSlash and removeDuplicateSlashes would help to solve this issue, but unfortunately this produced ClientProtocolExceptions like 'URI does not specify a valid host name: http:/www.etools.ch//robots.txt' or the following log entry:
(three wrong slashes at two places)
Removing the normalization enum addTrailingSlash solves this issue, but this configuration option should behave predictably.
It would also be nice if http://example.com/ and https://example.com/ were reduced to only the HTTPS version, assuming the metadata and content MD5 checksums are equal.
BTW: is it possible to also retrieve external references, even with the setting <startURLs stayOnDomain="true">? I tried to get these external references with the crawler event type "URLS_EXTRACTED", but I only see domain-specific references in the corresponding HashSet. I don't want to crawl the external references, just record them.