Closed: OkkeKlein closed this issue 9 years ago.
It does both. It checks the Last-Modified date in the HTTP headers before downloading the document. If the document has changed, or if no Last-Modified header was present, it downloads it and, at a later point, performs a content checksum.
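A minimal sketch of that two-stage decision logic in plain Java (class and method names are made up for illustration; this is not the collector's actual code):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Objects;

// Sketch of the two-stage change detection described above:
// stage 1 compares the HTTP Last-Modified header before downloading;
// stage 2 downloads and compares a content checksum.
public class ChangeDetector {

    // Stage 1: skip the download only when both dates are present and equal.
    public static boolean needsDownload(String cachedLastModified, String newLastModified) {
        if (cachedLastModified == null || newLastModified == null) {
            return true; // no Last-Modified header available: must download
        }
        return !cachedLastModified.equals(newLastModified);
    }

    // Stage 2: compare an MD5 checksum of the downloaded content with the
    // checksum stored from the previous crawl.
    public static boolean contentChanged(byte[] content, String cachedChecksum) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return !Objects.equals(hex.toString(), cachedChecksum);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available on the JVM
        }
    }

    public static void main(String[] args) {
        String date = "Tue, 01 Jan 2015 00:00:00 GMT";
        System.out.println(needsDownload(date, date)); // false: dates match, skip download
        System.out.println(needsDownload(null, date)); // true: no cached date, download
    }
}
```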
I am using a lot of titles and other metadata from referrers. If those change, the link being referred to is not re-downloaded, because its Last-Modified did not change. A checksum (MD5) built from multiple tag values would fix that.
I am marking this issue as a feature request: being able to specify multiple fields at once for checksum creation.
In the meantime you can accomplish the same behavior by relying on the document checksummer, after merging the fields you want yourself. An example:
<crawler id="MyCrawler">
  ...
  <importer>
    <postParseHandlers>
      <tagger class="com.norconex.importer.handler.tagger.impl.CopyTagger">
        <copy fromField="Last-Modified" toField="combinedValues" overwrite="false" />
        <copy fromField="AnotherField" toField="combinedValues" overwrite="false" />
      </tagger>
      <tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
        <singleValue field="combinedValues" />
      </tagger>
    </postParseHandlers>
  </importer>
  <documentChecksummer class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer"
      sourceField="combinedValues" />
  ...
</crawler>
As an alternative, you can create your own Metadata Checksummer by extending AbstractMetadataChecksummer.
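The essence of such a custom checksummer, hashing the concatenation of several metadata values in a fixed order, can be sketched in plain Java without the Norconex API (all names here are illustrative; a real implementation would put this logic inside an AbstractMetadataChecksummer subclass):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a multi-field metadata checksum: concatenate the selected
// field values in a fixed order and MD5-hash the result, so a change
// in any one field changes the checksum.
public class MultiFieldChecksum {

    public static String checksum(Map<String, String> metadata, String... fields) {
        StringBuilder combined = new StringBuilder();
        for (String field : fields) {
            String value = metadata.get(field);
            if (value != null) {
                // Include the field name as a separator so "a=bc" and "ab=c"
                // cannot collide.
                combined.append(field).append('=').append(value).append(';');
            }
        }
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest(combined.toString().getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available on the JVM
        }
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("Last-Modified", "Tue, 01 Jan 2015 00:00:00 GMT");
        meta.put("title", "Example Page");
        String before = checksum(meta, "Last-Modified", "title");
        meta.put("title", "Renamed Page");
        String after = checksum(meta, "Last-Modified", "title");
        System.out.println(before.equals(after)); // false: the title change alters the checksum
    }
}
```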
Nice solution.
When I try to recrawl, the starting URL is rejected as not modified, which is fine, but the other links stored in MongoDB are never checked for modification. The crawler just ends.
Tested with both -start and -resume in HTTP Collector 2.1.0.
Is this happening for you only with the Mongo implementation? It looks like a bug to me. Since this Mongo issue is not related to your original question, can you open a new issue for it? Thanks.
Cannot reproduce, so closing the issue.
Let's keep it open until the feature-request gets implemented (i.e. "specify multiple fields at once for checksum creation").
metadataChecksummer: there is now a new class for specifying multiple fields for a metadata checksum: GenericMetadataChecksummer.
documentChecksummer: MD5DocumentChecksummer was modified to accept multiple fields for a checksum.
Those are in the 2.2.0 official release.
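For reference, a configuration sketch of the new metadata checksummer. The exact element and attribute names should be verified against the 2.2.0 documentation; `sourceFields` here is my reading of the release notes, not confirmed syntax:

```xml
<metadataChecksummer
    class="com.norconex.collector.core.checksum.impl.GenericMetadataChecksummer">
  <sourceFields>Last-Modified,title,AnotherField</sourceFields>
</metadataChecksummer>
```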
Does the crawler look only at the HttpMetadataChecksummer, or also at the documentChecksummer, to decide whether to re-download pages?
A combination of content and modified date would give better indication whether redownload is needed in my use case.