Closed: OkkeKlein closed this issue 9 years ago.
It does both. It checks the Last-Modified date in the HTTP headers before downloading the document. If the document has changed, or if no Last-Modified header was present, it downloads it and, at a later point, performs a content checksum.
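A minimal sketch of that two-stage decision logic in plain Java (class and method names are made up for illustration; this is not the collector's actual code):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Objects;

// Sketch of the two-stage change detection described above:
// stage 1 compares the HTTP Last-Modified header before downloading;
// stage 2 downloads and compares a content checksum.
public class ChangeDetector {

    // Stage 1: skip the download only when both dates are present and equal.
    public static boolean needsDownload(String cachedLastModified, String newLastModified) {
        if (cachedLastModified == null || newLastModified == null) {
            return true; // no Last-Modified header available: must download
        }
        return !cachedLastModified.equals(newLastModified);
    }

    // Stage 2: compare an MD5 checksum of the downloaded content with the
    // checksum stored from the previous crawl.
    public static boolean contentChanged(byte[] content, String cachedChecksum) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return !Objects.equals(hex.toString(), cachedChecksum);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available on the JVM
        }
    }

    public static void main(String[] args) {
        String date = "Tue, 01 Jan 2015 00:00:00 GMT";
        System.out.println(needsDownload(date, date)); // false: dates match, skip download
        System.out.println(needsDownload(null, date)); // true: no cached date, download
    }
}
```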
I am using a lot of titles and other metadata from referrers. If those change, the link being referred to is not re-downloaded, because its Last-Modified did not change. A checksum (MD5) built from multiple tag values would fix that.
I am marking this issue as a feature request: being able to specify multiple fields at once for checksum creation.
In the meantime you can accomplish the same behavior by relying on the document checksummer, after merging the fields you want yourself. An example:
<crawler id="MyCrawler">
  ...
  <importer>
    <postParseHandlers>
      <tagger class="com.norconex.importer.handler.tagger.impl.CopyTagger">
        <copy fromField="Last-Modified" toField="combinedValues" overwrite="false" />
        <copy fromField="AnotherField" toField="combinedValues" overwrite="false" />
      </tagger>
      <tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
        <singleValue field="combinedValues" />
      </tagger>
    </postParseHandlers>
  </importer>
  <documentChecksummer class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer"
      sourceField="combinedValues" />
  ...
</crawler>
As an alternative, you can create your own Metadata Checksummer by extending AbstractMetadataChecksummer.
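The essence of such a custom checksummer, hashing the concatenation of several metadata values in a fixed order, can be sketched in plain Java without the Norconex API (all names here are illustrative; a real implementation would put this logic inside an AbstractMetadataChecksummer subclass):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a multi-field metadata checksum: concatenate the selected
// field values in a fixed order and MD5-hash the result, so a change
// in any one field changes the checksum.
public class MultiFieldChecksum {

    public static String checksum(Map<String, String> metadata, String... fields) {
        StringBuilder combined = new StringBuilder();
        for (String field : fields) {
            String value = metadata.get(field);
            if (value != null) {
                // Include the field name as a separator so "a=bc" and "ab=c"
                // cannot collide.
                combined.append(field).append('=').append(value).append(';');
            }
        }
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest(combined.toString().getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available on the JVM
        }
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("Last-Modified", "Tue, 01 Jan 2015 00:00:00 GMT");
        meta.put("title", "Example Page");
        String before = checksum(meta, "Last-Modified", "title");
        meta.put("title", "Renamed Page");
        String after = checksum(meta, "Last-Modified", "title");
        System.out.println(before.equals(after)); // false: the title change alters the checksum
    }
}
```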
Nice solution.
When I try to recrawl, the starting URL is rejected as not modified, which is fine, but the other links stored in MongoDB are never checked for modification. The crawler just ends.
Tested with both -start and -resume in HTTP Collector 2.1.0.
Is this happening for you only with the Mongo implementation? It looks like a bug to me. Since this Mongo issue is not related to your original question, can you open a new issue for it? Thanks.
Cannot reproduce, so closing the issue.
Let's keep it open until the feature-request gets implemented (i.e. "specify multiple fields at once for checksum creation").
metadataChecksummer: there is now a new class for specifying multiple fields for a metadata checksum: GenericMetadataChecksummer.
documentChecksummer: MD5DocumentChecksummer was modified to accept multiple fields for a checksum.
Those are in the 2.2.0 official release.
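For reference, a configuration sketch of the new metadata checksummer. The exact element and attribute names should be verified against the 2.2.0 documentation; `sourceFields` here is my reading of the release notes, not confirmed syntax:

```xml
<metadataChecksummer
    class="com.norconex.collector.core.checksum.impl.GenericMetadataChecksummer">
  <sourceFields>Last-Modified,title,AnotherField</sourceFields>
</metadataChecksummer>
```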
Does the crawler look only at the HttpMetadataChecksummer, or also at the documentChecksummer, to decide whether to re-download pages?
A combination of content and modified date would give better indication whether redownload is needed in my use case.