Norconex / committer-core

Norconex Committer is a Java library and command-line application used to route content to local or remote target repositories, such as a search engine index.
http://www.norconex.com/collectors/committer-core
Apache License 2.0

Committer advanced logic #2

Closed AntonioAmore closed 10 years ago

AntonioAmore commented 10 years ago

Hello,

I plan to write my own committer that writes only new (not updated) pages to the local filesystem.

Reading the FileSystemCommitter source, I can't see how to tell whether a page is new. Is that possible in a committer? How can I detect it using collector/committer methods, without an external data store?

Some ideas I have but don't like:

  1. I can store all page hashes/URLs and compare against them externally, but it would be much better to reuse the collector's database to save space.
  2. Check the reference argument passed to the committer's queueAdd() method, but how do I ask the collector's database whether that reference is new or just updated?

Thanks a lot.

essiembre commented 10 years ago

Hello,

A simpler approach might be to configure DefaultHttpDocumentChecksummer to use the document URL as the checksum value (the field being "document.reference"). You can also use DefaultHttpHeadersChecksummer if you do not want to download and extract URLs from documents you have already processed. Here is an example with DefaultHttpDocumentChecksummer:

  <httpDocumentChecksummer
      class="com.norconex.collector.http.checksum.impl.DefaultHttpDocumentChecksummer"
      field="document.reference" />

That way, if the same URL is encountered on subsequent runs, it won't be sent to the committer, since the collector will consider it unmodified (regardless of whether it actually changed).
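To illustrate why this works, here is a minimal sketch of the comparison, not the collector's actual code. The class `ReferenceChecksumSketch` and its methods are hypothetical names made up for this example, and the in-memory map only stands in for the collector's own store of checksums from previous runs. A real checksummer would typically hash the field value, but the outcome of the comparison is the same.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the collector's checksum comparison; not a Norconex class.
class ReferenceChecksumSketch {

    // reference (URL) -> checksum recorded the last time the URL was seen
    private final Map<String, String> previousChecksums = new HashMap<>();

    // With the reference as the checksum source, the checksum is the URL itself,
    // so the page content never influences the comparison.
    static String checksum(String reference, String content) {
        return reference;
    }

    // Returns true when the document would be passed on to the committer.
    boolean shouldCommit(String reference, String content) {
        String current = checksum(reference, content);
        String previous = previousChecksums.put(reference, current);
        return !current.equals(previous);
    }

    public static void main(String[] args) {
        ReferenceChecksumSketch crawl = new ReferenceChecksumSketch();
        // New URL: prints true
        System.out.println(crawl.shouldCommit("http://example.com/a", "v1"));
        // Same URL with changed content: prints false
        System.out.println(crawl.shouldCommit("http://example.com/a", "v2"));
        // Another new URL: prints true
        System.out.println(crawl.shouldCommit("http://example.com/b", "v1"));
    }
}
```

Because the content is ignored, an updated page looks unmodified and only URLs never seen before reach the committer.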

Let me know if that works for you.

AntonioAmore commented 10 years ago

Thank you!

This solution shows how beautiful the collector's architecture is, and it may really work for me.

What about httpHeadersChecksummer: should I turn it off to get the effect I described above, or does it not interfere?

essiembre commented 10 years ago

If you do not care that a document may have changed and introduced new URLs to follow (which may lead to new documents), I suggest you apply the logic I mentioned at the HTTP header level to minimize downloads. You can set the URL as the checksum value at that point too, except that the field is "collector.http.url". That would give:

  <httpHeadersChecksummer
      class="com.norconex.collector.http.checksum.impl.DefaultHttpHeadersChecksummer"
      field="collector.http.url" />

It does not hurt to have both set so you cover all angles.
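The trade-off between the two checks can be sketched as follows. This is a simplified model under stated assumptions, not the collector's implementation: `TwoStageCheckSketch`, `process`, and the `downloads` counter are hypothetical names, and the two sets stand in for whatever the collector remembers between runs.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the header-level and document-level checks; not a Norconex class.
class TwoStageCheckSketch {

    private final Set<String> seenAtHeaderStage = new HashSet<>();
    private final Set<String> seenAtDocumentStage = new HashSet<>();

    // Counts how many times a page body was fetched.
    int downloads = 0;

    // Returns true if the page would reach the committer.
    boolean process(String url, boolean headerCheckEnabled) {
        // Stage 1: evaluated on the HTTP headers, before the body is fetched.
        if (headerCheckEnabled && !seenAtHeaderStage.add(url)) {
            return false; // known URL: no download, so no new links are extracted either
        }
        downloads++; // the body is fetched here, and links would be extracted from it
        // Stage 2: evaluated on the downloaded document.
        return seenAtDocumentStage.add(url);
    }

    public static void main(String[] args) {
        String url = "http://example.com/a";

        TwoStageCheckSketch both = new TwoStageCheckSketch();
        both.process(url, true);
        both.process(url, true);
        System.out.println(both.downloads); // prints 1

        TwoStageCheckSketch documentOnly = new TwoStageCheckSketch();
        documentOnly.process(url, false);
        documentOnly.process(url, false);
        System.out.println(documentOnly.downloads); // prints 2
    }
}
```

In both setups the second visit never reaches the committer. The header-level check additionally avoids the repeat download, at the cost of not discovering links added to the page since the previous run.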

essiembre commented 10 years ago

Does this work for you now? Can we close this issue?

AntonioAmore commented 10 years ago

Hello,

It is a nice solution for my requirements. It really helps. Thanks a lot! The collector seems to be very flexible and well-designed software.

Sure, you may close the issue.

essiembre commented 10 years ago

Thanks for the good words!