Closed AntonioAmore closed 10 years ago
Hello,
A simpler approach might be to configure DefaultHttpDocumentChecksummer to use the document URL as the checksum value (the field being "document.reference"). You can also use the DefaultHttpHeadersChecksummer if you do not want to download and extract URLs from documents you already processed. Here is an example with DefaultHttpDocumentChecksummer:
<httpDocumentChecksummer
    class="com.norconex.collector.http.checksum.impl.DefaultHttpDocumentChecksummer"
    field="document.reference" />
That way, if the same URL is encountered on subsequent runs, the collector won't send it to the committer, since it will consider the document unmodified (regardless of whether it actually was or not).
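To illustrate why this works, here is a minimal Java sketch of the idea. It is not Norconex code: the class `UrlChecksumSketch`, its methods, and the example URL are all hypothetical, and it only assumes that the collector compares a document's checksum with the one stored on a previous run.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the collector's checksum comparison; not Norconex code. */
public class UrlChecksumSketch {
    // reference (URL) -> checksum remembered from earlier runs
    private final Map<String, String> previous = new HashMap<>();

    /** With the URL as the checksum value, the content is deliberately ignored. */
    public static String checksum(String url, String content) {
        return url;
    }

    /** Returns true if the document would be sent to the committer. */
    public boolean shouldCommit(String url, String content) {
        String current = checksum(url, content);
        String old = previous.put(url, current); // null the first time a URL is seen
        return !current.equals(old);
    }

    public static void main(String[] args) {
        UrlChecksumSketch collector = new UrlChecksumSketch();
        // First encounter: no stored checksum, so the page counts as new.
        System.out.println(collector.shouldCommit("http://example.com/a", "v1"));
        // Later run: content changed, but the checksum (the URL) did not.
        System.out.println(collector.shouldCommit("http://example.com/a", "v2"));
    }
}
```

In this sketch the first call returns true and the second returns false, which mirrors the "new pages only" behaviour: a changed page is still treated as unmodified because its URL is the same.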
Let me know if that works for you.
Thank you!
This solution shows how beautiful the collector's architecture is, and it may really work for me.
What about httpHeadersChecksummer: should I turn it off to get the effect I described above, or does it not interfere?
If you do not care that a document may have changed and introduced new URLs to follow (which may lead to new documents), I suggest you apply the logic I mentioned at the HTTP header level to minimize downloads. You can set the URL as the checksum value at that point as well, except that the field is "collector.http.url". That would give:
<httpHeadersChecksummer
    class="com.norconex.collector.http.checksum.impl.DefaultHttpHeadersChecksummer"
    field="collector.http.url" />
It does not hurt to have both set so you cover all angles.
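As a rough picture of where the two checks sit, here is a simplified Java model. It is only a sketch under assumptions taken from this thread (the headers check runs before the download, the document check before the commit); the class `TwoStageChecksumSketch`, its `Outcome` values, and the example URL are hypothetical and not part of the collector.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the two checksum stages; not Norconex code. */
public class TwoStageChecksumSketch {
    public enum Outcome { SKIPPED_AT_HEADERS, SKIPPED_AT_DOCUMENT, COMMITTED }

    // Checksums (here: URLs) remembered by each stage.
    private final Set<String> seenHeaderChecksums = new HashSet<>();
    private final Set<String> seenDocumentChecksums = new HashSet<>();
    public int downloads = 0;

    public Outcome process(String url) {
        // Stage 1: headers checksum, evaluated before the body is downloaded.
        if (!seenHeaderChecksums.add(url)) {
            return Outcome.SKIPPED_AT_HEADERS;
        }
        downloads++; // the document is fetched only past this point
        // Stage 2: document checksum, evaluated before committing.
        if (!seenDocumentChecksums.add(url)) {
            return Outcome.SKIPPED_AT_DOCUMENT;
        }
        return Outcome.COMMITTED;
    }

    public static void main(String[] args) {
        TwoStageChecksumSketch collector = new TwoStageChecksumSketch();
        System.out.println(collector.process("http://example.com/a")); // first sighting
        System.out.println(collector.process("http://example.com/a")); // already known
        System.out.println("downloads: " + collector.downloads);
    }
}
```

In this model a known URL is rejected at the headers stage, so it is downloaded only once; the document stage acts as a second line of defence if a URL ever gets past the first check.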
Does this work for you now? Can we close this issue?
Hello,
It is a nice solution for my requirements. It really helps. Thanks a lot! The collector seems to be very flexible and well-designed software.
Sure, you may close the issue.
Thanks for the good words!
Hello,
I plan to write my own committer, which writes only new (not updated) pages to a local filesystem.
Reading the FileSystemCommitter sources, I can't figure out how to recognize that a page is new. Is it possible in a committer? How can I detect it using collector/committer methods, without external data storage?
Some ideas of mine that I don't like:
Thanks a lot.