Good morning,
Just to confirm I understand properly. You want to:
1- Monitor a bunch of web sites (100 or more), but not necessarily in real time.
2- Pick up any new pages from those web sites (not updated ones, but new ones only)
3- Get the text and just the text of those new pages.
4- Be able to add (or remove) from the list of sites to monitor
If that is it, then yes, it is definitely doable with the http-collector. Is there more that I am missing?
A few quick questions before I start digging into this further:
1- Where do you want that “text” to go once it has been retrieved?
2- How big are the sites you want to monitor?
3- How often do you want those sites to be monitored? (you said not real time but is it daily, hourly, etc.)
Regards,
David
Hi David,
Thank you for the quick reply. You understood me correctly.
Please find the answers to your questions here:
I should note that filtering out HTML tags can also be done on my application's side, so it isn't critical.
Thanks again,
Anton
"So I may write communication pipe between the crawler and the app using some well known techniques"
For the http-collector, a communication pipe between the crawler and the app would be a "committer", which you can specify in the collector's configuration. There are three open-source committers available right now: Solr, IDOL, and Elasticsearch, but a custom one should be really easy to write if you want to go directly to your application.
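Just to make that concrete, here is a rough, purely illustrative Java sketch of what such a pipe could do: push each crawled document's text to your application over HTTP. The class and method names below are made up for the example (it is not the actual ICommitter interface), and the endpoint URL is an assumption; check the committer library's Javadoc for the real signatures.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative sketch only: a committer-like hook that forwards each crawled
// document's text to another application over HTTP. The real extension point
// is the ICommitter interface in the Norconex Committer library; the names
// and signatures here are invented for the example.
public class MyAppCommitterSketch {

    private static final String ENDPOINT = "http://localhost:8080/ingest"; // assumed application endpoint

    // Would be called for every document the crawler wants to add or update.
    public void add(String url, InputStream textContent) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(ENDPOINT).openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "text/plain; charset=UTF-8");
        con.setRequestProperty("X-Source-Url", url);
        try (InputStream in = textContent; OutputStream out = con.getOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        if (con.getResponseCode() >= 300) {
            throw new IOException("Ingest failed for " + url + ": HTTP " + con.getResponseCode());
        }
    }
}
```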
For the rest, monitoring sites is pretty much what the http-collector does, and the default config file that comes with it is good enough. Simply schedule the collector with either cron or the Windows scheduler (depending on your platform) to run on a daily basis. If a site is big, it may take a while to go over the entire site; if it is small, you can schedule it more often. You can have one running http-collector per site, with one config file per site, which makes adding or removing sites as simple as adding or removing config files. Or you can put them all in one config file and have only one collector running, to which you can add or remove sites (and restart the collector).
As for getting only the new pages, that one I am not sure about. Without looking too deeply into the code, I would probably address that in the destination repository. In Solr, using the Solr committer for example, it would be pretty simple to do a date-range query to see when something was added or updated (new vs. updated).
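To illustrate, a date-range query from a client application could look roughly like the SolrJ sketch below. It assumes a date field (here called "crawl_date", an invented name) is stored when documents are committed; adjust the field name, core URL, and timestamp to your own setup.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

// Sketch: list documents indexed since the last run, based on an assumed
// "crawl_date" field added at indexing time.
public class NewPagesQuery {
    public static void main(String[] args) throws Exception {
        String lastRun = "2014-06-05T00:00:00Z"; // e.g. the time of the previous crawl

        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery("crawl_date:[" + lastRun + " TO NOW]");
        q.setRows(100);

        QueryResponse rsp = solr.query(q);
        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("id"));
        }
        solr.close();
    }
}
```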
Hope this helps.
David
David,
Thank you for the answers, it really helps.
Will the http-collector fetch all pages every time it runs for the same site? I'm trying to estimate the time and disk space used per large site.
Am I right to suppose that the http-collector has its own database of visited URLs, like a graph, in addition to the Solr (or other commit destination) index?
1- Will the http-collector fetch all pages every time it runs for the same site? It will check whether every page has changed, yes, but it will not necessarily download each one every time.
2- Am I right to suppose that the http-collector has its own database of visited URLs, like a graph, in addition to the Solr (or other commit destination) index?
Yes, the http-collector has its own data store of what has been crawled.
Thanks a lot for your help, David,
I'm getting more familiar with collector-http, and it seems I will be able to build the software I need with the crawler + a custom committer.
If you get a chance to find out how to recognize a new page in the committer, please let me know; I'll try to figure it out myself as well.
Have a nice day!
Best regards, Anton
PS should I close the issue, or mark it solved?
Sorry for jumping in...
If you are curious, the "default" database that the collector uses to store the list of visited URLs is MapDB: http://www.mapdb.org/
Martin
To add to David's answers about crawling only new documents, the HTTP Collector stores a checksum of each document to find out whether it is new or has changed. You can write your own checksum logic, but the one provided out of the box is usually sufficient. There are two kinds of checksum mechanisms: 1. HTTP-header based, and 2. content based. If, for instance, the last-modified date is always valid in the HTTP response header, you may use that value to check whether a page has been modified (or is new). That way, the collector won't have to download the entire content to find out, which saves both you and the remote web server bandwidth. If you can't rely on the HTTP header date (often the case with many web sites), the checksum based on content is probably what you need.
Here are two classes that may help you get started and how to configure them if you want to fine-tune the defaults:
HTTP Header Checksum: http://www.norconex.com/product/collector-http/apidocs/com/norconex/collector/http/checksum/impl/DefaultHttpHeadersChecksummer.html
HTTP Content Checksum: http://www.norconex.com/product/collector-http/apidocs/com/norconex/collector/http/checksum/impl/DefaultHttpDocumentChecksummer.html
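For the curious, the idea behind the content-based mechanism is nothing more than the toy illustration below: hash the text, remember the hash per URL, and treat a page as new or modified when the hash is absent or different. This is not the collector's actual checksummer code, just the concept.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Toy illustration of a content-based checksum check (not the collector's
// own implementation). The collector keeps such checksums in its data store.
public class ChecksumDemo {

    private final Map<String, String> knownChecksums = new HashMap<>(); // url -> checksum

    // Returns true if the page is new or its content changed since the last crawl.
    public boolean isNewOrModified(String url, String pageText) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(pageText.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        String checksum = hex.toString();
        String previous = knownChecksums.put(url, checksum);
        return previous == null || !previous.equals(checksum);
    }
}
```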
Thank you, guys!
You have provided precious information here. I may use the crawler's MapDB and skip Solr to save resources, operating on the crawled data with the database's SQL-like query language. I guess the DB may have a field like 'added' or 'fetch_date'.
Also, the content-based checksum should play nicely. A custom committer will save all crawled pages as plain text to disk (of course skipping pages already crawled, based on the checksum); I'll somehow link those files to the query results and process the resulting subset of files. Then the directory can be cleared, ready for the next crawl.
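Something like the standalone sketch below is what I have in mind for the plain-text store part. It is not a Norconex class; the real custom committer would call it from its add hook, and the file and field names are just my working assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Instant;
import java.util.UUID;

// Working sketch (not a Norconex class): store each crawled page as a plain
// text file and keep a small index of fetch time, URL and file name so the
// files can later be matched against date queries.
public class PlainTextStore {

    private final Path outputDir;
    private final Path indexFile; // one line per page: timestamp TAB url TAB file

    public PlainTextStore(Path outputDir) throws IOException {
        this.outputDir = Files.createDirectories(outputDir);
        this.indexFile = outputDir.resolve("fetch-index.tsv");
    }

    public void save(String url, String text) throws IOException {
        Path file = outputDir.resolve(UUID.randomUUID() + ".txt");
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
        String line = Instant.now() + "\t" + url + "\t" + file.getFileName() + "\n";
        Files.write(indexFile, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```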
It seems the architecture draft is ready, so I can create an architectural spike. Would you like to receive feedback here?
Thanks to all of you again, I like this crawler more and more :)
All feedback is always welcome! And code contributions also :)
Anything you come up with that the community could enjoy, please don’t hesitate to post it back.
David
Feedback is always appreciated! For the link database, you can switch it to an Apache Derby (for SQL) or MongoDB implementation (or even create your own):
<crawlURLDatabaseFactory
    class="com.norconex.collector.http.db.impl.derby.DerbyCrawlURLDatabaseFactory" />
or
<crawlURLDatabaseFactory
    class="com.norconex.collector.http.db.impl.mongo.MongoCrawlURLDatabaseFactory" />
To save crawled files to the filesystem, you may want to have a look at the FileSystemCommitter in the base committer library (packaged with the HTTP Collector). It does just that: it saves documents to a location of your choice on the filesystem. Give it a try first in case it already suits your needs: http://www.norconex.com/product/committer/apidocs/com/norconex/committer/impl/FileSystemCommitter.html
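If you end up post-processing that folder yourself, the consuming side can be as simple as the sketch below. Note that the actual directory layout and file naming used by the FileSystemCommitter should be verified in its Javadoc; the output location here is an assumption, and the sketch only shows the "pick up files newer than the last run" idea.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;
import java.util.stream.Stream;

// Sketch: walk the committer's output directory and report files written
// since the previous run. Assumes an output location of "committer-output".
public class CommittedFilesReader {
    public static void main(String[] args) throws IOException {
        Path committerOutput = Paths.get("committer-output"); // assumed location
        Instant lastRun = Instant.parse("2014-06-05T00:00:00Z");

        try (Stream<Path> files = Files.walk(committerOutput)) {
            files.filter(Files::isRegularFile)
                 .filter(p -> modifiedAfter(p, lastRun))
                 .forEach(p -> System.out.println("New since last run: " + p));
        }
    }

    private static boolean modifiedAfter(Path p, Instant cutoff) {
        try {
            return Files.getLastModifiedTime(p).toInstant().isAfter(cutoff);
        } catch (IOException e) {
            return false;
        }
    }
}
```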
I am closing this ticket since you seem to have the answers you need to get going, but I am pretty sure you can still post your feedback here regardless. Just open another one if you face new issues/challenges.
I would like to know whether collector-http can fit the requirements of the following task (I appreciate your precious time, and I read the documentation first, but couldn't find some nuances):
A set of hundreds of web sites needs to be monitored for new (not updated) content that appears after a fixed date (starting now, or in the near future), and the detected pages' text (and only the text, cleaned of markup and ads as much as possible) must be delivered to another piece of software in some reasonable way: CSV, JSON, socket transmission, etc. Monitoring starts at a given moment and continues for a fixed time period; no real-time reaction is needed. The list of sites may be extended over time, but a newly added site's content should not be completely replicated: as described above, I only need the pages that are new as of the date the site was added to the list.
I don't expect configuration files/plugins/etc. to be fully written for me, but please point me in the right direction and tell me whether this is possible at all with collector-http. It is a very specific task, and none of the crawlers I checked provide this functionality out of the box, nor any way to derive it from their documentation.
Thank you for your assistance; I hope this nice software will help me with my research tasks.