Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and committing it to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Tune up collector-http for specific features #27

Closed AntonioAmore closed 10 years ago

AntonioAmore commented 10 years ago

I would like to know whether collector-http can fit the requirements of the following task (I value your time and read the documentation first, but could not find some of the nuances):

A set of hundreds of web sites needs to be monitored for new (not updated) content that appears after a fixed date (starting now, or in the near future), and the detected pages' text (and only the text, stripped as much as possible of markup and ads) must be handed off to another application in some reasonable way: CSV, JSON, a socket transmission, etc. Monitoring starts at a given moment and continues for a fixed period; no real-time reaction is needed. The list of sites may be extended, but a newly added site's content should not be replicated in full: as described above, only pages that are new from the date the site was added to the list are wanted.

I don't expect configuration files/plugins/etc. to be fully written for me, but please point me in the right direction and tell me whether this is possible at all with collector-http. It is a very specific task, and none of the crawlers I checked provide this functionality out of the box, nor do their docs make it clear how to achieve it.

Thank you for your assistance; I hope this nice software helps me with my research tasks.

davidgaulin commented 10 years ago

Good morning,

Just to confirm I understand properly. You want to:

1- Monitor a bunch of web sites (100 or more), but not necessarily in real time.

2- Pick up any new pages from those web sites (new ones only, not updated ones).

3- Get the text and just the text of those new pages.

4- Be able to add (or remove) from the list of sites to monitor

If that is it, then yes, it is definitely doable with the http-collector. Is there more that I am missing?

A few quick questions before I start digging into this further:

1- Where do you want that “text” to go once it has been retrieved?

2- How big are the sites you want to monitor?

3- How often do you want those sites to be monitored? (you said not real time but is it daily, hourly, etc.)

Regards,

David

AntonioAmore commented 10 years ago

Hi David,

Thank you for the quick reply. You understood me correctly.

Please find the answers to your questions below:

  1. I have a custom-written scientific application for a kind of statistical research (I hold a PhD in Informatics), so I can write the communication pipe between the crawler and the app using some well-known technique. For example, the crawler may put pages into a database, or as XML/CSV files into a directory, and the app, triggered by cron, may then check that directory or table for new information to process.
  2. Sites on the list may vary from small sites to large portals. I don't plan to collect shop or public-library content, but rather newspapers, popular blogs, etc.
  3. The optimal monitoring period is one day.

I should note that filtering out HTML tags can also be done on my application's side, so it isn't critical.

Thanks again,

Anton

davidgaulin commented 10 years ago

"So I may write communication pipe between the crawler and the app using some well known techniques"

For the http-collector, the communication pipe between the crawler and the app would be a "committer", which you specify in the collector's config. There are three open-source committers available right now: Solr, IDOL, and Elasticsearch, but a custom one should be really easy to write if you want to go directly to your application.
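For illustration only, declaring a committer in the crawler configuration looks roughly like this; the class name below is a hypothetical placeholder for a custom committer (the Solr, IDOL, and Elasticsearch committers are declared the same way, each with its own class name):

<!-- Hypothetical custom committer class; replace with your own implementation
     or with one of the provided committers (e.g. Solr). -->
<committer class="com.example.MyApplicationCommitter" />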

For the rest, monitoring sites is pretty much what the http-collector does, and the default config file that comes with it is good enough. Simply schedule the collector with either cron or the Windows scheduler (depending on your platform) to run on a daily basis. If a site is big, it may take a while to go over it entirely; if it is small, you can schedule it more often. You can run one http-collector per site, with one config file per site, which makes it pretty simple to add or remove sites by adding or removing config files. Or you can put them all in one config file and run a single collector, to which you can add or remove sites (and restart the collector).
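As a rough sketch of the "one config file per site" approach, a minimal per-site configuration could look like the following (element names are assumed from the collector's standard <httpcollector>/<crawler> configuration structure; adjust against the sample config shipped with your version, and the committer class is the same hypothetical placeholder as above):

<!-- Minimal per-site configuration sketch; one such file per monitored site,
     launched daily by cron or the Windows scheduler. -->
<httpcollector id="news-site-example">
  <crawlers>
    <crawler id="example.com">
      <startURLs>
        <url>http://www.example.com/</url>
      </startURLs>
      <!-- Hypothetical committer; see the committer discussion above. -->
      <committer class="com.example.MyApplicationCommitter" />
    </crawler>
  </crawlers>
</httpcollector>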

As for getting only the new pages, that one I am not sure about. Without looking too deeply into the code, I would probably address it in the destination repository. In Solr, using the Solr committer for example, it would be pretty simple to run a date-range query to see when something was added or updated (new vs. updated).

Hope this helps.

David

AntonioAmore commented 10 years ago

David,

Thank you for the answers, it really helps.

Will the http-collector fetch all pages every time it runs on the same site? I'm trying to estimate the time and disk space used per large site.

Am I right to suppose that the http-collector has its own database of visited URLs, like a graph, in addition to the Solr (or other commit destination) index?

davidgaulin commented 10 years ago

1- Will the http-collector fetch all pages every time it runs on the same site? It will check whether every page has changed, yes, but it will not necessarily download each one every time.

2- Am I right to suppose that the http-collector has its own database of visited URLs, like a graph, in addition to the Solr (or other commit destination) index?

Yes, the http-collector has its own data store of what has been crawled.

AntonioAmore commented 10 years ago

Thanks a lot for your help, David,

I'm getting more familiar with collector-http, and it seems I can build the software I need with the crawler plus a custom committer.

If you get a chance to find out how to recognize a new page in the committer, please let me know; I'll also try to find it myself.

Have a nice day!

Best regards, Anton

PS: should I close the issue or mark it as solved?

martinfou commented 10 years ago

Sorry for jumping in...

If you are curious, the “default” database the collector uses to store the list of visited URLs is MapDB: http://www.mapdb.org/

Martin


essiembre commented 10 years ago

To add to David's answers about crawling only new documents: the HTTP Collector stores a checksum of each document to find out whether it has changed or is new. You can write your own checksum logic, but the one provided out of the box is usually sufficient. There are two kinds of checksum mechanisms: 1. HTTP-header based, and 2. content based. If, for instance, the last-modified date in the HTTP response headers is always valid, you can use that value to check whether a page has been modified (or is new). That way the collector won't have to download the entire content to find out, which saves bandwidth for both you and the remote web server. If you can't rely on the HTTP header date (often the case with many web sites), the content-based checksum is probably what you need.

Here are two classes that may help you get started, and how to configure them if you want to fine-tune the defaults:

HTTP Header Checksum: http://www.norconex.com/product/collector-http/apidocs/com/norconex/collector/http/checksum/impl/DefaultHttpHeadersChecksummer.html
HTTP Content Checksum: http://www.norconex.com/product/collector-http/apidocs/com/norconex/collector/http/checksum/impl/DefaultHttpDocumentChecksummer.html
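If you want to declare them explicitly in the crawler configuration, it would look along these lines (a sketch based on the classes linked above; the exact element names are an assumption and should be verified against the configuration reference for your version):

<!-- Header-based checksum: cheap, but only useful when the site sends
     reliable dates in its HTTP response headers. -->
<httpHeadersChecksummer
    class="com.norconex.collector.http.checksum.impl.DefaultHttpHeadersChecksummer" />

<!-- Content-based checksum: the usual choice when headers cannot be trusted. -->
<httpDocumentChecksummer
    class="com.norconex.collector.http.checksum.impl.DefaultHttpDocumentChecksummer" />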

AntonioAmore commented 10 years ago

Thank you, guys!

You have provided valuable information here. I may use the crawler's MapDB directly, skipping Solr to save resources, and work with the crawled data through a database's SQL-like query language. I guess the DB may have a field like 'added' or 'fetch_date'.

Also, the content-based checksum should work nicely. A custom committer will save all crawled pages as plain text to disk (of course skipping pages already crawled, based on the checksum); I'll somehow link those files to the query results and process the resulting subset of files. Then the directory can be cleared, ready for the next crawl.

It seems the architecture draft is ready, so I can create an architectural spike. Would you like to receive feedback here?

Thanks again, everyone; I like this crawler more and more :)

davidgaulin commented 10 years ago

All feedback is always welcome! And code contributions too :)

Anything you come up with that the community could enjoy, please don’t hesitate to post it back.

David

essiembre commented 10 years ago

Feedback is always appreciated! For the link database, you can switch it to an Apache Derby (for SQL) or MongoDB implementation (or even create your own):

<crawlURLDatabaseFactory 
    class="com.norconex.collector.http.db.impl.derby.DerbyCrawlURLDatabaseFactory" />

or

<crawlURLDatabaseFactory 
    class="com.norconex.collector.http.db.impl.mongo.MongoCrawlURLDatabaseFactory" />

To save crawled files on the filesystem, you may want to have a look at the FileSystemCommitter, in the base committer library (packaged with the HTTP Collector). It does just that: saves documents to a location of your choice on the filesystem. Give it a try first, in case it already suits your needs: http://www.norconex.com/product/committer/apidocs/com/norconex/committer/impl/FileSystemCommitter.html
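A configuration sketch for it could look like the following (the <directory> option name is an assumption for illustration; check the Javadoc linked above for the exact settings):

<!-- Sketch: write every committed document to a local directory, from which
     another application can pick the files up. -->
<committer class="com.norconex.committer.impl.FileSystemCommitter">
  <directory>/path/to/crawled-output</directory>
</committer>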

essiembre commented 10 years ago

I am closing this ticket since you seem to have the answers you need to get going, but you can certainly still post your feedback here. Just open another one if you face new issues or challenges.