Crawler strategy classes

GoogleCodeExporter commented 9 years ago

Currently, every crawl option has to be fine-tuned one param at a time.
There is no concept of "crawler strategies" that would serve as a logical
umbrella for related params.

A parallel is the "Profiles" feature on mobile phones, which groups
related settings into a specific behavior. Similarly, harvestman can
implement behavior groupings which provide specific behavior by grouping
params together in a logical way.

Examples.

1. DomainDirectoryCrawl (fetchlevel: 0) 
2. DomainCrawl          (fetchlevel: 1)
3. FirstLevelExternalDomainCrawl (fetchlevel: 2)
4. RandomWalkCrawl (Do a random walk...)

This should also implement "CrawlControl" classes which will provide
groupings of params under the <control>...</control> element. 

Each of these strategy classes will accept optional params for filtering.
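
A minimal sketch of how such strategy classes might look, written in
modern Python. Only the three class names and the fetchlevel values come
from the list above; the CrawlStrategy base class, the url_filters
argument, the params() method and the "urlfilters" key are hypothetical
names used for illustration and are not part of the existing harvestman
API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CrawlStrategy:
    """Hypothetical base class: a named bundle of crawl params."""
    fetchlevel: int = 0
    # Optional filter patterns (assumed name and format, illustration only).
    url_filters: List[str] = field(default_factory=list)

    def params(self) -> Dict[str, object]:
        """Flatten the strategy into the individual params it stands for."""
        return {"fetchlevel": self.fetchlevel,
                "urlfilters": list(self.url_filters)}


@dataclass
class DomainDirectoryCrawl(CrawlStrategy):
    fetchlevel: int = 0


@dataclass
class DomainCrawl(CrawlStrategy):
    fetchlevel: int = 1


@dataclass
class FirstLevelExternalDomainCrawl(CrawlStrategy):
    fetchlevel: int = 2


# A developer picks a strategy instead of setting each param by hand.
strategy = DomainCrawl(url_filters=["*.pdf"])
settings = strategy.params()
```

The point of the design is that calling code names the behavior it wants
and the class carries the param values, so the per-param tuning stays in
one place.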

The main aim of this feature is to improve developer usability and
programmatic crawling. It does not help the non-expert user in any way.

Original issue reported on code.google.com by abpil...@gmail.com on 23 Jun 2008 at 2:31

GoogleCodeExporter commented 9 years ago

What about a memory profile option:
1. MemoryUsage: 0-3 (how much memory the algorithm is allowed to use). I am
having problems with the crawler on large websites: it hangs once it reaches
a high memory consumption.

Original comment by andrei.p...@gmail.com on 16 Jul 2008 at 6:04

GoogleCodeExporter commented 9 years ago

Have you tried the crawler recently? I recently fixed a bug, and the fix
flushes downloaded data to temporary files on disk while reading data from
the web. Until now, a read was just a single "read()" on the URL file object.
With this fix, I added a new HarvestManFileObject which reads data block by
block and flushes it to temporary files on disk. Once the read is complete,
the temporary files are moved to the final destination.
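
A rough sketch of the block-by-block idea, in modern Python. This is not
the actual HarvestManFileObject code: the function name save_in_blocks,
the block_size default and the use of a single temporary file are
assumptions made to keep the example short.

```python
import os
import shutil
import tempfile


def save_in_blocks(source, dest_path, block_size=8192):
    """Copy a readable binary file object to dest_path block by block.

    Only one block is held in memory at a time. Data goes to a temporary
    file first and is moved to dest_path once the read has completed.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            while True:
                block = source.read(block_size)
                if not block:  # empty bytes means end of stream
                    break
                tmp.write(block)
        shutil.move(tmp_path, dest_path)
    except BaseException:
        # Do not leave a partial temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest_path
```

Compared with a single "read()", peak memory here is bounded by
block_size rather than by the size of the downloaded resource.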

This should fix most of the memory problems. Let me know if you find an
improvement in memory with this fix. You need to sync your code with
subversion.

Btw, this is issue #6.

Original comment by abpil...@gmail.com on 17 Jul 2008 at 5:15

GoogleCodeExporter commented 9 years ago

I am reducing the priority of this to low, since I am no longer working on
this and it is a developer feature anyway.

Original comment by abpil...@gmail.com on 6 Oct 2008 at 11:37

Added labels: Priority-Low
Removed labels: Priority-High
