Christoph Engelhardt wrote...
There's another alternative to storing the URLs in memory:
compute a hash for the URL and create a file on disk with the hash as its name.
Then you can just File.Exists(hash) to see whether the URL has been crawled.
That may seem crude, but since crawling is usually a network-IO-bound task, this might work quite decently.
Original comment by sjdir...@gmail.com
on 18 Apr 2013 at 9:50
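A minimal sketch of that hash-as-filename idea (illustrative only, not code from Abot or from any attachment; the class, directory, and method names are made up):
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class CrawledUrlFileMarker
{
    static readonly string MarkerDir = Path.Combine(Path.GetTempPath(), "crawledurls");

    // SHA-256 of the absolute URL, hex-encoded, becomes the file name
    static string MarkerPath(Uri uri)
    {
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(uri.AbsoluteUri));
            return Path.Combine(MarkerDir, BitConverter.ToString(hash).Replace("-", ""));
        }
    }

    public static bool HasBeenCrawled(Uri uri)
    {
        return File.Exists(MarkerPath(uri));
    }

    public static void MarkCrawled(Uri uri)
    {
        Directory.CreateDirectory(MarkerDir);
        File.Create(MarkerPath(uri)).Dispose(); // zero-byte marker file
    }
}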
The problem is keeping track of which pages were crawled. Right now Abot stores
them in memory, since the majority of Abot's users have to crawl a finite number
of sites/pages that usually falls into the tens of thousands, not millions or
billions.
interface ICrawledUrlRepository
{
    bool Contains(Uri uri);
    bool AddIfNew(Uri uri);
}
The default impl would be in memory, like it currently is, and then anyone could
create an implementation backed by their db, filesystem, distributed cache, etc.
Original comment by sjdir...@gmail.com
on 18 Apr 2013 at 10:00
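For illustration, a minimal in-memory implementation of that interface could look something like the sketch below (the class name is made up; this is not Abot's actual default):
using System;
using System.Collections.Concurrent;

public class InMemoryCrawledUrlRepository : ICrawledUrlRepository
{
    // the byte value is unused; the dictionary serves as a thread-safe set of URIs
    private readonly ConcurrentDictionary<Uri, byte> _crawledUrls =
        new ConcurrentDictionary<Uri, byte>();

    public bool Contains(Uri uri)
    {
        return _crawledUrls.ContainsKey(uri);
    }

    // returns true only for the first caller to add this uri
    public bool AddIfNew(Uri uri)
    {
        return _crawledUrls.TryAdd(uri, 0);
    }
}
A db- or file-backed version would just implement the same two methods against its own store.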
Also make maxCrawlDepth a long, or consider changing a maxCrawlDepth of 0 to
mean unlimited.
Also make maxPagesToCrawl a long, but I do believe a value of 0 is infinite.
Original comment by sjdir...@gmail.com
on 18 Apr 2013 at 10:02
maxCrawlDepth should be able to be set to -1 if they want it disabled
Original comment by sjdir...@gmail.com
on 19 Apr 2013 at 6:36
Original comment by sjdir...@gmail.com
on 27 Apr 2013 at 7:27
***If the crawl is using more than maxMemoryUsageInMb, then switch from the in-memory
list of crawled pages to the impl that uses files with hashed URLs as
the file name.
Pros:
Will still be just as fast until the limit is hit
Cons:
Could still cause an OOM exception if they set the value too high
***Another option would be to add a config value to choose either an in-memory
list or a hashed-file list.
Original comment by sjdir...@gmail.com
on 11 Jun 2013 at 11:14
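A rough sketch of the first option above: a wrapper repository that starts in memory and routes new URLs to a file-backed repository once the process crosses the memory threshold. The class name and the GC-based memory check are assumptions for illustration, not what was actually implemented:
using System;

public class MemoryThresholdCrawledUrlRepository : ICrawledUrlRepository
{
    private readonly ICrawledUrlRepository _inMemory;
    private readonly ICrawledUrlRepository _onDisk;
    private readonly int _maxMemoryUsageInMb;
    private bool _switched;

    public MemoryThresholdCrawledUrlRepository(ICrawledUrlRepository inMemory,
        ICrawledUrlRepository onDisk, int maxMemoryUsageInMb)
    {
        _inMemory = inMemory;
        _onDisk = onDisk;
        _maxMemoryUsageInMb = maxMemoryUsageInMb;
    }

    public bool Contains(Uri uri)
    {
        // after the switch a url may live in either store
        return _inMemory.Contains(uri) || _onDisk.Contains(uri);
    }

    public bool AddIfNew(Uri uri)
    {
        if (Contains(uri))
            return false;

        // crude managed-heap check; a real implementation would measure the whole process
        if (!_switched && GC.GetTotalMemory(false) / (1024 * 1024) > _maxMemoryUsageInMb)
            _switched = true;

        return _switched ? _onDisk.AddIfNew(uri) : _inMemory.AddIfNew(uri);
    }
}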
Has anyone started this? If not I would be willing to do this and submit code
in the next few days.
Original comment by ilushk...@gmail.com
on 25 Jun 2013 at 10:15
No one has started that task. Please feel free to give it a crack. Post
questions to the forum if you get stuck. Thanks for helping out!
Steven
Original comment by sjdir...@gmail.com
on 26 Jun 2013 at 2:36
Submitting code shortly for this. What do you think about limiting the in-memory
storage of URLs to crawl? That can grow very large too.
Original comment by i...@resultly.com
on 7 Jul 2013 at 1:05
Can you add permission so I can send changes directly to SVN? Here are the
files changed for this initial contribution. It does not yet support automatic
switching, but it is now an optional parameter to the FifoScheduler.
The default is the old method, which is in memory. Attached are the changed files,
based on the latest 1.2 branch.
Original comment by i...@resultly.com
on 7 Jul 2013 at 1:55
Attachments:
Here is a variation of the file URL repository without using locks, using
loops with a thread wait during reads if necessary. I believe this will be
faster and will allow writing to the filesystem without locks.
Original comment by i...@resultly.com
on 7 Jul 2013 at 3:33
Attachments:
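The attachment isn't reproduced here, but the lock-free read-retry idea might look roughly like this sketch (names and details are assumptions): each URL gets its own file whose content is the full URL, writers never take a lock, and a reader that collides with an in-progress write sleeps briefly and retries.
using System;
using System.IO;
using System.Threading;

public static class LockFreeUrlFile
{
    public static void Add(string path, Uri uri)
    {
        if (!File.Exists(path))
            File.WriteAllText(path, uri.AbsoluteUri); // no locking; last writer wins
    }

    public static bool Contains(string path, Uri uri)
    {
        for (int attempt = 0; attempt < 10; attempt++)
        {
            if (!File.Exists(path))
                return false;
            try
            {
                // storing the full url also guards against hash collisions
                return File.ReadAllText(path) == uri.AbsoluteUri;
            }
            catch (IOException)
            {
                Thread.Sleep(5); // file is mid-write; wait and retry
            }
        }
        return true; // still unreadable after retries; assume crawled
    }
}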
Original comment by sjdir...@gmail.com
on 18 Jul 2013 at 4:28
This was fixed by Ilya and will be merged in from the 1.2.2 branch.
Original comment by sjdir...@gmail.com
on 19 Jul 2013 at 7:25
Original comment by sjdir...@gmail.com
on 3 Sep 2013 at 1:47
Created an OnDiskCrawledUrlRepository using a watered-down implementation of
Ilya's submission. Added an integration test that uses it...
Crawl_UsingOnDiskCrawledUrlRepository_VerifyCrawlResultIsAsExpected
Use it like this...
using (var onDiskUrlRepo = new OnDiskCrawledUrlRepository(null, null, true))
{
    Scheduler onDiskScheduler = new Scheduler(false, onDiskUrlRepo, new FifoPagesToCrawlRepository());
    IWebCrawler crawler = new PoliteWebCrawler(null, null, null, onDiskScheduler, null, null, null, null, null);
    crawler.Crawl(new Uri("http://a.com"));
}
//The using statement above makes sure Dispose() is called on the OnDiskCrawledUrlRepository class which deletes the directories.
Original comment by sjdir...@gmail.com
on 20 Jan 2014 at 3:58
Original comment by sjdir...@gmail.com
on 20 Jan 2014 at 4:02
Original issue reported on code.google.com by
sjdir...@gmail.com
on 18 Apr 2013 at 9:45