Christoph Engelhardt wrote...
There's another alternative to storing the URLs in memory:
compute a hash for the URL and create a file on disk with the hash as its name.
Then you can just File.Exists(hash) to see whether the URL has been crawled.
That may seem crude, but since crawling is usually a network-IO-bound task, this might work quite decently.
Original comment by sjdir...@gmail.com
on 18 Apr 2013 at 9:50
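A minimal sketch of that hash-as-filename idea (illustrative only, not code from Abot or from any attachment; the class, directory, and method names are made up):
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class CrawledUrlFileMarker
{
    static readonly string MarkerDir = Path.Combine(Path.GetTempPath(), "crawledurls");

    // SHA-256 of the absolute URL, hex-encoded, becomes the file name
    static string MarkerPath(Uri uri)
    {
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(uri.AbsoluteUri));
            return Path.Combine(MarkerDir, BitConverter.ToString(hash).Replace("-", ""));
        }
    }

    public static bool HasBeenCrawled(Uri uri)
    {
        return File.Exists(MarkerPath(uri));
    }

    public static void MarkCrawled(Uri uri)
    {
        Directory.CreateDirectory(MarkerDir);
        File.Create(MarkerPath(uri)).Dispose(); // zero-byte marker file
    }
}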
The problem is keeping track of which pages were crawled. Right now Abot stores
them in memory, since the majority of Abot's users have to crawl a finite number
of sites/pages that usually falls into the tens of thousands, not millions or
billions.
interface ICrawledUrlRepository
{
    bool Contains(Uri uri);
    bool AddIfNew(Uri uri);
}
The default impl would be in memory, like it currently is, and then anyone could
create an implementation backed by their db, filesystem, distributed cache, etc.
Original comment by sjdir...@gmail.com
on 18 Apr 2013 at 10:00
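For illustration, a minimal in-memory implementation of that interface could look something like the sketch below (the class name is made up; this is not Abot's actual default):
using System;
using System.Collections.Concurrent;

public class InMemoryCrawledUrlRepository : ICrawledUrlRepository
{
    // the byte value is unused; the dictionary serves as a thread-safe set of URIs
    private readonly ConcurrentDictionary<Uri, byte> _crawledUrls =
        new ConcurrentDictionary<Uri, byte>();

    public bool Contains(Uri uri)
    {
        return _crawledUrls.ContainsKey(uri);
    }

    // returns true only for the first caller to add this uri
    public bool AddIfNew(Uri uri)
    {
        return _crawledUrls.TryAdd(uri, 0);
    }
}
A db- or file-backed version would just implement the same two methods against its own store.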
Also make maxCrawlDepth a long, or consider changing a maxCrawlDepth of 0 to
mean unlimited.
Also make maxPagesToCrawl a long, but I do believe a value of 0 is infinite.
Original comment by sjdir...@gmail.com
on 18 Apr 2013 at 10:02
maxCrawlDepth should be able to be set to -1 if they want it disabled
Original comment by sjdir...@gmail.com
on 19 Apr 2013 at 6:36
Original comment by sjdir...@gmail.com
on 27 Apr 2013 at 7:27
***If the crawl is using more than maxMemoryUsageInMb, then switch from the in-memory
list of crawled pages to the impl that uses files with hashed URLs as
the file name.
Pros:
Will still be just as fast until the limit is hit
Cons:
Could still cause an OOM exception if they set the value too high
***Another option would be to add a config value to choose either an in-memory
list or a hashed-file list.
Original comment by sjdir...@gmail.com
on 11 Jun 2013 at 11:14
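A rough sketch of the first option above: a wrapper repository that starts in memory and routes new URLs to a file-backed repository once the process crosses the memory threshold. The class name and the GC-based memory check are assumptions for illustration, not what was actually implemented:
using System;

public class MemoryThresholdCrawledUrlRepository : ICrawledUrlRepository
{
    private readonly ICrawledUrlRepository _inMemory;
    private readonly ICrawledUrlRepository _onDisk;
    private readonly int _maxMemoryUsageInMb;
    private bool _switched;

    public MemoryThresholdCrawledUrlRepository(ICrawledUrlRepository inMemory,
        ICrawledUrlRepository onDisk, int maxMemoryUsageInMb)
    {
        _inMemory = inMemory;
        _onDisk = onDisk;
        _maxMemoryUsageInMb = maxMemoryUsageInMb;
    }

    public bool Contains(Uri uri)
    {
        // after the switch a url may live in either store
        return _inMemory.Contains(uri) || _onDisk.Contains(uri);
    }

    public bool AddIfNew(Uri uri)
    {
        if (Contains(uri))
            return false;

        // crude managed-heap check; a real implementation would measure the whole process
        if (!_switched && GC.GetTotalMemory(false) / (1024 * 1024) > _maxMemoryUsageInMb)
            _switched = true;

        return _switched ? _onDisk.AddIfNew(uri) : _inMemory.AddIfNew(uri);
    }
}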
Has anyone started this? If not I would be willing to do this and submit code
in the next few days.
Original comment by ilushk...@gmail.com
on 25 Jun 2013 at 10:15
No one has started that task. Please feel free to give it a crack. Post
questions to the forum if you get stuck. Thanks for helping out!
Steven
Original comment by sjdir...@gmail.com
on 26 Jun 2013 at 2:36
Submitting code shortly for this. What do you think about limiting the in-memory
storage of URLs to crawl? That can grow very large too.
Original comment by i...@resultly.com
on 7 Jul 2013 at 1:05
Can you add permission so I can send changes directly to SVN? Here are the
files changed for this initial contribution. It does not yet support automatic
switching, but it is now an optional parameter to the FifoScheduler.
The default is the old method, which is in memory. Attached are the changed files,
based on the latest 1.2 branch.
Original comment by i...@resultly.com
on 7 Jul 2013 at 1:55
Attachments:
Here is a variation of the file URL repository without using locks, using
loops with a thread wait during reads if necessary. I believe this will be
faster and will allow writing to the filesystem without locks.
Original comment by i...@resultly.com
on 7 Jul 2013 at 3:33
Attachments:
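The attachment isn't reproduced here, but the lock-free read-retry idea might look roughly like this sketch (names and details are assumptions): each URL gets its own file whose content is the full URL, writers never take a lock, and a reader that collides with an in-progress write sleeps briefly and retries.
using System;
using System.IO;
using System.Threading;

public static class LockFreeUrlFile
{
    public static void Add(string path, Uri uri)
    {
        if (!File.Exists(path))
            File.WriteAllText(path, uri.AbsoluteUri); // no locking; last writer wins
    }

    public static bool Contains(string path, Uri uri)
    {
        for (int attempt = 0; attempt < 10; attempt++)
        {
            if (!File.Exists(path))
                return false;
            try
            {
                // storing the full url also guards against hash collisions
                return File.ReadAllText(path) == uri.AbsoluteUri;
            }
            catch (IOException)
            {
                Thread.Sleep(5); // file is mid-write; wait and retry
            }
        }
        return true; // still unreadable after retries; assume crawled
    }
}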
Original comment by sjdir...@gmail.com
on 18 Jul 2013 at 4:28
This was fixed by Ilya and will be merged in from the 1.2.2 branch.
Original comment by sjdir...@gmail.com
on 19 Jul 2013 at 7:25
Original comment by sjdir...@gmail.com
on 3 Sep 2013 at 1:47
Created an OnDiskCrawledUrlRepository using a watered-down implementation of
Ilya's submission. Added an integration test that uses it...
Crawl_UsingOnDiskCrawledUrlRepository_VerifyCrawlResultIsAsExpected
Use it like this...
using (var onDiskUrlRepo = new OnDiskCrawledUrlRepository(null, null, true))
{
    Scheduler onDiskScheduler = new Scheduler(false, onDiskUrlRepo, new FifoPagesToCrawlRepository());
    IWebCrawler crawler = new PoliteWebCrawler(null, null, null, onDiskScheduler, null, null, null, null, null);
    crawler.Crawl(new Uri("http://a.com"));
}
//The using statement above makes sure Dispose() is called on the OnDiskCrawledUrlRepository class which deletes the directories.
Original comment by sjdir...@gmail.com
on 20 Jan 2014 at 3:58
Original comment by sjdir...@gmail.com
on 20 Jan 2014 at 4:02
Original issue reported on code.google.com by
sjdir...@gmail.com
on 18 Apr 2013 at 9:45