Letractively / abot

Automatically exported from code.google.com/p/abot
Apache License 2.0

Implement crawl depth #37

Status: Closed (GoogleCodeExporter closed 9 years ago)

GoogleCodeExporter commented 9 years ago
Implement crawl depth

Original issue reported on code.google.com by sjdir...@gmail.com on 21 Nov 2012 at 11:13

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 22 Nov 2012 at 8:13

GoogleCodeExporter commented 9 years ago
Not sure I want to implement this. To implement it properly I would have to make
too many assumptions about what Abot would be used for. If the crawl decisions
use crawl depth to determine which pages to crawl and whether or not to crawl a
page's links, you would also need to know whether the end user cares which path
led the crawl to that domain. For example...

If we crawled... 
a.com (depth 0) 
  a.com has a link to b.com (depth 1)
  a.com has a link to c.com (depth 1)
    c.com has a link to b.com (depth 1 or 2?? or both)

From the tree above you can see that b.com could be considered a depth of 1 
or 2 or both. I would rather not implement this feature than make an 
assumption.
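
The ambiguity can be reproduced with a small sketch. This is not Abot code; the `LINKS` graph and the `depths_by_path` helper are hypothetical names used only to illustrate the a.com / b.com / c.com example, assuming depth is counted as the number of links followed from the start page.

```python
from collections import defaultdict

# Hypothetical link graph mirroring the example above.
LINKS = {
    "a.com": ["b.com", "c.com"],
    "c.com": ["b.com"],
    "b.com": [],
}


def depths_by_path(links, root):
    """Return every depth at which each site can be reached from root."""
    found = defaultdict(set)

    def walk(site, depth, path):
        found[site].add(depth)
        for target in links.get(site, []):
            if target not in path:  # do not follow a link back into the current path
                walk(target, depth + 1, path | {target})

    walk(root, 0, {root})
    return {site: sorted(depths) for site, depths in found.items()}


result = depths_by_path(LINKS, "a.com")
print(result)  # {'a.com': [0], 'b.com': [1, 2], 'c.com': [1]}
```

b.com comes back with two depths, so any depth-based crawl decision has to pick a rule: the shortest path, the first path discovered, or every path.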

Original comment by sjdir...@gmail.com on 22 Nov 2012 at 10:55

GoogleCodeExporter commented 9 years ago
Is the Depth indication working now? I am seeing Depths of only 0 and 1. I 
think the basic Depth indication could apply only to links within the targeted 
domain. For foreign links, a depth indication would be good, but my 
suggestion is to start the depth count over, in effect creating a hierarchy of 
depths, with each foreign link from the target domain being the root of a 
branch. My specific need is to crawl only within a domain, no foreign links, up 
to pages that link to pages linked to the home page (as I understand it, that 
would be up to Depth 3). Is it possible to restrict the search in this manner 
now?
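
The two ideas in this comment (a per-domain depth that restarts at each foreign link, and a crawl restricted to one domain up to a maximum depth) might look roughly like the sketch below. It is not Abot's API; `same_host`, `child_depth`, and `should_crawl` are hypothetical helpers, and comparing hostnames is only one possible definition of "within a domain".

```python
from urllib.parse import urlparse


def same_host(url_a, url_b):
    """True when both URLs point at the same hostname."""
    return urlparse(url_a).hostname == urlparse(url_b).hostname


def child_depth(parent_url, parent_depth, child_url):
    """Depth for a newly found link; a foreign link starts a new branch at 0."""
    if same_host(parent_url, child_url):
        return parent_depth + 1
    return 0


def should_crawl(url, depth, root_url, max_depth):
    """Crawl only pages on the root's host, and no deeper than max_depth."""
    return same_host(url, root_url) and depth <= max_depth


# Hypothetical URLs: stay on a.com, stop after depth 3.
print(should_crawl("http://a.com/products", 2, "http://a.com/", 3))
print(should_crawl("http://b.com/", 1, "http://a.com/", 3))
print(child_depth("http://a.com/", 2, "http://b.com/"))
```

Keeping the depth rule and the crawl decision in separate functions means either one can be swapped without touching the other.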

Original comment by smsmith...@aol.com on 17 Feb 2014 at 9:48

GoogleCodeExporter commented 9 years ago
As of now, the depth is calculated from the root domain where the crawl 
was started, i.e. 0 is the homepage, 1 is any page that was found because 
its link was on the homepage, and so on.
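
A minimal sketch of this rule, assuming pages are processed breadth-first and each page keeps the depth at which it was first discovered. The `SITE` graph and `crawl_depths` function are hypothetical and do not reflect Abot's internals.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
SITE = {
    "/": ["/about", "/products"],
    "/about": [],
    "/products": ["/products/widget", "/about"],
    "/products/widget": ["/"],
}


def crawl_depths(links, root):
    """Assign each page the depth at which it was first discovered."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # keep the first (shallowest) depth
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


depths = crawl_depths(SITE, "/")
print(depths)  # {'/': 0, '/about': 1, '/products': 1, '/products/widget': 2}
```

Under this rule a page reachable by several paths, such as `/about` here, keeps the depth of the first path that reached it.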

Original comment by sjdir...@gmail.com on 17 Feb 2014 at 11:08