Open YazdanJahedi opened 3 months ago
This is an interesting request. Do you mean like `acme.com/foo/bar/baz` is depth=3, because of the 3 segments in the path? Or do you mean you want to know the crawl depth, as in how many links had to be hopped from an entry point before a given page was found?
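For reference, the first reading (path-segment depth) could be computed from the URL alone. This is only a sketch: `path_depth` is a hypothetical helper name, not something the crawler provides, and it assumes that empty segments and the query string should not count.

```python
from urllib.parse import urlsplit


def path_depth(url: str) -> int:
    """Count the non-empty segments in a URL's path (hypothetical helper)."""
    # urlsplit only recognises the host when a scheme or '//' is present,
    # so prepend '//' to bare inputs such as 'acme.com/foo/bar/baz'.
    if "//" not in url:
        url = "//" + url
    path = urlsplit(url).path
    # Skip empty segments produced by leading, trailing, or doubled slashes.
    return sum(1 for segment in path.split("/") if segment)


print(path_depth("acme.com/foo/bar/baz"))  # 3
```

Under these assumptions a root URL such as `https://acme.com/` has a path depth of 0, which differs from the crawl-depth numbering discussed in the rest of this thread.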
Actually, I meant the second one, but the first criterion you described can also be useful in some cases.
I'm not sure we'd be able to get a meaningful value for crawl depth. Let's imagine a site like:

- `Index`
  - `page1`
    - `subpageA`
      - `subsubpageB`
  - `page2`

where the entry point is `index`:

- `index` links to `page1` and `page2`
- `page1` links to `subpageA`
- `subpageA` links to `subsubpageB`
- `subsubpageB` links to `page2`

What depth should `page2` be at? Logically, I'd think you'd expect `2`, since the shortest path to it is `Index` -> `page2`.
However, if the crawler uses a stack to add pages to the frontier, you'd get a depth of `5`, with `Index` -> `page1` -> `subpageA` -> `subsubpageB` -> `page2`. (We use a queue today, not a stack, but I'd be anxious about a feature that really depends on consistent pathing through the crawl.)
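To make the ordering concern concrete, here is a small simulation of the example site. It is a sketch under several assumptions, not the crawler's actual frontier logic: the entry point counts as depth 1, a page's depth is recorded when it is first fetched, the stack variant follows the first link on a page first, and `LINKS` and `crawl_depths` are made-up names.

```python
from collections import deque

# Hypothetical link graph for the example site above.
LINKS = {
    "index": ["page1", "page2"],
    "page1": ["subpageA"],
    "subpageA": ["subsubpageB"],
    "subsubpageB": ["page2"],
    "page2": [],
}


def crawl_depths(links, entry, use_stack):
    """Return {page: depth}, with depth recorded when a page is first fetched."""
    frontier = deque([(entry, 1)])  # assumption: the entry point is depth 1
    depths = {}
    while frontier:
        # A stack takes the newest frontier entry, a queue takes the oldest.
        page, depth = frontier.pop() if use_stack else frontier.popleft()
        if page in depths:
            continue  # already fetched via another path
        depths[page] = depth
        outlinks = links.get(page, [])
        # For the stack, push in reverse so the first link on the page ends up on top.
        for target in (reversed(outlinks) if use_stack else outlinks):
            frontier.append((target, depth + 1))
    return depths


print(crawl_depths(LINKS, "index", use_stack=False)["page2"])  # 2
print(crawl_depths(LINKS, "index", use_stack=True)["page2"])   # 5
```

With a queue the traversal is breadth-first, so the first fetch of a page happens along a shortest path from the entry point. With a stack, the recorded depth depends on which link happens to be followed first.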
Or regardless of site structure, if the website has a good sitemap.xml that lists all pages, no matter where the page is located or how it's linked, you'd always get a depth of `1`.
In fact, I meant the shortest path to reach a page. In your example above, the depth of `page2` would be 2.
The reasoning is that website designers usually try to make the most important pages the most accessible, so that users can reach them in fewer clicks.
For example, if there is a link to `page2` on the `index` (root page), it means that `page2` is probably more important than `subpageA`.
Now, if we crawl depth by depth, we can determine how deep each page is from the root. This field could be useful both during crawling and at search time, when a search engine looks for the most relevant page (for example, we could boost on the depth field in Elasticsearch to surface more relevant pages).
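As a rough illustration of the search-time idea, the query below uses Elasticsearch's `function_score` query with `field_value_factor` to multiply each hit's score by the reciprocal of its depth. The field names `page_depth` and `body`, and the helper `depth_boosted_query`, are assumptions made for this sketch. It also assumes depths start at 1, so the reciprocal is always defined.

```python
import json


def depth_boosted_query(text, depth_field="page_depth", text_field="body"):
    """Build a search body that favours shallow pages (hypothetical field names)."""
    return {
        "query": {
            "function_score": {
                # Ordinary full-text relevance for the search terms.
                "query": {"match": {text_field: text}},
                # Multiply the score by 1 / depth: depth 1 keeps the full
                # score, depth 2 keeps half of it, and so on.
                "field_value_factor": {
                    "field": depth_field,
                    "modifier": "reciprocal",
                    "missing": 1,  # documents without a depth are not penalised
                },
                "boost_mode": "multiply",
            }
        }
    }


print(json.dumps(depth_boosted_query("pricing"), indent=2))
```

A plain reciprocal is a fairly aggressive penalty. A gentler curve may suit real data better, which is something to tune against actual search results rather than decide up front.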
Problem Description
The depth of a web page within a domain often correlates with the importance of that page. Using this data, you can get better, more relevant search results.
Proposed Solution
Add a "page-depth" field that records each page's depth when crawling from the root of a domain.