elastic / crawler

Add "depth" field #52

Open YazdanJahedi opened 3 months ago

YazdanJahedi commented 3 months ago

Problem Description

The depth of a web page within a domain often correlates with how important that page is. Using this data, you can get better, more relevant search results.

Proposed Solution

Add a "page-depth" field that refers to page depth when you are crawling from the root of a domain.

seanstory commented 3 months ago

This is an interesting request. Do you mean like acme.com/foo/bar/baz is depth=3, because of the 3 segments in the path? Or do you mean you want to know the crawl depth, as in how many links had to be hopped from an entry point before a given page was found?

YazdanJahedi commented 3 months ago

Actually, I meant the second one, but the first criterion you described could also be useful in some cases.
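
For reference, the first interpretation (path-segment depth) can be computed from the URL alone. The snippet below is only a rough Python sketch, not code from the crawler: `path_depth` is a hypothetical helper name, and it assumes the URL includes a scheme such as `https://`, since `urlsplit` would otherwise treat the host as part of the path.

```python
from urllib.parse import urlsplit


def path_depth(url: str) -> int:
    """Count the non-empty path segments of a URL (hypothetical helper).

    Assumes the URL carries a scheme; query strings and fragments are ignored.
    """
    path = urlsplit(url).path
    # Splitting on "/" yields empty strings for leading, trailing, or
    # doubled slashes, so only non-empty segments are counted.
    return len([segment for segment in path.split("/") if segment])


example_depth = path_depth("https://acme.com/foo/bar/baz")
```

With the example from the earlier comment, `example_depth` is 3, and a bare root URL such as `https://acme.com/` gives 0.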

seanstory commented 3 months ago

I'm not sure we'd be able to get a meaningful value for crawl depth. Let's imagine a site like:

Index
page1
  subpageA
    subsubpageB
page2

where the entry point is Index.

What depth should page2 be at?

Logically, I'd think you'd expect 2, since the shortest path to it is Index -> page2.

However, if the crawler uses a stack to add pages to the frontier, you'd get a depth of 5, with Index -> page1 -> subpageA -> subsubpageB -> page2. (Today we use a queue, not a stack, but I'd be anxious about a feature that really depends on consistent pathing through the crawl.)

Or regardless of site structure, if the website has a good sitemap.xml that lists all pages, no matter where the page is located or how it's linked, you'd always get a depth of 1.

YazdanJahedi commented 3 months ago

In fact, I meant the shortest path to reach a page. In your example above, the depth of page2 would be 2.
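
To make the shortest-path definition concrete, here is a minimal Python sketch of a breadth-first traversal over a link graph. It is an illustration only, not how the crawler is implemented: `crawl_depths` and the `site` dictionary are hypothetical, the entry point is counted as depth 1 to match the numbers used in this thread, and the link from subsubpageB to page2 is an assumption taken from the stack example above.

```python
from collections import deque


def crawl_depths(links, entry_point):
    """Return the shortest-path depth of every page reachable from entry_point.

    `links` maps a page to the pages it links to. The entry point is depth 1.
    """
    depths = {entry_point: 1}
    queue = deque([entry_point])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            # Breadth-first order means the first time a page is seen is
            # also via a shortest path, so its depth is never overwritten.
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical link graph for the example site in this thread.
site = {
    "Index": ["page1", "page2"],
    "page1": ["subpageA"],
    "subpageA": ["subsubpageB"],
    "subsubpageB": ["page2"],  # assumed link, as in the stack example
}
depths = crawl_depths(site, "Index")
```

Here `depths["page2"]` is 2 even though page2 is also reachable through subsubpageB, because the queue visits every depth-2 page before any deeper one.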

The reasoning behind this idea is that websites are usually designed so that the most important pages are the most accessible, letting users reach them in fewer clicks.

For example, if there is a link to page2 on the index (root page), it means that page2 is probably more important than subpageA. Now, if we go depth-by-depth during crawling, we can determine how deep each page is from the root. This field can be very useful, both during crawling and when searching in a search engine to find the most relevant page. (For example, we can boost on the depth field in Elasticsearch to find more relevant pages.)
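
As a rough idea of what the search-side boost could look like, the Python sketch below builds an Elasticsearch `function_score` query body that multiplies each hit's score by the reciprocal of its depth. The field names `page_depth` and `body` and the helper `depth_boosted_query` are assumptions for illustration; the crawler does not emit a depth field today.

```python
def depth_boosted_query(text, depth_field="page_depth", text_field="body"):
    """Build a query body that favors shallow pages (hypothetical helper).

    Both field names are assumptions, not fields the crawler is known to index.
    """
    return {
        "query": {
            "function_score": {
                "query": {"match": {text_field: text}},
                "field_value_factor": {
                    "field": depth_field,
                    # reciprocal turns depth d into a multiplier of 1/d
                    "modifier": "reciprocal",
                    # documents without a depth value are treated as depth 1
                    "missing": 1,
                },
                "boost_mode": "multiply",
            }
        }
    }


query_body = depth_boosted_query("pricing")
```

The reciprocal modifier assumes depths start at 1, so a depth-1 page keeps its full score and a depth-4 page is scaled to a quarter of it. A gentler curve may be preferable in practice.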