[Caloris2] [URL Connector] Better handling of the folders structure.

lasryaric commented 9 months ago

The decision is to show the full hierarchy up to the root of the (sub)domain, and put each page inside its parent folder. Concretely, suppose we crawl the URL https://docs.dust.tt/, which results in the following crawled pages:

- https://docs.dust.tt/fr/conversations
- https://docs.dust.tt/fr/apps
- https://docs.dust.tt/fr/data_sources/
- https://docs.dust.tt/fr/data_sources/documents

We should have the following structure in the end:

/ (folder)
- fr (folder)
  - conversations (page)
  - apps (page)
  - data_sources (folder)
    - _index (page) (here it's both a page and a folder, so we need a way to convey that properly; probably just the page's HTML title should do the trick?)
    - documents (page)
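As a rough sketch of how that structure could be derived from the crawled URLs: every strict prefix of a page's path becomes a folder, and a page whose own path is also a folder gets flagged. The names `CrawledNode`, `pathSegments`, `buildHierarchy` and `parentFolder` are hypothetical, not taken from the connector's code, and placing a page-and-folder inside its own folder is an assumption based on the tree above.

```typescript
// Hypothetical shape for illustration; only the idea of a page that is
// also a folder comes from the discussion above.
interface CrawledNode {
  url: string;
  parentFolder: string; // path of the folder that contains the page, e.g. "/fr"
  isPageAndFolder: boolean; // true when other crawled pages live under this URL
}

// Split a URL's path into non-empty segments (ignores a trailing slash).
function pathSegments(url: string): string[] {
  return new URL(url).pathname.split("/").filter((s) => s.length > 0);
}

function buildHierarchy(urls: string[]): {
  folders: string[];
  pages: CrawledNode[];
} {
  // The root of the (sub)domain is always a folder.
  const folders = new Set<string>(["/"]);
  const allSegments = urls.map(pathSegments);

  // Every strict prefix of a page path is a folder.
  for (const segs of allSegments) {
    for (let i = 1; i < segs.length; i++) {
      folders.add("/" + segs.slice(0, i).join("/"));
    }
  }

  const pages: CrawledNode[] = urls.map((url, idx) => {
    const segs = allSegments[idx];
    const ownPath = "/" + segs.join("/");
    const isPageAndFolder = folders.has(ownPath);
    // Assumption: a page that is also a folder sits inside its own folder.
    const parentFolder = isPageAndFolder
      ? ownPath
      : "/" + segs.slice(0, -1).join("/");
    return { url, parentFolder, isPageAndFolder };
  });

  return { folders: Array.from(folders).sort(), pages };
}

const hierarchy = buildHierarchy([
  "https://docs.dust.tt/fr/conversations",
  "https://docs.dust.tt/fr/apps",
  "https://docs.dust.tt/fr/data_sources/",
  "https://docs.dust.tt/fr/data_sources/documents",
]);
console.log(hierarchy.folders);
```

With the four example URLs, only the `data_sources` page ends up flagged as both a page and a folder.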

We can materialize pages that are also folders with a flag in the DB (e.g. WebcrawlerPage.isPageAndFolder) and display each page with the following name:

- _index when isPageAndFolder is true.
- The last part of the URL (split by /) when isPageAndFolder is false.
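A minimal sketch of that naming rule, assuming a page record that carries its URL and the flag. `WebcrawlerPageLike` and `displayName` are hypothetical names; dropping empty segments (so a trailing slash does not yield an empty name) and falling back to "/" for the root are assumptions not spelled out above.

```typescript
// Hypothetical record shape; only isPageAndFolder is named in the proposal.
interface WebcrawlerPageLike {
  url: string;
  isPageAndFolder: boolean;
}

function displayName(page: WebcrawlerPageLike): string {
  // A page that is also a folder is shown as "_index" inside that folder.
  if (page.isPageAndFolder) {
    return "_index";
  }
  // Otherwise use the last non-empty part of the URL path.
  const parts = new URL(page.url).pathname
    .split("/")
    .filter((p) => p.length > 0);
  // Assumption: a root URL with no path segments falls back to "/".
  return parts.length > 0 ? parts[parts.length - 1] : "/";
}

console.log(
  displayName({
    url: "https://docs.dust.tt/fr/data_sources/documents",
    isPageAndFolder: false,
  })
);
```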

fontanierh commented 9 months ago

Still feels like the use case where you'd want to exclude /core itself but get all children is pretty niche, no?

lasryaric commented 9 months ago

> Still feels like the use case where you'd want to exclude /core itself but get all children is pretty niche, no?

I think in a lot of RAG cases, this can take one or two chunk spots for nothing.

spolu commented 9 months ago

> > Still feels like the use case where you'd want to exclude /core itself but get all children is pretty niche, no?
>
> I think in a lot of RAG cases, this can take one or two chunk spots for nothing.

Theoretically agreed. In practice this is exactly the case for Notion as well, and we have strong evidence that this is not an issue for our users, right?

lasryaric commented 9 months ago

OK, so let's move on with # 4.

lasryaric commented 9 months ago

I will add a few random chunks to all RAG queries then :)

lasryaric commented 9 months ago

> Theoretically agreed. In practice this is exactly the case for Notion as well, and we have strong evidence that this is not an issue for our users, right?

In Notion you usually have control over your data source; with the crawler, you have zero control. The permission screen is the only place where you get some degree of control.

dust-tt / dust

[Caloris2] [URL Connector] Better handling of the folders structure. #3166