dust-tt / dust

Amplify your team's potential with customizable and secure AI assistants.
https://dust.tt
MIT License

[Caloris2] [URL Connector] Better handling of the folders structure. #3166

Closed: lasryaric closed 8 months ago

lasryaric commented 9 months ago

The decision is to show the full hierarchy up to the root of the (sub)domain and put each page inside its parent folder. Concretely, if we start from the URL https://docs.dust.tt/ and it results in the following crawled pages:

- https://docs.dust.tt/fr/conversations
- https://docs.dust.tt/fr/apps
- https://docs.dust.tt/fr/data_sources/
- https://docs.dust.tt/fr/data_sources/documents

We should have the following structure in the end:

We can materialize the pages that are also a folder with a flag in the DB (eg: WebcrawlerPage.isPageAndFolder) and display them with the following name:
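
To make the proposal concrete, here is a minimal TypeScript sketch of deriving the folder hierarchy and the page-and-folder flag from a list of crawled URLs. It is not the connector's actual code: `buildTree`, `TreeNode`, and its fields are hypothetical names, and only the `isPageAndFolder` idea comes from the comment above. The sketch assumes a trailing slash carries no meaning, so `/fr/data_sources/` and `/fr/data_sources` map to the same node.

```typescript
// Hypothetical node shape for illustration; not the actual Dust schema.
interface TreeNode {
  url: string; // normalized URL, without trailing slash
  parentUrl: string | null; // null for the root of the (sub)domain
  isPage: boolean; // the crawler fetched this exact URL
  isFolder: boolean; // at least one other node lives under this URL
  isPageAndFolder: boolean; // both of the above
}

function buildTree(crawledUrls: string[]): Map<string, TreeNode> {
  const nodes = new Map<string, TreeNode>();

  // Return the node for a URL, creating an empty one on first sight.
  const getOrCreate = (url: string, parentUrl: string | null): TreeNode => {
    let node = nodes.get(url);
    if (!node) {
      node = {
        url,
        parentUrl,
        isPage: false,
        isFolder: false,
        isPageAndFolder: false,
      };
      nodes.set(url, node);
    }
    return node;
  };

  for (const raw of crawledUrls) {
    const parsed = new URL(raw);
    // Assumption: empty segments (leading/trailing slashes) are ignored.
    const segments = parsed.pathname.split("/").filter((s) => s.length > 0);

    // Walk from the root of the (sub)domain down to the page itself,
    // materializing every ancestor as a folder.
    let parentUrl: string | null = null;
    for (let depth = 0; depth <= segments.length; depth++) {
      const url: string =
        depth === 0
          ? parsed.origin
          : `${parsed.origin}/${segments.slice(0, depth).join("/")}`;
      const node = getOrCreate(url, parentUrl);
      if (depth < segments.length) {
        node.isFolder = true; // an ancestor of the crawled page
      } else {
        node.isPage = true; // the crawled page itself
      }
      parentUrl = url;
    }
  }

  // A node that was both crawled and has children gets the flag.
  nodes.forEach((node) => {
    node.isPageAndFolder = node.isPage && node.isFolder;
  });
  return nodes;
}

// Usage with the four crawled pages listed above.
const tree = buildTree([
  "https://docs.dust.tt/fr/conversations",
  "https://docs.dust.tt/fr/apps",
  "https://docs.dust.tt/fr/data_sources/",
  "https://docs.dust.tt/fr/data_sources/documents",
]);
```

With these inputs, `https://docs.dust.tt` and `https://docs.dust.tt/fr` come out as plain folders, while `https://docs.dust.tt/fr/data_sources` is the only node flagged as both a page and a folder.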

fontanierh commented 9 months ago

Still feels like a use case where you'd want to exclude /core itself but get all children is pretty niche, no?

lasryaric commented 9 months ago

> Still feels like a use case where you'd want to exclude /core itself but get all children is pretty niche, no?

I think in a lot of RAG cases, this can take one or two chunk spots for nothing.

spolu commented 9 months ago

> > Still feels like a use case where you'd want to exclude /core itself but get all children is pretty niche, no?
>
> I think in a lot of RAG cases, this can take one or two chunk spots for nothing.

Theoretically agreed. In practice, this is exactly the case for Notion as well, and we have strong evidence that it is not an issue for our users, right?

lasryaric commented 9 months ago

OK, so let's move on with # 4.

lasryaric commented 9 months ago

I will add a few random chunks to all RAG queries then :)

lasryaric commented 9 months ago

> Theoretically agreed. In practice, this is exactly the case for Notion as well, and we have strong evidence that it is not an issue for our users, right?

In Notion you usually have control over your data source; with the crawler, you have zero control. The permission screen is the only place where you get some degree of control.