Closed: lasryaric closed this 8 months ago
Still feels like the use case where you'd want to exclude /core itself but get all of its children is pretty niche, no?
> Still feels like the use case where you'd want to exclude /core itself but get all of its children is pretty niche, no?

I think in a lot of RAG cases, this can take one or two chunk spots for nothing.
> Still feels like the use case where you'd want to exclude /core itself but get all of its children is pretty niche, no?
>
> I think in a lot of RAG cases, this can take one or two chunk spots for nothing.

Theoretically agreed. In practice this is exactly the case for Notion as well, and we have strong evidence that this is not an issue for our users, right?
Ok, so let's move on with #4.
I will add a few random chunks to all RAG queries then :)
> Theoretically agreed. In practice this is exactly the case for Notion as well, and we have strong evidence that this is not an issue for our users, right?

In Notion you usually have control over your data source; with the crawler, you have zero control. The permission screen is the only place where you get some degree of control.
The decision is to show the full hierarchy up to the root of the (sub)domain, and to put each page inside its parent folder. So concretely, if we have the following URL:
https://docs.dust.tt/
which results in a set of crawled pages, we should have the following structure in the end:
- `/` (folder)
- `fr` (folder)
- `conversations` (page)
- `apps` (page)
- `data_sources` (folder)
- `_index` (page; here it's both a page and a folder, so we need a way to properly convey that. Probably just the page HTML title should do the trick?)
- `documents` (page)
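
For illustration, here is a minimal sketch of how such a structure could be derived from the crawled URLs; the `Node` shape and `buildHierarchy` are hypothetical names for this sketch, not the actual connector code:

```typescript
type Node = {
  name: string;
  isPage: boolean; // a page was crawled at this exact path
  isFolder: boolean; // at least one crawled page lives below this path
};

// Build a flat map of path -> node from crawled URLs, marking the
// nodes that end up being both a page and a folder.
function buildHierarchy(urls: string[]): Map<string, Node> {
  const nodes = new Map<string, Node>();
  const getOrCreate = (path: string, name: string): Node => {
    let n = nodes.get(path);
    if (!n) {
      n = { name, isPage: false, isFolder: false };
      nodes.set(path, n);
    }
    return n;
  };
  for (const url of urls) {
    const { pathname } = new URL(url);
    const parts = pathname.split("/").filter(Boolean);
    // Every ancestor segment of a crawled page is a folder.
    let path = "";
    for (const part of parts.slice(0, -1)) {
      path += `/${part}`;
      getOrCreate(path, part).isFolder = true;
    }
    // The page itself; the root of the (sub)domain is displayed as "/".
    const pagePath = parts.length > 0 ? `/${parts.join("/")}` : "/";
    const pageName = parts.length > 0 ? parts[parts.length - 1] : "/";
    getOrCreate(pagePath, pageName).isPage = true;
  }
  return nodes;
}
```

A node that ends up with both `isPage` and `isFolder` set is exactly the page-and-folder case discussed below.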
We can materialize the pages that are also a folder with a flag in the DB (e.g. `WebcrawlerPage.isPageAndFolder`) and display them with the following name:

- `_index` when `isPageAndFolder` is true.
- `/` when `isPageAndFolder` is false.
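
As a sketch of that display rule (the `isPageAndFolder` flag comes from the thread; `displayName` and the minimal `WebcrawlerPage` shape are assumptions for illustration):

```typescript
// Hypothetical sketch of the naming rule above, not the actual code.
type WebcrawlerPage = {
  isPageAndFolder: boolean; // the page also acts as a folder for children
};

function displayName(page: WebcrawlerPage): string {
  // "_index" when the page doubles as a folder (the page's HTML title
  // was floated above as a way to convey this), "/" otherwise.
  return page.isPageAndFolder ? "_index" : "/";
}
```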