Closed lasryaric closed 8 months ago
Moving to in progress since assigned and we have a PR in flight
last part of the url (split by /) when isPageAndFolder is false.
Not sure I understand the exact meaning here?
Not sure I understand the exact meaning here?
for /fr/articles/2014-01-24-new-supra-ai-model
, the name would be 2014-01-24-new-supra-ai-model
new URL(urlstring).pathname.split('/').slice(0, -1)
I suppose these pages are most often summary pages and people would be interested by the list of articles that are usually in the folder, hence the initial intuition of not putting the page inside the folder, which would make it easier to select "all the articles in this folder".
Other than that, I am fine with the name of the last chunk (split by /
) or with _index
.
We can materialize the pages that are also a folder with a flag in the DB (eg: WebcrawlerPage.isPageAndFolder) and display:
OK I understand it now. I was confused by this sentence as it does not say that we're talking about the displayed name of the page.
Probably just the page HTML title should do the trick?)
We agreed that using page title was likely a bad idea right? I think it would be a bad idea to use it for these pages but not the others. If we were to use the page title everywhere we would have no such issue at all. But from some of the tests we can see on our workspace, it seems like a not so great idea (many titles are repeated all over the structure)
An alternative could also be to not show pages at all and only folders. Not sure it's great idea but worth mentioning as a possibility.
Probably just the page HTML title should do the trick?)
This was here from the first draft of this doc. Indeed, we agreed about not showing this page.
Let's go with index, in think these pages often do not matter for what we are doing, apart from allowing us to crawl what's in the folder.
Root folder page definition:
The page sitting at the top of the domain, eg: https://dust.tt/
_index
._index
_index
_index
, sitting as a child of the folderCurrently already coded version is Case 3.
Which one do you prefer?
My preference goes for Case 2 as it follows the 2 principles:
An alternative that is not described here also consists in having elements that are both page and folder being displayed as a page with children. This alleviates the need for _index. Why did you discard that proposition @lasryaric ?
I'm not sure I fully understand problem -- Why do we need pages to also be folders ? Could we have a page that is both selectable and has children, similar to the setup we have for Notion ?
@fontanierh agree with you. I guess it's to have the ability to sync subpages but not the page itself? seems rare though
No strong preference, would be fine with all. If I had to choose, I would pick 3 since it matches the nextjs way, and seem kind of standard
just seeing @spolu 's points
selecting a path selects everything that is under that path when there is no sibling / sub page / folders a page is just a page
is this not the case with option 3 too?
@philipperolet i think 3 always created an _index even even there is no sub page to a page path?
@spolu no, case 3 juste put the page in its folder if needed and name it index.
On Wed, Jan 17, 2024 at 9:13 AM Stanislas Polu @.***> wrote:
@philipperolet https://github.com/philipperolet i think 3 always created an _index even even there is no sub page to a page path?
— Reply to this email directly, view it on GitHub https://github.com/dust-tt/dust/issues/3166#issuecomment-1895299441, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACXUNLDXWP7NQ5S4ZQ2QZ3YO6B3FAVCNFSM6AAAAABBX5ZCTKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOJVGI4TSNBUGE . You are receiving this because you were mentioned.Message ID: @.***>
From what I understand the difference between 2 and 3 seems to be the naming only. I dont really have a strong preference, although using the “folder” name might be simpler to understand for users (so slight preference for 2)
Ok i see the description of 3 is not coherent with the image then ?
@spolu for me it is correct, but we might have a different definition of "root folder page".
Root folder page: the page sitting at the top of the domain name, eg: https://dust.tt/
.
(updated the doc to have everyone on the same page).
@philipperolet i think 3 always created an _index even even there is no sub page to a page path?
To be precise, that is not correct. 2 and 3 are the same except that the naming of the "page inside its own folder" is named differently.
I'm not sure I fully understand problem -- Why do we need pages to also be folders ? Could we have a page that is both selectable and has children, similar to the setup we have for Notion ?
@fontanierh what I don't like with this solution and the reason why it was not presented here, it's that you can't distinguish a page and a folder anymore, which I think is worse than case #2 or #3.
The UI will let you distinguish the two thanks to the permission type (file vs folder)?
you can't distinguish a page and a folder anymore
What exactly do you mean by this ? "Folder" isn't a real thing on websites AFAIK.
Could we say that:
?
What exactly do you mean by this ? "Folder" isn't a real thing on websites AFAIK.
Folder is not really a thing in a website I agree, but we are creating a mapping between the path and a folder on our side so we do end up with an actual folder structure. The only thing that we are debating here is: where should site a page for which the URL is both a folder and page, in this folder structure of our own?
The UI will let you distinguish the two thanks to the permission type (file vs folder)?
By distinguish I mean "distinguish in your selection". You can see it, but you can't select a folder and not the page, or the page and not the folder.
It is indeed the one downside associated with this approach.
You can see it, but you can't select a folder and not the page
With the UI we have for notion, you can select some children and not the parent page if you want, or you can select the whole page with its children. However, you cannot select the parent page without its children. IMO that is a reasonable limitation that is a small price to pay for a simpler, more user-friendly solution
Could we say that:
a page is a page a folder is a URL path "segment" that isn't a page pages can have children folders always have children
Yes but I don't see where this takes us.
@fontanierh has à point that this is already the case in Notion and no one ever complained about it + we can in the future support it if we want to.
With the UI we have for notion, you can select some children and not the parent page if you want, or you can select the whole page with its children. However, you cannot select the parent page without its children. IMO that is a reasonable limitation that is a small price to pay for a simpler, more user-friendly solution
Honestly this is a limitation that I think is bad, crawling will lead to all sorts of use-cases and allowing selection of what you want in the assistant creation UI is the key to a lot of the crawling "edge" cases if we can call them edge cases, so going with a solution where in the end you cannot select what you want is probably the worst of all in my opinion. That's why we initially rejected it with @spolu and that's why it was not listed here.
@fontanierh has à point that this is already the case in Notion and no one ever complained about it + we can in the future support it if we want to.
I am strongly against this opinion. From my little experience of crawling a bunch of websites working on it, not letting the users clearly select what they want is probably the worst.
Notion and crawling are two different things.
These pages are often the "crawling entry point" but contains only blurb of the data you are interested in, and being able to exclude them seems important to me.
(Hence the case number 1)
Notion and crawling are two different things.
On this specific topic, we had the exact same discussions with @Duncid for Notion and landed on what I linked above. I think what is even more important than being able to exclude index pages is to not miss future children of a page, which you can only do if the parentIds include the ID of the page itself (in which case you would have the index page included anyway)
On this specific topic, we had the exact same discussions with @Duncid for Notion and landed on what I linked above. I think what is even more important than being able to exclude index pages is to not miss future children of a page, which you can only do if the parentIds include the ID of the page itself (in which case you would have the index page included anyway)
Parents are managed separately, we can do whatever we want here.
So in the end, it's the exact same: you can select manually all children you are interested in (without including the parent itself) and you'll miss future children, or you can include the parent (along with the index page) and you'll get the future children. I don't think the solutions you are suggesting have any real advantage
So in the end, it's the exact same: you can select manually all children you are interested in (without including the parent itself) and you'll miss future children, or you can include the parent (along with the index page) and you'll get the future children. I don't think the solutions you are suggesting have any real advantage
It has the advantage of allowing you to select the page or not. Missing the follow up children is a problem that we have with the solution you suggest (lets call it # 4) and with #3 and #2.
The only one that does not creates this issue is the solution # 1, which I initially coded and which I would go for if I was the only one deciding (minus a few adjustments in the titles of the pages). I think it clearly shows the page and its children, you can select all future children without selecting the page, it's just a bit less pretty in the UI.
It has the advantage of allowing you to select the page or not.
You can also do this with the Notion setup, you are able to select child pages without selecting the parent, so the Case2 and Case3 have no advantages over it IMO, and have the disadvantage of being more confusing.
Case1 is indeed technically superior as it allows to "express" more things in the end. I'd argue that the added complexity isn't worth it, but I would still be fine with it.
I'm in favor of either "notion style" or Case1 👍
You can also do this with the Notion setup, you are able to select child pages without selecting the parent, so the Case2 and Case3 have no advantages over it IMO, and have the disadvantage of being more confusing.
While I understand what you are saying, I think the very rational and technical conversation we are having to understand all of that will have to be done by our users, which is a tough ask.
I will work on a prettier version of Case 1 and send a screenshot.
Case1 is indeed technically superior as it allows to "express" more things in the end. I'd argue that the added complexity isn't worth it, but I would still be fine with it.
@fontanierh I think the concept of complexity is highly subjective.
The most natural way for me to express this complexity was case 1.
if you have a page:
/articles
and then a 100 articles with paths like :
/articles/001-abc....
/articles/002-def....
Then having the page /articles
tangled with the actual articles can be seen as a lot of complexity by the end user. And I am sure there are cases where you are right, but after playing with it for quite a few days now, my intuition strongly pushes me away from solution # 4.
@lasryaric do you have examples where you feel it's very clear that not being able to select the page alone is an issue? The data we have is that this is the same for Notion and no-one ever complained right?
Additionally if this is an issue this is something we can support (selecting a page without their parent (though it would be quite involved arguably)). If we project ourselves in a world where we support this, then the page with children solution seems strictly superior to all others no?
Given this I'm leaning towards thinking this is the right solution because it's the most principled structure we can think of for URL crawling.
@lasryaric do you have examples where you feel it's very clear that not being able to select the page alone is an issue? The data we have is that this is the same for Notion and no-one ever complained right?
Here are 3 examples out of the 4 first ones we crawled on our workspace:
https://dust.tt/w/0ec9852c2f/builder/data-sources/www.chartjs.org-docs-latest (docs -> master)
Why would you want master alone? Do you mean you don't want master? Is that really an issue?
https://dust.tt/w/0ec9852c2f/builder/data-sources/api.slack.com-reference (reference)
Same?
https://dust.tt/w/0ec9852c2f/builder/data-sources/support.360learning.com-hc-fr (fr -> fr)
Why would you want fr
alone?
Why would you want master alone? Do you mean you don't want master? Is that really an issue?
I guess it's also the other way around ? Ability to select all children but not master itself (which is the upside of Case 1)
I guess it's also the other way around ? Ability to select all children but not master itself (which is the upside of Case 1)
Yes, the ability to select all children but not master itself.
Given that these files are mostly index files that are quite small, is that really an issue?
Also if you really care about that for all cases where the content is quite static you can just go ahead and select the children
Given that these files are mostly index files that are quite small, is that really an issue?
It's small in these cases, it could be big in other cases.
https://dust.tt/a/b
=> / -> a -> b
)This is the most principled mapping we can do given the nature of URLs where /ab/c
can be a page yet a/b/c/d
can also be a page.
In our current permission selection logic it means we can't select a page without selecting all its children. It also makes selecting all the children of a page without the page itself more complicated (select all children manually) or impossible in light of content that changes regularly (as selecting children will miss future children).
This is obviously a disadvantage but one that exists in Notion and has caused 0 issues with our users. The example shown in this thread related to that disadvantage mostly refer to cases where the page is an index and what we care about is the content of children. This exact case is very aligned with Notion and was not reported by our users as an issue (hypothesis is: the index content is negligible compared to children content and therefore not impacting quality of retrievals)
This structured is the most principled and an ideal solution would consists in having the capability to select a node children without the node itself in connectors permissions or selecting a node without its children. Both could be built in the future.
Alternatives consists (presence of page with children) in:
(i) showing a page as children of its URL with the same name (a/b
+ a/b/b
)
(ii) showing a page as children of its URL with a special name (a/b
+ a/b/_index
)
(iii) showing a page as sibling of its URL when there are children (a/b
(folder) + a/b
(page))
(iii) is likley confusing to the user. Potential question: what am I supposed to do to get the content of a/b
?
(i) and (ii) resolve the disadvantage of the approach proposed but show content that does not match with any existing URL which could be moderately confusing + instantiates objects with pathes that do not exist for real and hence appears as less principled.
The rationale for pushing for the proposal is to stick with the most principled approach especially as we have a path towards alleviating all disadvantages if they are strongly reported by users, which would benefit all other connectors.
The value of going with a principled approach is that we can clearly explain it to users. All other alternatives are non principled and as such can be pushed back against by users with no clear response as to why we went with that appraoch.
r? @dust-tt/engineering-team
It's small in these cases, it could be big in other cases.
So this is a speculative statement?
So this is a speculative statement?
yes
So this is a speculative statement?
This one for example is not small. There is no rule, hence my suggestion of suggesting both.
So this is a speculative statement?
This one for example is not small. There is no rule, hence my suggestion of suggesting both.
Not sure I get the example mind elaborating?
Not sure I get the example mind elaborating?
The /api/core page has some content, and pretty much all links on this page are children of /api/core.
The decision is to show the full hierarchy up to the root of the (sub)domain, and put each page inside its parent folder. So concretely, if we have the following url:
https://docs.dust.tt/
which results in the following crawled pages:We should have the following structure in the end:
/
(folder)fr
(folder)conversations
(page)apps
(page)data_sources
(folder)_index
(page) (here it's a page and a folder, so we need a way to properly convey that). Probably just the page HTML title should do the trick?)documents
(page)We can materialize the pages that are also a folder with a flag in the DB (eg:
WebcrawlerPage.isPageAndFolder
) and display them with the following name:_index
whenisPageAndFolder
is true./
) whenisPageAndFolder
is false.