dust-tt / dust

Amplify your team's potential with customizable and secure AI assistants.
https://dust.tt
MIT License
953 stars 108 forks source link

[Caloris2] [URL Connector] Better handling of the folders structure. #3166

Closed lasryaric closed 8 months ago

lasryaric commented 9 months ago

The decision is to show the full hierarchy up to the root of the (sub)domain, and put each page inside its parent folder. So concretely, if we have the following url: https://docs.dust.tt/ which results in the following crawled pages:

- https://docs.dust.tt/fr/conversations
- https://docs.dust.tt/fr/apps
- https://docs.dust.tt/fr/data_sources/
- https://docs.dust.tt/fr/data_sources/documents

We should have the following structure in the end:

We can materialize the pages that are also a folder with a flag in the DB (eg: WebcrawlerPage.isPageAndFolder) and display them with the following name:

spolu commented 9 months ago

Moving to in progress since assigned and we have a PR in flight

spolu commented 9 months ago

last part of the url (split by /) when isPageAndFolder is false.

Not sure I understand the exact meaning here?

lasryaric commented 9 months ago

Not sure I understand the exact meaning here?

for /fr/articles/2014-01-24-new-supra-ai-model, the name would be 2014-01-24-new-supra-ai-model

new URL(urlstring).pathname.split('/').slice(0, -1)
lasryaric commented 9 months ago

I suppose these pages are most often summary pages and people would be interested by the list of articles that are usually in the folder, hence the initial intuition of not putting the page inside the folder, which would make it easier to select "all the articles in this folder". Other than that, I am fine with the name of the last chunk (split by /) or with _index.

spolu commented 9 months ago

We can materialize the pages that are also a folder with a flag in the DB (eg: WebcrawlerPage.isPageAndFolder) and display:

OK I understand it now. I was confused by this sentence as it does not say that we're talking about the displayed name of the page.

spolu commented 9 months ago

Probably just the page HTML title should do the trick?)

We agreed that using page title was likely a bad idea right? I think it would be a bad idea to use it for these pages but not the others. If we were to use the page title everywhere we would have no such issue at all. But from some of the tests we can see on our workspace, it seems like a not so great idea (many titles are repeated all over the structure)

An alternative could also be to not show pages at all and only folders. Not sure it's great idea but worth mentioning as a possibility.

lasryaric commented 9 months ago

Probably just the page HTML title should do the trick?)

This was here from the first draft of this doc. Indeed, we agreed about not showing this page.

Let's go with index, in think these pages often do not matter for what we are doing, apart from allowing us to crawl what's in the folder.

lasryaric commented 9 months ago

Root folder page definition: The page sitting at the top of the domain, eg: https://dust.tt/

Case 1

Screenshot 2024-01-16 at 17 14 31

Case 2

Screenshot 2024-01-16 at 17 07 56

Case 3

solution_index

Currently already coded version is Case 3.

Which one do you prefer?

spolu commented 9 months ago

My preference goes for Case 2 as it follows the 2 principles:

An alternative that is not described here also consists in having elements that are both page and folder being displayed as a page with children. This alleviates the need for _index. Why did you discard that proposition @lasryaric ?

fontanierh commented 9 months ago

I'm not sure I fully understand problem -- Why do we need pages to also be folders ? Could we have a page that is both selectable and has children, similar to the setup we have for Notion ?

Screenshot 2024-01-16 at 21 04 59
philipperolet commented 9 months ago

@fontanierh agree with you. I guess it's to have the ability to sync subpages but not the page itself? seems rare though

No strong preference, would be fine with all. If I had to choose, I would pick 3 since it matches the nextjs way, and seem kind of standard

just seeing @spolu 's points

selecting a path selects everything that is under that path when there is no sibling / sub page / folders a page is just a page

is this not the case with option 3 too?

spolu commented 9 months ago

@philipperolet i think 3 always created an _index even even there is no sub page to a page path?

lasryaric commented 9 months ago

@spolu no, case 3 juste put the page in its folder if needed and name it index.

On Wed, Jan 17, 2024 at 9:13 AM Stanislas Polu @.***> wrote:

@philipperolet https://github.com/philipperolet i think 3 always created an _index even even there is no sub page to a page path?

— Reply to this email directly, view it on GitHub https://github.com/dust-tt/dust/issues/3166#issuecomment-1895299441, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACXUNLDXWP7NQ5S4ZQ2QZ3YO6B3FAVCNFSM6AAAAABBX5ZCTKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOJVGI4TSNBUGE . You are receiving this because you were mentioned.Message ID: @.***>

fontanierh commented 9 months ago

From what I understand the difference between 2 and 3 seems to be the naming only. I dont really have a strong preference, although using the “folder” name might be simpler to understand for users (so slight preference for 2)

spolu commented 9 months ago

Ok i see the description of 3 is not coherent with the image then ?

lasryaric commented 9 months ago

@spolu for me it is correct, but we might have a different definition of "root folder page". Root folder page: the page sitting at the top of the domain name, eg: https://dust.tt/. (updated the doc to have everyone on the same page).

lasryaric commented 9 months ago

@philipperolet i think 3 always created an _index even even there is no sub page to a page path?

To be precise, that is not correct. 2 and 3 are the same except that the naming of the "page inside its own folder" is named differently.

lasryaric commented 9 months ago

I'm not sure I fully understand problem -- Why do we need pages to also be folders ? Could we have a page that is both selectable and has children, similar to the setup we have for Notion ?

@fontanierh what I don't like with this solution and the reason why it was not presented here, it's that you can't distinguish a page and a folder anymore, which I think is worse than case #2 or #3.

spolu commented 9 months ago

The UI will let you distinguish the two thanks to the permission type (file vs folder)?

fontanierh commented 9 months ago

you can't distinguish a page and a folder anymore

What exactly do you mean by this ? "Folder" isn't a real thing on websites AFAIK.

Could we say that:

?

lasryaric commented 9 months ago

What exactly do you mean by this ? "Folder" isn't a real thing on websites AFAIK.

Folder is not really a thing in a website I agree, but we are creating a mapping between the path and a folder on our side so we do end up with an actual folder structure. The only thing that we are debating here is: where should site a page for which the URL is both a folder and page, in this folder structure of our own?

The UI will let you distinguish the two thanks to the permission type (file vs folder)?

By distinguish I mean "distinguish in your selection". You can see it, but you can't select a folder and not the page, or the page and not the folder.

spolu commented 9 months ago

It is indeed the one downside associated with this approach.

fontanierh commented 9 months ago

You can see it, but you can't select a folder and not the page

With the UI we have for notion, you can select some children and not the parent page if you want, or you can select the whole page with its children. However, you cannot select the parent page without its children. IMO that is a reasonable limitation that is a small price to pay for a simpler, more user-friendly solution

lasryaric commented 9 months ago

Could we say that:

a page is a page a folder is a URL path "segment" that isn't a page pages can have children folders always have children

Yes but I don't see where this takes us.

spolu commented 9 months ago

@fontanierh has à point that this is already the case in Notion and no one ever complained about it + we can in the future support it if we want to.

lasryaric commented 9 months ago

With the UI we have for notion, you can select some children and not the parent page if you want, or you can select the whole page with its children. However, you cannot select the parent page without its children. IMO that is a reasonable limitation that is a small price to pay for a simpler, more user-friendly solution

Honestly this is a limitation that I think is bad, crawling will lead to all sorts of use-cases and allowing selection of what you want in the assistant creation UI is the key to a lot of the crawling "edge" cases if we can call them edge cases, so going with a solution where in the end you cannot select what you want is probably the worst of all in my opinion. That's why we initially rejected it with @spolu and that's why it was not listed here.

lasryaric commented 9 months ago

@fontanierh has à point that this is already the case in Notion and no one ever complained about it + we can in the future support it if we want to.

I am strongly against this opinion. From my little experience of crawling a bunch of websites working on it, not letting the users clearly select what they want is probably the worst.

Notion and crawling are two different things.

lasryaric commented 9 months ago

These pages are often the "crawling entry point" but contains only blurb of the data you are interested in, and being able to exclude them seems important to me.

lasryaric commented 9 months ago

(Hence the case number 1)

fontanierh commented 9 months ago

Notion and crawling are two different things.

On this specific topic, we had the exact same discussions with @Duncid for Notion and landed on what I linked above. I think what is even more important than being able to exclude index pages is to not miss future children of a page, which you can only do if the parentIds include the ID of the page itself (in which case you would have the index page included anyway)

lasryaric commented 9 months ago

On this specific topic, we had the exact same discussions with @Duncid for Notion and landed on what I linked above. I think what is even more important than being able to exclude index pages is to not miss future children of a page, which you can only do if the parentIds include the ID of the page itself (in which case you would have the index page included anyway)

Parents are managed separately, we can do whatever we want here.

fontanierh commented 9 months ago

So in the end, it's the exact same: you can select manually all children you are interested in (without including the parent itself) and you'll miss future children, or you can include the parent (along with the index page) and you'll get the future children. I don't think the solutions you are suggesting have any real advantage

lasryaric commented 9 months ago

So in the end, it's the exact same: you can select manually all children you are interested in (without including the parent itself) and you'll miss future children, or you can include the parent (along with the index page) and you'll get the future children. I don't think the solutions you are suggesting have any real advantage

It has the advantage of allowing you to select the page or not. Missing the follow up children is a problem that we have with the solution you suggest (lets call it # 4) and with #3 and #2.

The only one that does not creates this issue is the solution # 1, which I initially coded and which I would go for if I was the only one deciding (minus a few adjustments in the titles of the pages). I think it clearly shows the page and its children, you can select all future children without selecting the page, it's just a bit less pretty in the UI.

fontanierh commented 9 months ago

It has the advantage of allowing you to select the page or not.

You can also do this with the Notion setup, you are able to select child pages without selecting the parent, so the Case2 and Case3 have no advantages over it IMO, and have the disadvantage of being more confusing.

Case1 is indeed technically superior as it allows to "express" more things in the end. I'd argue that the added complexity isn't worth it, but I would still be fine with it.

I'm in favor of either "notion style" or Case1 👍

lasryaric commented 9 months ago

You can also do this with the Notion setup, you are able to select child pages without selecting the parent, so the Case2 and Case3 have no advantages over it IMO, and have the disadvantage of being more confusing.

While I understand what you are saying, I think the very rational and technical conversation we are having to understand all of that will have to be done by our users, which is a tough ask.

I will work on a prettier version of Case 1 and send a screenshot.

lasryaric commented 9 months ago

Case1 is indeed technically superior as it allows to "express" more things in the end. I'd argue that the added complexity isn't worth it, but I would still be fine with it.

@fontanierh I think the concept of complexity is highly subjective. The most natural way for me to express this complexity was case 1. if you have a page: /articles

and then a 100 articles with paths like :

/articles/001-abc....
/articles/002-def....

Then having the page /articles tangled with the actual articles can be seen as a lot of complexity by the end user. And I am sure there are cases where you are right, but after playing with it for quite a few days now, my intuition strongly pushes me away from solution # 4.

spolu commented 9 months ago

@lasryaric do you have examples where you feel it's very clear that not being able to select the page alone is an issue? The data we have is that this is the same for Notion and no-one ever complained right?

Additionally if this is an issue this is something we can support (selecting a page without their parent (though it would be quite involved arguably)). If we project ourselves in a world where we support this, then the page with children solution seems strictly superior to all others no?

Given this I'm leaning towards thinking this is the right solution because it's the most principled structure we can think of for URL crawling.

lasryaric commented 9 months ago

@lasryaric do you have examples where you feel it's very clear that not being able to select the page alone is an issue? The data we have is that this is the same for Notion and no-one ever complained right?

Here are 3 examples out of the 4 first ones we crawled on our workspace:

spolu commented 9 months ago

https://dust.tt/w/0ec9852c2f/builder/data-sources/www.chartjs.org-docs-latest (docs -> master)

Why would you want master alone? Do you mean you don't want master? Is that really an issue?

https://dust.tt/w/0ec9852c2f/builder/data-sources/api.slack.com-reference (reference)

Same?

https://dust.tt/w/0ec9852c2f/builder/data-sources/support.360learning.com-hc-fr (fr -> fr)

Why would you want fr alone?

fontanierh commented 9 months ago

Why would you want master alone? Do you mean you don't want master? Is that really an issue?

I guess it's also the other way around ? Ability to select all children but not master itself (which is the upside of Case 1)

lasryaric commented 9 months ago

I guess it's also the other way around ? Ability to select all children but not master itself (which is the upside of Case 1)

Yes, the ability to select all children but not master itself.

spolu commented 9 months ago

Given that these files are mostly index files that are quite small, is that really an issue?

Also if you really care about that for all cases where the content is quite static you can just go ahead and select the children

lasryaric commented 9 months ago

Given that these files are mostly index files that are quite small, is that really an issue?

It's small in these cases, it could be big in other cases.

spolu commented 9 months ago

Proposal

Invariants

Advantages

This is the most principled mapping we can do given the nature of URLs where /ab/c can be a page yet a/b/c/d can also be a page.

Disadvantages

In our current permission selection logic it means we can't select a page without selecting all its children. It also makes selecting all the children of a page without the page itself more complicated (select all children manually) or impossible in light of content that changes regularly (as selecting children will miss future children).

This is obviously a disadvantage but one that exists in Notion and has caused 0 issues with our users. The example shown in this thread related to that disadvantage mostly refer to cases where the page is an index and what we care about is the content of children. This exact case is very aligned with Notion and was not reported by our users as an issue (hypothesis is: the index content is negligible compared to children content and therefore not impacting quality of retrievals)

Path to ideal solution

This structured is the most principled and an ideal solution would consists in having the capability to select a node children without the node itself in connectors permissions or selecting a node without its children. Both could be built in the future.

Alternatives

Alternatives consists (presence of page with children) in: (i) showing a page as children of its URL with the same name (a/b + a/b/b) (ii) showing a page as children of its URL with a special name (a/b + a/b/_index) (iii) showing a page as sibling of its URL when there are children (a/b (folder) + a/b (page))

(iii) is likley confusing to the user. Potential question: what am I supposed to do to get the content of a/b ?

(i) and (ii) resolve the disadvantage of the approach proposed but show content that does not match with any existing URL which could be moderately confusing + instantiates objects with pathes that do not exist for real and hence appears as less principled.

Rationale

The rationale for pushing for the proposal is to stick with the most principled approach especially as we have a path towards alleviating all disadvantages if they are strongly reported by users, which would benefit all other connectors.

The value of going with a principled approach is that we can clearly explain it to users. All other alternatives are non principled and as such can be pushed back against by users with no clear response as to why we went with that appraoch.

spolu commented 9 months ago

r? @dust-tt/engineering-team

spolu commented 9 months ago

It's small in these cases, it could be big in other cases.

So this is a speculative statement?

lasryaric commented 9 months ago

So this is a speculative statement?

yes

lasryaric commented 9 months ago

So this is a speculative statement?

This one for example is not small. There is no rule, hence my suggestion of suggesting both.

https://crawlee.dev/api/core#EventTypeName

spolu commented 9 months ago

So this is a speculative statement?

This one for example is not small. There is no rule, hence my suggestion of suggesting both.

https://crawlee.dev/api/core#EventTypeName

Not sure I get the example mind elaborating?

lasryaric commented 9 months ago

Not sure I get the example mind elaborating?

The /api/core page has some content, and pretty much all links on this page are children of /api/core.