OrderedCollection: persist items order

srosset81 commented 3 years ago

Issue Currently, to manage OrderedCollection, items are ordered according to a predicate (for example as:published). But normally the order of items is persisted.

Proposal

Use rdf:Seq or rdf:List to persist the order of items.
This library could be used (or not): https://github.com/ontola/rdfdev-js/tree/master/packages/collections
When attaching an item to a collection, it should be possible to define whether to put it at the beginning or the end of the collection.
Remove the semapps:sortPredicate and semapps:sortOrder predicates from ordered collections

srosset81 commented 1 month ago

This would be important for performances !

For an inbox with 6000 activities, and with WAC permissions check, it takes 25-30s to load the first page (and all other pages) because all activities must be ordered by the as:published activity. If we correctly persisted the number of items, we would not need to load and sort this whole list ! And it would be better for standards compliance too.

srosset81 commented 1 month ago

Some docs about RDF ordering:

srosset81 commented 1 month ago

One possibility could be to persist the OrderedCollectionPages (SemApps does not persist them at the moment, they are auto-generated by the middleware). It would be faster to go to a specific page.

The ActivityStreams spec has a long section about collections: https://www.w3.org/TR/activitystreams-core/#collections

Laurin-W commented 1 month ago

Maybe it makes sense to distinguish here between RDF representation and implementation. Personally, I like the sort predicate as a concept quite well because it fits better into the world of rdf and also local first. Here it is the same, the rdf representation is not reflected in the structure of the rdf data persisted in the graph db.

One possibility could be to persist the OrderedCollectionPages

You mean in the redis cache for example?

Spinning this further, we could even build a custom index (for at least some predicates) in the middleware to handle this but it's not really pretty either... Ideally the database would support indexing based on a predicate, like sql databases support this as well.

srosset81 commented 1 month ago

Ideally the database would support indexing based on a predicate, like sql databases support this as well.

The problem is not indexing. Fuseki handles indexing very well. The problem is that, by definition, a triplestore cannot keep track of the order of a list, so we need to rely on the rdf:Seq or rdf:List do that.

One possibility could be to persist the OrderedCollectionPages

You mean in the redis cache for example?

No I was suggesting to persist it in the triple store. Right now when we fetch a collection, we build "on the fly" the special API for collections, such as totalItems, first, last, etc. We also build "on the fly" the CollectionPage or OrderedCollectionPage. I was suggesting that maybe ALL these informations could be persisted in the triple store, so that they could be fetched like any other resource (which would be much cleaner from a Solid point of view).

I'm not sure however it's a good idea, because it would be a lot of work to "rearrange" the triplestore whenever there is a change. For example the last predicate would need to be updated for all pages whenever a new page is added. The other problem is that, in the inbox and outbox, we are supposed to put the latest activity at the very top. So potentially we need to rework all the pages whenever there is a new item.

So with this solution, the GET may be faster, but the POST may be longer. This is not a big problem, but I'm more afraid of the kind of bugs that could come out of this.

srosset81 commented 3 weeks ago

FYI I found out that the LDP paging specs allow some sort of ordering, but it is not persisted: rather, it is only a way for a server to inform the client the way the resources have been ordered when paging is requested.

See the example of what can be added to a container:

<> ldp:pageSortCriteria  ( <#Sort-o.marketValue-Ascending> ).

<#Sort-o.marketValue-Ascending>
   a ldp:pageSortCriterion;
   ldp:pageSortOrder ldp:Ascending;
   ldp:pageSortPredicate o:marketValue.

That looks a bit like what we implement now now for ActivityStreams ordered collections (like in this example). If we applied this part of the LDP paging spec, we would certainly have the same kind of performance issues.

Anyway, if LDP cannot handle ordered lists, then it means we probably can't consider ActivityStreams collections as another "representation" of LDP containers, as @elf-pavlik suggested in our recent call. Or at least not ordered collections (used for inboxes and outboxes)

niko-ng commented 2 weeks ago

This would be important for performances !

For an inbox with 6000 activities, and with WAC permissions check, it takes 25-30s to load the first page (and all other pages) because all activities must be ordered by the as:published activity. If we correctly persisted the number of items, we would not need to load and sort this whole list ! And it would be better for standards compliance too.

We will have to be very careful here on this topic.

I am planning to do some major work in NextGraph to have ordered collections (arrays) work well, performance wise, and also compatible with CRDT. But for now, we do not have that.

My first question is about this query that takes 30s. This is not normal and can probably optimised.

The first problem I see is that you fetch all the triples in memory, and order them in JS. This is suboptimal. RDF and SPARQL can do that for you, and can also do range queries, but only if the datatype of as:published is an integer. Unfortunately I see in the specs that it is a dateTime. Triplestores in general cannot index dates. Jena is one of them. Oxigraph the same. So we have to convert the date into a long integer (unix timestamp) and then queries on that value (filter, order, range) will work well with much better perfs (including in Jena), and more importantly, we will not need to load all the items in JS memory in order to do the ordering.

Then the discussion about pagination is not an easy one. I started to talk about it in Matrix with Laurin and we have to understand exactly what are the uses cases, because it has an impact on how we implement it and what kind of features we can skip, in order to improve prefs.

As far as I understood, we need to be able to add an item at the beginning of the list, and at the end of it, but not in the middle. Please confirm or debate on that here as it is important to know the answer now.

Another question: do we need to randomly access arbitrary page numbers "out of nowhere" ? like ?page=10 or can we reduce the usecase to a timeline scrolling where we start from the last item in the collection, and we repeatedly want the pages one by one going down in the "past" of the timeline? If this is the case, the API can be simpler, with just ?nextPageAfter=XXX&page_size=10 and XXX would be the long integer of the last activity we have received in the previous page. Then it is easy to ask in SPARQL, for the preceding 10 items of XXX. (10 being the page size here, as an example). There is nothing in the as:Collection specs that hints at a way to retrieve random pages! It seems that their usecase is more to start from the first page, and iteratively go from one page to the next. This is indeed the behaviour of a timeline. accessing random pages is more usually seen on a "table" representation of some data, where you have a pagination widget at the bottom that lets you click on any page number. Is that something that we need for collections?

Then there is the question of how to retrieve the pages. They can be generated in memory on the fly (on request), or they can be persisted (I previously came up with this idea) but less easily if we can insert items at the beginning of the collection (it will create a first page that is not full, which is not very nice in terms of UX). And not possible at all if we want to insert random items in the middle of the list.

If we can persist the pages, then we do not need to persist the individual order of each item, as this is easily fetched by knowing the starting item of a page, and querying in the next X ordered items (X being the page size). Jena will do that easily.

Please note that Jena has no support for rdf:seq at all, and will not index the data according to the rdf:_1 predicates. SPARQL will not help us here if we use this kind of order persisting. So I think this is a dead end. Jena has support for rdf:List (which are linked lists) but only if the ARQ engine is loaded, which is not the case for us. And anyway we do not want rdf:List (unless we change the usecase and accept that random page lookup is not possible).

Please detail here below the user cases (Inbox, Outbox, other ordered collections, and the requirements for them (insert at the begin, end, in the middle, and the rational behind those requirements).

niko-ng commented 2 weeks ago

@srosset81

I was suggesting to persist [the OrderCollectionPage]s in the triple store

in fact, I was suggesting that ;)

srosset81 commented 2 weeks ago

Hi @niko-ng, thanks for your comment ! Just a few quick answers, in the hope it helps you. I'll read your comment more thoroughly later, in case I missed some important informations.

The first problem I see is that you fetch all the triples in memory, and order them in JS. This is suboptimal. RDF and SPARQL can do that for you, and can also do range queries, but only if the datatype of as:published is an integer.

No, we don't sort them with JS. The SPARQL query that takes 30s is here:

https://github.com/assemblee-virtuelle/semapps/blob/83e7a7e07d9a1e21cc8f4e0511459bbfb01c20b3/src/middleware/packages/activitypub/services/activitypub/subservices/collection.js#L264-L283

Triplestores in general cannot index dates. Jena is one of them. Oxigraph the same. So we have to convert the date into a long integer (unix timestamp) and then queries on that value (filter, order, range) will work well with much better perfs (including in Jena), and more importantly, we will not need to load all the items in JS memory in order to do the ordering.

OK, but then this would be something internal to NextGraph ? Because if we want to conform to ActivityPub specs, the published date must be a ... date.

As far as I understood, we need to be able to add an item at the beginning of the list, and at the end of it, but not in the middle. Please confirm or debate on that here as it is important to know the answer now.

Well the specs about OrderedCollections do not specify that, so in theory we should be able to add items anywhere we want. But in ActivityPub, OrderedCollections are used only for inboxes and outboxes (and I don't know of other usages in the fediverse, but I'm not an expert), so naturally it should be added at the beginning or the end (the most recent activities should appear on the first page, but we could use a reverse order).

Another question: do we need to randomly access arbitrary page numbers "out of nowhere" ? like ?page=10 or can we reduce the usecase to a timeline scrolling where we start from the last item in the collection, and we repeatedly want the pages one by one going down in the "past" of the timeline? If this is the case, the API can be simpler, with just ?nextPageAfter=XXX&page_size=10 and XXX would be the long integer of the last activity we have received in the previous page. Then it is easy to ask in SPARQL, for the preceding 10 items of XXX. (10 being the page size here, as an example). There is nothing in the as:Collection specs that hints at a way to retrieve random pages! It seems that their usecase is more to start from the first page, and iteratively go from one page to the next. This is indeed the behaviour of a timeline. accessing random pages is more usually seen on a "table" representation of some data, where you have a pagination widget at the bottom that lets you click on any page number. Is that something that we need for collections?

Collections only need to implement first, prev, next and last predicates (and I'm not even sure they are all required, but probably recommended). So we can use any query strings we want. What you are suggesting reminds me of the way pagination is usually handed with GraphQL (with what they call "edges")

Please note that Jena has no support for rdf:seq at all, and will not index the data according to the rdf:_1 predicates. SPARQL will not help us here if we use this kind of order persisting. So I think this is a dead end.

Do we really need special SPARQL queries here ? If we want the items 20 to 30 in a rdf:seq, can't we just retrieve rdf:_20, rdf:_21, and so on ?

Laurin-W commented 2 weeks ago

As far as I understood, we need to be able to add an item at the beginning of the list, and at the end of it, but not in the middle. Please confirm or debate on that here as it is important to know the answer now.

I'd say so too. For removing items though, it's not the same. E.g. if I undo a Like activity, I don't want that to show up in the likes collection where it says:

Every actor MAY have a liked collection. This is a list of every object from all of the actor's Like activities, added as a side effect. The liked collection MUST be either an OrderedCollection or a Collection and MAY be filtered on privileges of an authenticated user or as appropriate when no authentication is given.

So we also get the problem of filtering (which we have already though).

do we need to randomly access arbitrary page numbers "out of nowhere" ? like ?page=10 or can we reduce the usecase to a timeline scrolling where we start from the last item in the collection, and we repeatedly want the pages one by one going down in the "past" of the timeline [...] It seems that their usecase is more to start from the first page, and iteratively go from one page to the next

I would say so too. Though I see a use-case where you want to see all posts that were added before e.g. 2023.

This is indeed the behaviour of a timeline. accessing random pages is more usually seen on a "table" representation of some data, where you have a pagination widget at the bottom that lets you click on any page number. Is that something that we need for collections?

I don't think so, you might want to apply a filter before though, as described above. Can we reduce the problem to applying filters and then scrolling from there sequentually?

niko-ng commented 2 weeks ago

OK, but then this would be something internal to NextGraph ? Because if we want to conform to ActivityPub specs, the published date must be a ... date.

My bad f you use SPARQl for ordering, better! but indeed, the 30seconds come from the fact that SPARQL cannot index dates. You would store both a date and a timestamp, and return only the date to ActivityPub compliant APIs. the timestamp would just be used internally. Or if you want to save space, you would not save the dateTime, and instead regenerate it from the timestamp at the moment of outputting the data towards ActivityPub systems. Nothing specific to NExtGraph here. tried to stay away from anything specific to NextGraph in my answer, because I know it concerns a task that is starting now.

niko-ng commented 2 weeks ago

Well the specs about OrderedCollections do not specify that, so in theory we should be able to add items anywhere we want. But in ActivityPub, OrderedCollections are used only for inboxes and outboxes (and I don't know of other usages in the fediverse, but I'm not an expert), so naturally it should be added at the beginning or the end (the most recent activities should appear on the first page, but we could use a reverse order).

If we have the choice, then better to add at the end (and reverse order if needed) because adding at the end is always simpler for managing pages.

The question of "insert in the middle" might occur if you have activities added to the inbox by example, by a remote server, who was offline for some time (for some technical reason) and that will add "outdated" activities, that have an as:published date that is slightly behind the current time. If some more recent activities are already present in the inbox, then those incoming "old" activities will be inserted "in the middle".... what do you think of this case? can it occur?

niko-ng commented 2 weeks ago

Do we really need special SPARQL queries here ? If we want the items 20 to 30 in a rdf:seq, can't we just retrieve rdf:_20, rdf:_21, and so on ?

Yes we can, but each "fetch" will be inefficient. And we need to take into account that if there are some insert in the middle, then all the rdf:_xx above should be modified. Same if there is a delete in the middle.

Eventually, it will be good to know if we really need "random access to pages" or if we can use only next and previous.

And also, maybe use a special kind of URL for next and previous that in fact, contains the URI of the last resource in the current page (for next) or the first resource of the current page (for previous) so that it will be easy to find the next/previous items and build the page on the fly. (like GraphQL is doing, as you say). This mechanism also works if there were some concurrent inserts or deletes between the navigation from one page to another occurs, so that the page size is always correct. And also it works without persisting any metadata about the pages.

niko-ng commented 2 weeks ago

I don't think so, you might want to apply a filter before though, as described above. Can we reduce the problem to applying filters and then scrolling from there sequentually?

yes that would be great. It also solves the "before 2023" question. you just need to find the starting point, and then iterate pages with next. Finding the starting point is one SPARQL query, that can be quick.

Filtering can also be quick, specially if the ordering is based on long integer timestamps. as explained above. The 30 sec. you see for now is because Jena doesn't keep indexes on DateTime values.

niko-ng commented 2 weeks ago

BTW, I just saw that OxiGraph is automatically converting DateTime dataTypes to long in the storage, and back to DateTime when outputing. This is smart because indexing on long is automatically done. I remember that Jena is not doing that and is not indexing DateTime.

@srosset81 and @Laurin-W would you have a moment to answer my questions above so we can have a final decision on this topic?

Laurin-W commented 2 weeks ago

About servers coming back online and synchronizing that with your collection -- I think that makes sense, if you want to have a chronological order. I'd say we can live without that, if it makes things easier but it's definitely a nice to have!

Eventually, it will be good to know if we really need "random access to pages" or if we can use only next and previous.

Agree, I don't think we need that. But filtering would be nice, as detailed above. With that, I can't think of a good use case where you want to have items 580 to 590, if you don't know what's before or after... The only thing that came to my mind was that if you had search results ranked by relevance somehow and you wanted to know how lower-ranked results look like, you could jump to page 15 or so. But that's not really a use-case for social media, if at all.

And also, maybe use a special kind of URL for next and previous that in fact, contains the URI of the last resource in the current page (for next) or the first resource of the current page (for previous) so that it will be easy to find the next/previous items and build the page on the fly.

I like that idea! :)

BTW, I just saw that OxiGraph is automatically converting DateTime dataTypes to long

Nice! :D

niko-ng commented 1 week ago

Thanks for your reply @Laurin-W

So if I summarise, the solution could be :

we do not persist pages. they are generated on the fly.
we do not allow "random access to pages", saving us a lot of trouble.
instead, each page has a next and previous link, that includes the last or first item in the current page (respectively) in order for the pod provider (the middleware) to construct the next or previous page
all pages have a first and last link, that is just "URI of collection"+"/last" or +"/first". and never changes. It is the middleware that will generate those page dynamically.
a parameter can be passed to define the page size (called page_size ?), that can also have a default value set system-wide.
we do not use rdf:seq as it isn't needed
the ordering is done by the SPARQL engine (oxigraph) with the timestamps that are indexed automatically. On Jena, we will have to convert he DateTime values into long (UNIX timestamp), and store them this way. And then when presenting the data in a page, convert the long back to a DateTime.
when a new page needs to be generated, we take the starting item and FILTER (?time > [last]) with it, then LIMIT and ORDER BY
additional filtering can be done with SPARQL before the LIMIT. any combination is possible.

Laurin-W commented 1 week ago

Perfect, that sounds good! Thanks for writing this summary :)

elf-pavlik commented 1 week ago

I only skimmed over the details of this conversation. When AS paged collections were drafted, besides parallel work on LDP pagination, there was also an exploration of pagination in Hydra CG https://www.w3.org/community/hydra/wiki/Collection_Design#Pagination

The one in Hydra was my favorite at the time.

BTW you might consider submitting your pagination, ordering and filtering use cases to the Linked Web Storage WG https://github.com/w3c/lws-ucs

niko-ng commented 1 week ago

Thanks @elf-pavlik for looking into this issue and finding it worthy to be presented to LWS WG. We have a look at the Hydra collections indeed, but they do have the caveat of relying on ?page=2 which is a "random access to pages" mechanism, which we want to avoid, as explained above.

BTW, while you are here, I want to ask you your opinion: What do you think about this feature of "random access to pages" that is hard to implement and that we want to specifically not implement. Do you see some use case to access a random page number ? I believe that starting from somewhere and then doing next > next > next or previous < previous < previous is enough for all cases of scrolling into a collection, streaming it, and going to the "current head" and "first item"

srosset81 commented 1 week ago

Thanks @niko-ng for your proposal and sorry for my late reply.

So I understand that if we store the as:published predicate as a timestamp, the ordering will be faster. But if there are 6000 items in the collection, Fuseki will still need to go through the timestamp of these 6000 items (and more importantly, verify that the user has the permission to see each item because otherwise it won't be able to access the timestamp) to sort them correctly. So will performances be much better ? I wonder. And does it matter so much to go through pages with the URI of the first/last item if Fuseki has already sorted the 6000 activities anyway ?

a parameter can be passed to define the page size (called page_size ?), that can also have a default value set system-wide.

This is actually persisted in the collection, with the semapps:itemsPerPage predicate. We could allow to change this parameter with a query string, but this is not standardized. Maybe it's better to add this option when we tackle the issue of filtering collections (which is also not standardized)

If the performances can really be improved with this proposal, the only thing that bothers me is that we will not conform with ActivityPub specs, which expects a list as we can see in the official JSON-LD context. I found again this thread where people criticize this choice of not perstiting the order. It does not have an impact for ActivityPub servers, which don't care about how you store data. But we will still store data in a way that is particular to SemApps.

It's a choice we can do, but it's important to be aware of this downside.

If the performances are much better with Niko's proposal than if we persist the order, then I'd say performances matter more than standards compliance in that case.

elf-pavlik commented 1 week ago

We have a look at the Hydra collections indeed, but they do have the caveat of relying on ?page=2 which is a "random access to pages" mechanism, which we want to avoid, as explained above.

Looking at Hydra's PartialCollectionView hypermedia controls, I see only four affordances: first, previous, next, and last. The edges would only have two; for example, the first part would have next and last. I imagine a URI Template would be required to provide a functional hypermedia control that allows access to a random page. Otherwise, clients have to treat all URIs as opaque.

I don't find accessing a random page/part based on its index useful. Filtering the collection before paging it would be more practical, e.g., all the activity from the last month. I still haven't looked at your issue about filtering

1251

niko-ng commented 1 week ago

I see only four affordances: first, previous, next, and last

Yes. Those next and last are similar to what as:OrderedCollectionPage has. And this is what we will use. But the question is about what to put as the URL value in the next and last properties.

And I was just saying that in their example, they put URLs that have a query parameter ?page=2. It is just as an example I guess, but it denotes that they consider their PartialCollectionView to have a URL of this form, which is not something we would like to implement, because it comes with a high performance cost.

But yes generally I agree that both Hydra and AS ontology for pagination is very similar. And honestly I don't know which one is best.

pietercolpaert commented 5 days ago

Let me add another alternative next to Hydra and AS:

I proposed the TREE specification for that use case: the server has a fixed pagination and the client needs to follow links. However, these links are explained: i.e., you can follow a link to a page that is indicated to contain for example, entities later in time. This way you can build search trees instead of just providing traversal in 1 or two directions with next and previous, and the client can understand a sense of ordering.

assemblee-virtuelle / semapps

OrderedCollection: persist items order #568

1251