atomicdata-dev / atomic-data-docs

Atomic Data is a specification to make it easier to exchange data.
https://docs.atomicdata.dev
MIT License
17 stars 7 forks source link

Dealing with large 1-n relationships (Long arrays and chat feeds) #108

Open joepio opened 2 years ago

joepio commented 2 years ago

We need a way to deal with 1-n relationships for very big n values.

Imagine we have a chatroom app, using Atomic Data. If we'd model this using some ChatRoom that has a ResourceArray of messages, we'll end up with a very, very long ResourceArray. If someone wants the ChatRoom resource, they don't need the entire list of messages - they could do with a link, or a (paginated) subset. But the client needs a way to find the (latest) messages by looking at the ChatRoom.

We get a similar issue when we have a very large collection of Members of some organization. They all need read rights for some resource, but I don't want to have a read resourceArray at the top level resource that contains an array of thousands of items. The Client needs a way to get the Members by looking at the top level / organization resource.

But how do we model this?

Some approaches:

Allowing Collection URLs where ResourceArrays are used

Instead of having members: [1, 2, 3] we'd also allow members: https://atomicdata/dev/chat/1241/messages. This would allow servers to decide whether they want to include all items in the first response, or let the client get a controlled subset using a Collection.

What I don't like about this approach, is that it makes clients more complicated. They expected a resourceArray, now they get a string containing a URL. This means clients need logic that expects this. No problem for @tomic/lib to add this, but it might be confusing for other clients.

On the other hand, the resourceArray datatype already is a bit complex. It can contain anonymous resources, nested resources and URLs of resources. So clients already need custom logic to parse resourcearrays.

The server will need to construct a URL for the collection. How should it do this? Should we standardize this behavior?

Use query parameters in Resources

We could add a query parameter that allows for pagination inside a property of a resource. Maybe we can have something like paginateProperty={propertyURL, pageNumber}, or a base64 encoded JSON object?

Feels like a weird approach

Add optional nextPage and previousPage resources at the start and end of resourceArrays

We know that ResourceArrays can contain entire resources (objects) as well as URLs. Let's say arrays have the option to end with a NextPage resource, which contains a URL that should contain the next page.

Add a new DataType for Collections (note: current favorite)

Instead of members being an array, it is a different datatype: a Collection. The isA attribute is still there in the members Property.

Adding an item to the collection should work differently from making regular resourceArray Commits. When appending, simply create a new resource with the Collection as its Parent. The server should validate whether it is allowed to post to that Collection (authorization, shape validation).

I'm not sure how we should create the Collection. The app should do this for the user, anyway - manually creating the collection seems error-prone. Should the Collection be created before, after or while the Resource linking to the Collection is created? They both need to link to each other, that's for sure. Should the client know both URLs before creating both in two Commits? The front-end could first create the ChatRoom, then the MessagesCollection, and then update the ChatRoom to link to the messagesCollection. Doesn't seem that clean, though. Maybe the server should create a Collection Resource when the Class requires a Collection and none is present in the Commit.

I think the Collection should be a Table: #37 , since it constraints its children and is not just a query over existing resources.

The big advantage of this approach, is that it seems more trivial to implement server-side. Collections are stored with indexes, which allows for quick querying and pagination.

Adding things to the collection will also feel a bit weird, API wise. It will at least be very different from adding an item to a ResourceArray. But I think the creation discussion should depend on the Table issue #37

Verify incoming links

We could flip responsibilities mentioned in the examples in the intro: let the Message link to the ChatRoom, and let the Members indicate that they have read-write rights.

But this introduces a different challenge: we need to perform authorization checks. Does the user have the right to post to this chatroom? Does the member have read rights?

The beauty of outgoing links is that the resource containing the links is also the one being the source of truth. A resource describes its own authorization, so if you can edit it, you can authorize it.

This means we should verify incoming links, because otherwise anybody could say "hi I have read rights to this thing" and get access. This could mean that we have properties that require some right in the external resource. We already do this with parents, maybe we should also do it with other Properties? Maybe have some requiresAppendRights property? So we'd have a Message that has a chatRoom property. This chatRoom property has a requiresAppendRights set to true, which means that if the Message is parsed by the server, the server will check if the creator of the Message has append rights for the chatRoom.

Add a ChatRoom Endpoint with a nice API

Instead of solving the more abstract problem, I could opt for solving the specific chat API problem. We could add query parameters for the ChatRoom resource (using Endpoints) for things like from and to, similar to how Matrix does this.

But that does not really solve the more general 1-n problem.

jonassmedegaard commented 2 years ago

Why do messages need to be tracked in (virtually endless) arrays?

Blogging and Microblogging treats each message as an individual object, with a strong reference being time of publishing the post, and optional weak references to other message objects (notibly previous and next). ActivityPub and ActivityStreams uses that model, and if you want to go that route then I recommend to reuse most possible from those designs.

A variation is to not store chat messages at all, but treat them as ephemeral. openEngiadina seemingly did just that with their recent move to ActivityStreams-over-XMPP.

Another model is to treat each "room" as a graph of events. This is how matrix is designed - again I recommend to reuse as much as possible from existing design, if this is your desired approach.

jonassmedegaard commented 2 years ago

If you want simple clients, then don't invent a new chat-client-exposed protocol at all, but instead write a atomic-data-to-xmpp or an atomic-data-to-matrix or an atomic-data-to-sms service.

joepio commented 2 years ago

Why do messages need to be tracked in (virtually endless) arrays?

I don't think they can be arrays, because of performance. You can't serialize all messages when someone wants to fetch a ChatRoom resource.

Blogging and Microblogging treats each message as an individual object, with a strong reference being time of publishing the post, and optional weak references to other message objects (notibly previous and next). ActivityPub and ActivityStreams uses that model, and if you want to go that route then I recommend to reuse most possible from those designs.

That's like a linked list, right? That means every message needs to know about it's previous and possibly even next message. And it doesn't address the question of finding the messages from a chat room.

A variation is to not store chat messages at all, but treat them as ephemeral. openEngiadina seemingly did just that with their recent move to ActivityStreams-over-XMPP.

I want to store chat messages, at least for some usecases. There's also non-chat usecases, I'll add these to the issue description.

Another model is to treat each "room" as a graph of events. This is how matrix is designed - again I recommend to reuse as much as possible from existing design, if this is your desired approach.

Reminds me a little of the previousCommit property in Atomic.

But it still does not solve my underlying issue of dealing with large array-like objects.

If you want simple clients, then don't invent a new chat-client-exposed protocol at all, but instead write a atomic-data-to-xmpp or an atomic-data-to-matrix or an atomic-data-to-sms service.

I'm not per say trying to solve chats, but a more abstract problem of having a 1->n relationship which extremely high N's. I've updated the description to clarify this.

jonassmedegaard commented 2 years ago

I want to store chat messages

Fine. This and your aversion to linked lists indicates to me that the model you seek is similar to the one matrix.org seek to solve. So no need to discuss in detail how the other models are inadequate.

But it still does not solve my underlying issue of dealing with large array-like objects.

Not sure what you are saying here. Do you mean to say that matrix.org fails to solve the issue, or something else?

I'm not per say trying to solve chats

matrix.org is trying to solve graphs-of-events - some of which happen to be chat messages.

Some matrix clients have a UI optimized for chat messages, other (less common) matrix clients have different UI.

Seems relevant to me to consider if matrix.org has invented a wheel worthy or reusing instead of reinventing.

joepio commented 2 years ago

I'm not trying to reinvent matrix, I'm trying to find a way to model 1-n relationships in atomic data. Chat messages are just one use case.

And events are a solved problem in Atomic data, with Commits. Whatever the solution of these 1-n relationships will be, it has to work with commits.

Not sure what you are saying here. Do you mean to say that matrix.org fails to solve the issue, or something else?

I don't think matrix.org is trying to solve this issue of 1-n relationships, at least not in some abstract way. Their API does have some query options for chat rooms and messages. I could at least consider creating a similar ChatRoom endpoint instead of trying to solve the abstract problem, but I feel like solving the more abstract problem is the bigger win, if possible. Because even if I have the ChatRoom endpoint, I'd still be left with the other problem such as with the read rights that become too large.

jonassmedegaard commented 2 years ago

If what you are looking for is low-level optimized handling of arbitrarily long lists, then perhaps this recent LWN article is of interest. In itself the article is about kernel developers' struggle to codify list iteration in C, but the discussion also touches on alternative languages including Rust.

joepio commented 2 years ago

Looks interesting, thanks!

ianconsolata commented 2 years ago

I like the LinkedList modeled after ActivityStream / ActivityPub approach. It'll make it easier to integrate ActivityPub Federation with AtomicData Agents later, I think.

That's like a linked list, right? That means every message needs to know about it's previous and possibly even next message. And it doesn't address the question of finding the messages from a chat room.

LinkedLists do in fact need to keep track of previous and next messages, but that is not very hard with the right libraries. And they are much more flexible in their storage than arrays, specifically because they don't require things to be stored continuously, which solves your problem. This is essentially a generalization of the model you initially suggested -- with nextPage / previousPage you're just implementing a LinkedList of arrays. Why the preference for arrays? Is it just a performance concern, or is there some other constraint (like Atomic Data require that each message be a separate resource if it's a linked list)?

I'm not quite sure what you mean that it doesn't address the question of finding messages from a chat room. Could you elaborate? A chat room that used a linked list could reference the first message in a list (likely the most recent one) and then could easily find the full history by traversing the list.

joepio commented 2 years ago

This is essentially a generalization of the model you initially suggested -- with nextPage / previousPage you're just implementing a LinkedList of arrays.

That's true, in a way, but they are not stored in the back-end as linked lists. They are built using custom indexes, which are read in lexicographic order (useful for sorting by date / name, etc), which is then used to convert them into arrays.

I'm not quite sure what you mean that it doesn't address the question of finding messages from a chat room. Could you elaborate?

That would work for smaller chat rooms, but when its thousands of messages, you need some sort of pagination. Maybe my primary concern is: how will the user traverse these large lists? We already have Collections, these solve pretty much all the issues. By offering query parameters we let users paginate things.

A chat room that used a linked list could reference the first message in a list (likely the most recent one) and then could easily find the full history by traversing the list.

Traversing that list sounds hard to make performant, since every item will cause a lookup. That would be at least a little slow on back-ends, and very slow on clients.

I like the LinkedList modeled after ActivityStream / ActivityPub approach. It'll make it easier to integrate ActivityPub Federation with AtomicData Agents later, I think.

Do they use LinkedLists on a page level or resource level?

ianconsolata commented 2 years ago

They use linked lists on a message level, and then typically when serializing a chat you store multiple messages per resource (like an RDF:list). What is the difference between a page and a resource? Aren't those the same, or can a resource span multiple pages?

Why would every item have to cause a lookup? If you can store multiple LinkedList nodes per page / resource, it doesn't need to require any additional lookups than a paginated array would. Though, since AD doesn't allow for #fragment identifiers I'm not quite sure how to index multiple nodes in one resource cleanly.

And the performance really depends on what use cases you want to optimize for. LinkedLists are really good for expensive resources because they can be loaded / evaluated lazily. For the example you cited, Chat Messages, most of the time I only need the most recent few chat messages to display the current screen, and then I can lazily load the rest over time. That seems more efficient than using an array like structure that requires me to load the entire chat history array just to display the last few lines.

Collections do seem to solve this issue, so maybe I misunderstood the question. Rereading the original post again, I see two approaches that directly mention Collections: "Add a new DataType for Collections (note: current favorite)" and "Allowing Collection URLs where ResourceArrays are used". What's the difference between them? In general, APIs with fewer different datatypes are easier for folks to grok, so unifying ResoourceArrays and Collections in some way seems to make a lot of sense.

joepio commented 2 years ago

I see two approaches that directly mention Collections: "Add a new DataType for Collections (note: current favorite)" and "Allowing Collection URLs where ResourceArrays are used". What's the difference between them? In general, APIs with fewer different datatypes are easier for folks to grok, so unifying ResoourceArrays and Collections in some way seems to make a lot of sense.

The first one adds a new datatype, while the second one re-interprets an existing datatype. If we add a new datatype, the one desiging the Class (the data model) has to decide whether the arrays are Collection URLs or ResourceArrays. If we re-interpret the existing ResourceArray datatype, it's up to the one creating the data itself. But if that's the case, all of the complexity of Collections (including query parameters and all that) suddenly become part of the Atomic Data Core spec - which I'd rather keep as simple as possible.

I'm currently working on the ChatRoom implementation, where I've prevented using any new types of 1-n relationships. It's just a custom endpoint. That's the last element of the opening post, but yeah that doesn't really solve the underlying issue for other data models (such as the Read / Write rights, which are currently stored in large arrays).