Data storage API & persistence

Lonami / grammers

(tele)gramme.rs - use Telegram's API from Rust

https://t.me/gramme_rs

Apache License 2.0

561 stars 116 forks source link

Data storage API & persistence #265

Open MOZGIII opened 2 months ago

MOZGIII commented 2 months ago

I was looking into MTProto and grammers implementation for some time now, and I really can't understand how the persistence in grammers is supposed to work.

To elaborate: there is a session that the client and load and store, and the session contains the starting point from which to read and apply the differences - so, when you reload you are supposed to load the session to be able to avoid loading all the state again from scratch; at the same time, the only way the state is handed is in-memory - it is not persisted anywhere, and grammers client does not expose an API to allow user to handle the updates and persist them either.

What this means, I think, is that on restart grammers will loose the data it gathered from the past updates (chats and users, both for the connecting user and all channels) but will not reload them, thus getting into a state of only having partial data actually available. This kind of defeats the purpose of the clever design of the Telegram updates - as all the data is just lost in the end.

Maybe grammers client should not keep the data internally, but, instead, provide the APIs to offload the data storage to the user?

Alternatively, grammers could expose fns for accessing the internal data tables such that they can be persisted together with the session state - this would, in effect, allow no data loss, since it would be possible to replay the data after the last snapshot; the downside of it is the API boundary of the snapshots is moved to the application level, while with the fully offloaded data storage can utilize database/fs-level snapshots for this, which can be much more efficient and needs less work from the implementor.

Lonami commented 2 months ago

the only way the state is handed is in-memory

You can save the state to disk on demand:

https://github.com/Lonami/grammers/blob/d180f024e553f05c6a5f120484ad68d2a2c5f73d/lib/grammers-client/examples/echo.rs#L97

My idea was that as soon as you got an update:

https://github.com/Lonami/grammers/blob/d180f024e553f05c6a5f120484ad68d2a2c5f73d/lib/grammers-client/examples/echo.rs#L80-L84

the library could assume the update has been processed by the application.

Note that the state grammers saves is minimal:

https://github.com/Lonami/grammers/blob/d180f024e553f05c6a5f120484ad68d2a2c5f73d/lib/grammers-session/build.rs#L23-L31

No peers, no update contents. That's up to the application. AFAIK clients like TDesktop don't store these either (but it does cache files).

chats and users, both for the connecting user and all channels

The idea behind making it easier to "pack" users and channels is that the application would store those, instead of just the IDs. I agree this needs some more work though.

the purpose of the clever design of the Telegram updates

Not sure I follow, as my only opinion on this is that it's messy as hell. Though, I guess that's the point of it being abstracted behind a library.

Maybe grammers client should not keep the data internally, but, instead, provide the APIs to offload the data storage to the user

Happy to offer more knobs, or adjust the existing one to be more obvious / play better together. You're right my ideal may not have been implemented fully correctly.

With that said, in my mind, even if an application only persists update state every minute, I would expect it to be able to replay those with idempotency.

accessing the internal data tables

Except for the internal peer cache, which I'm not entirely happy with but is AFAIK a necessary evil due to how tied it is with updates, all the rest is in the session, which can be accessed as shown above.

I might be me something. I have not built a full application with the library. So again I'm happy to revisit this, if you have concrete ideas.

MOZGIII commented 2 months ago

Thanks for the quick feedback!

My main pain point is that the restarts of the app loose all the state the library keeps internally, like access hash maps. I will do further work to figure out the optimal changes to extend the grammers code to allow for what I have in mind, submit a PR and then we can discuss some concrete code.

Lonami commented 2 months ago

access hash maps

But this should be fine. It only means the library may need to fetch more data to get it back, but that's it. I do not want to persist hashes in the session. I believe storing what users the application interacts with, is best left to the application, because it can choose how to persist that.

MOZGIII commented 2 months ago

What way does the application have to detect that a new chat was added to the chats map, without reimplementing the updates handling logic by itself? Maybe I'm missing something...

persist hashes in the session

Yes, I'm not suggesting that...

the library may need to fetch more data to get it back, but that´s it

How would that work though? I don't see a way for the library, running as a bot, to know what channels the bot was added to. Our use-case is a channel members maintenance, where the bot would be an admin in a lot of the chats/channels, and we'd need to accurately react to each member-related update event, but also support resuming (and going through the missed updates) after we have restarted, and doing the bookkeeping that is not triggered by the chat/channel update.

That last part would be the issue - suppose the app restarts, and we know the channel id we want to, say, list participants at. We'd also need to have access hash to call channels.channelParticipants - but it would not be available in the library state after the reboot if the chat didn't have any updates in it. Am I correct? Am I missing something?

Also, we need a way to use the channel id to lookup the channel in the chats map - that only supports doing lookups by the peer id currently.

Anyhow, still planning to do what I suggested at https://github.com/Lonami/grammers/issues/265#issuecomment-2368780547 - this is just a bit more info about the issues we're facing as maybe there is no issue at all and I've just overlooked a solution...

Lonami commented 2 months ago

What way does the application have to detect that a new chat was added to the chats map, without reimplementing the updates handling logic by itself

None. But I'm not sure why we need to detect when a chat is added. If the library makes sure all peers from updates can be fetched, the application can fetch whatever it needs, when it needs it.

How would that work though? I don't see a way for the library, running as a bot, to know what channels the bot was added to

Getting difference would fetch all updates from the last-known point. This should include all seen channels (the state for each channel should be persisted).

access hash to call channels.channelParticipants

Note that this is a type, not a function. Getting information about a single participant needs a hash though, yes.

it would not be available in the library state after the reboot if the chat didn't have any updates in it

If the library caught up on missed updates, the updates about users joining should contain the hash, so it would be known again.

Also, we need a way to use the channel id to lookup the channel in the chats map - that only supports doing lookups by the peer id currently.

Not sure I understand. In the library I use the term "peer ID" to refer to the ID of any peer, including the IDs of channels.

MOZGIII commented 2 months ago

why we need to detect when a chat is added

Well, to persist the id and access hash in the database - for later. Precisely for what you tell to to later - to persist the state of the channels.

You say the application has to persist the channels state, but also that there are no ways to react to new channel appearing. Do you see the problem here?

the state for each channel should be persisted

Why should is highlighted here? Like, should, but no way to easily do it right now?

Not sure I understand. In the library I use the term "peer ID" to refer to the ID of any peer, including the IDs of channels.

Yes, but note that it is used incorrectly in the chats map as a key without the qualification of the kind of entry this id belongs to. See https://core.telegram.org/api/peers#peer-id - the docs explicitly state that ids of different kinds of entries overlap.

Lonami commented 2 months ago

to persist the id and access hash in the database

My thought was this would be done regardless for all incoming peers you could need, regardless of whether you knew that you had them or not.

Why should is highlighted here?

Bugs or situations I haven't thought of. It's how I tend to speak (I just did it again).

the docs explicitly state that ids of different kinds of entries overlap

Good catch. First time I see that. I suspect Daniil has continued to put great work in the documentation, so I can't say how recent that is.

I have never seen it happen personally, so I had assumed it could never happen, and was hoping either Grammers or Telethon could prove me wrong. Now that there's an official statement, I don't need to anymore.

All that said I'm still open to changes that could better suit your usecase. Building things is the best way to see what's missing, and I'm not building anything with the library at the moment, so I don't see the pain points you see.

Lonami commented 1 month ago

Another discussion about this and brief reading of the linked docs made me realize that grammers-session likely does need to become both far more powerful and store both IDs and hashes.

MOZGIII commented 1 month ago

Another discussion about this and brief reading of the linked docs made me realize that grammers-session likely does need to become both far more powerful and store both IDs and hashes.

Makes sense, if they suggest not exposing them. Just to clarify: you are planning to, effectively, include the peers database in the session? If that is the case - it makes sense. However, the session itself would probably then require an abstraction for the data storage layer - as we, for instance, wouldn't want to keep all the peers in memory all at once. I mean, going from loading and storing the session (and keeping it all in memory) to, maybe, introducing a Storage trait that would provide the interface for reading and writing all the session data (message box counters and peer ids / access hashes).

I have been slicing and dicing the client in the meantime, and figured if we just ignore that it exposes the chats map and handle the raw updates ourselves as they arrive we can maintain our own, persistent peers db, and keep it together with the session (in the database).

So, my code proposal is not ready jus yet, but it will likely be focused on extracting the layer of maintaining the chat maps into a separate entity so that if we (or someone else) build our own - it is possible to replace the stock one, and the client can actually skip this extra work it is not needed it; in our case we don't use any of the high-level APIs as they are too dependent on the other client features that we don't really want to pay for at runtime; still thinking about the proper API and namings to suggest - I would want to preserve the functionality of the high-level client for easy use for the users with less demanding / custom use cases - but at the same time it looks like the client currently contains a subset of operations that provides very simple and low-level operations (like maintaining the connections) that would be really beneficial to reuse. I wouldn't want to even attempt diving deep into a broad refactoring and subdividing of the client into smaller composable pieces on just my own - as I don't want to spend a lot of time on it and get it thrown away on review. That said, ultimately, transforming the client (and maybe session too) into a set of smaller composable modules would actually be my suggestion. How would you like to proceed?

Lonami commented 1 month ago

include the peers database in the session?

Yes.

introducing a Storage

Yes, a trait for Session is something we briefly discussed. Name bikesheddable, of course.

if we just ignore that it exposes the chats map and handle the raw updates ourselves as they arrive we can maintain our own, persistent peers db, and keep it together with the session

It's a shame to duplicate the work though. I guess the library can still do the heavy lifting with update ordering and resolving gaps though.

How would you like to proceed?

The README briefly touches on the ideal:

https://github.com/Lonami/grammers/blob/0dc27ee8a4c63067e3377b48eab8a35b883d7f87/README.md?plain=1#L14-L25

To expand a bit, grammers-client should, IMO, be exclusively high level. No raw API anywhere, except for .invoke, and a feature to gain access to the raw fields outside of semver.

grammers-session, with the discussion here, likely needs to grow support for better storages (likely default to SQLite and storing the access hashes, as it's done in Telethon, but using traits so users can replace the storage or bypass it entirely). It would be good to move as much of update handling as a sans-io implementation to the session too.

grammers-mtsender should ideally be responsible for providing strong connection guarantees.

So in my mind, if one doesn't want the high-level features, they would use the mtsender and session crates together, and drop client entirely.

(Though of course, ideally the client would be powerful and have a low-enough abstraction cost where most would use it, as it bundles everything together.)

MOZGIII commented 1 month ago

So in my mind, if one doesn't want the high-level features, they would use the mtsender and session crates together, and drop client entirely.

Yes, this is exactly what we're doing, however there are a lot of things that are implemented directly in the client that I could benefit from when building, effectively, our own client.

Concretely, so far:

authentication
updates and gap handling loop

Also, I find it odd that client implements the connection the way it does, since it would be much more flexible to just have a Connector trait, and then maybe have a proxy implementation of the connector in a separate type (vs features - as that is a mess to read, modify and test).

Thus I'd suggest the following changes, as I see them as huge improvements:

extract an updates loop from the client, maybe into a standalone crate or type; decouple it from the chat maps, and make a trait for handling the incoming users and chats from the Updates - this is what I've been working towards lately, and once it gets into shape I'll share my proposal in the form of PR
move the connection establishment logic into a trait, and move from implementing proxy support via cfg mess into high-level trait-based interfaces
extract authentication utilities from the client into the more reusable pieces; technically, it should be possible to extract anything that doesn't require the chat map

Ultimately, we'd prefer using the client if it allowed us to avoid doing the extra work - which can be perfectly expressed via traits but with, probably, a bit too much complexity in the API surface for the high-level client. I suggest splitting the client into client core - the low-level thing that ties the connection, session, mtsender and updates loop together, but allow for generic interfaces for what the chats map is and how it is maintained - and the (just) client, that would be the same high-level interface that it is today, and would be built on top of the client core with specifying a particular implementation of the chats map (this is the rough idea, might also be more generics between the two).

I'm currently working on essentially our own lightweight version of the client (i.e. client core) atm wit all that in mind, and will be happy to upstream it if it is approved when I showcase it.

Lonami commented 1 month ago

Concretely, so far:

* authentication

* updates and gap handling loop

Fair enough. Those do seem hard to decouple even further.

client implements the connection the way it does, since it would be much more flexible to just have a Connector trait, and then maybe have a proxy implementation of the connector in a separate type (vs features - as that is a mess to read, modify and test)

Fully agree (connectors were also discussed). Just like in Telethon, proxy support was not added by me but someone else, and I kinda took it in because many people need it.

extract an updates loop from the client, maybe into a standalone crate or type; decouple it from the chat maps

Unfortunately chat maps are pretty tied to update handling. Quoting Working with Updates - Recovering gaps:

Incomplete update: the client is missing data about a chat/user from one of the shortened constructors, such as updateShortChatMessage, etc.

So, even if we separate them, I think it would make sense to keep both in the grammers-session.

move the connection establishment logic into a trait, and move from implementing proxy support via cfg mess into high-level trait-based interfaces

I eagerly await this change!

extract authentication utilities from the client into the more reusable pieces; technically, it should be possible to extract anything that doesn't require the chat map

I do wonder how far you'll be able to take this. The password logic makes sense for grammers-crypto to handle. But the handling of different error states, not so sure.

Maybe it's enough to leave that as standalone functions in the grammers-client crate and let dead-code-elimination get rid of the entire Client.

I suggest splitting the client into client core [...]

Seems reasonable. But probably should remain in the grammers-client crate.