Remora / Remora.Discord

A data-oriented C# Discord library, focused on high-performance concurrency and robust design.
GNU Lesser General Public License v3.0
245 stars 46 forks source link

Caching inconsistencies #189

Open aPinat opened 2 years ago

aPinat commented 2 years ago

Description In Remora.Discord.Caching there are instances where when a gateway event occurs or an API call is made, that a 'child' entity is updated in the cache, however a 'parent' entity that holds information about this child doesn't get updated. This creates inconsistencies in the cache, where the supposed same entities may be different depending on where you're looking. This causes a variety of issues downstream in the library, an example of which I will outline below, but I think there are various similar issues.

Steps to Reproduce An example: The permissions for a role is updated. The responder below updates a specific guild role in the cache Guild:...:Role:..., however doesn't update Guild:...:Roles (or perhaps even Guild:...).

https://github.com/Nihlus/Remora.Discord/blob/a117ef6c9b9b69ad058302b93f1a5b410a0c5eae/Backend/Remora.Discord.Caching/Responders/EarlyCacheResponder.cs#L228-L234

This in turn causes code that only checks for Guild:...:Roles to still have outdated and incorrect information: The CachingDiscordRestGuildAPI in GetGuildRolesAsync() returns outdated information about the changed role and still gives us the old permissions.

https://github.com/Nihlus/Remora.Discord/blob/a117ef6c9b9b69ad058302b93f1a5b410a0c5eae/Backend/Remora.Discord.Caching/API/CachingDiscordRestGuildAPI.cs#L584-L614

This again may be used downstream for example in RequireDiscordPermissionCondition where a check for a permission goes through even though the permission was removed from the role.

https://github.com/Nihlus/Remora.Discord/blob/a117ef6c9b9b69ad058302b93f1a5b410a0c5eae/Remora.Discord.Commands/Conditions/RequireDiscordPermissionCondition.cs#L164-L192

Expected Behavior

In this example, when requiring a certain permission for a command and then removing it from a role without restarting the bot instance, the command will error due to missing permissions.

Current Behavior

In this example, when requiring a certain permission for a command and then removing it from a role without restarting the bot instance, the command will still execute despite the user/their role missing the required permissions.

AraHaan commented 2 years ago

What would you think the desired fix could be?

aPinat commented 2 years ago

That is a good question and I have no idea. If it were just the one instance I would have fixed and PR'd that, but I think this is a greater issue that needs discussion and am not really confident in proposing a fix.

AraHaan commented 2 years ago

Also is it just guild roles/permissions affected, or are channels, members, etc also affected too?

aPinat commented 2 years ago

Well, it can affect anything. It always depends how you query for things. An example for channels would be querying GetGuildChannelsAsync, deleting a channel which evicts the cache for that specific channel, but GetGuildChannelsAsync will still return including that channel. Though this is not as problematic as the example above.

AraHaan commented 2 years ago

I guess what could be done is that it can also invalidate the results from the methods as well when:

aPinat commented 2 years ago

That would be a possibility. However one would have to evict the cache all the way to the top, i.e. in the first example, a role update means evicting the "all roles" cache, and the whole guild itself. This would mean losing a lot of still valid data. And maybe more importantly data, that is not returned by the REST API can only be cached through the gateway.

I have thought about possible fixes, but none are quite ideal:

Option 1: Update all caches all the way to the top instead of evicting them, so it is consistent everywhere. Though quite inefficient, since in the example one would need to update 3 cache keys, of which 2 need to be queried to update the original value instead of just overwriting.

Option 2: Remove all duplicate cache locations and only store 1 version of each cached entity in the entire cache. In our example that could be keeping everything in the main guild object cache and update the changed role in there and have all methods look in the big guild object. But loading the huge guild object for every small info isn't great either, especially for the cache in Redis. Or going the other way around, meaning in our example to cache each role by itself and not have them in the main guild object at all. However now when making a request for a whole guild object, that method would need to fetch each role (and other additional properties) to then return a combined guild object from all "sub"-caches.

AraHaan commented 2 years ago

I thought that fetching IGuild meant that it won't include the roles, members and channels by default, so basically I was doing queries for the roles / channels when I could have only fetched the guild by id?

Nihlus commented 2 years ago

So, generally, this is by design since data objects in the cache are immutable - this is for a multitude of reasons, but mainly thread safety. Discord itself operates on "eventual consistency" (which, to be honest, doesn't really mean much), so this cache behaviour is more or less expected. Discord also usually exposes routes to either get something like roles individually or by batch on their own, instead of going through another object which may or may not contain the values you're looking for.

The primary reason I have a short cache duration by default is so this issue is minimized. You can request a non-caching implementation if you're in a critical path that must always know the latest information, but this does come with the usual cacheless caveats.

I don't have a good fix either, but directly mutating values in the cache is out of the question. Creating new ones (fusing old information with new information) and replacing values is definitely something that could be done, and I have considered it in the past, but it means adding a lot of fragile code that needs to be maintained and updated.

For the moment, I would strongly recommend not relying on retrieving nested objects from related ones (IUser from IGuildMember or IRole from IGuild, for example) on critical code paths where eventual consistency is problematic, and using the direct routes instead (GetGuildRolesAsync, GetGuildMemberAsync etc.).

aPinat commented 2 years ago

Yeah I mostly agree. By updating the cache I didn't mean to mutate existing objects, but to fetch the old and save a new one as you point out. Also mutating existing objects wouldn't be compatible with the Redis cache either currently.

I thought that fetching IGuild meant that it won't include the roles, members and channels by default, so basically I was doing queries for the roles / channels when I could have only fetched the guild by id?

@AraHaan IGuild should contain the roles, however channels or members only are included if it is still cached from the IGuildCreate payload from the gateway.

For the moment, I would strongly recommend not relying on retrieving nested objects from related ones (IUser from IGuildMember or IRole from IGuild, for example) on critical code paths where eventual consistency is problematic, and using the direct routes instead (GetGuildRolesAsync, GetGuildMemberAsync etc.).

@Nihlus Agreed, though then the implementation of RequireDiscordPermissionCondition from the original example, by not querying the most direct way by using GetGuildRoleAsync for every single role of a member still leads to an incorrect result until cache timeout because GetGuildRolesAsync has an incorrect permission value in cache. I think this is still an important bug, because it would allow a role that has a permission stripped to still execute a command they shouldn't be able to use. But since I wasn't sure whether there are other instances where the cache would lead to something like this, I rather opened this general issue first.

Nihlus commented 2 years ago

Yeah - in that particular case, it's weighing a small time window of potential permission bypass from someone who did have the right permissions before against a large increase in API hits for bigger bots. I'm honestly not sure what the best way is right now :(

aPinat commented 2 years ago

If it is only a permission update on a role that can cause such a permission bypass, I think best option would be either to evict the :Roles cache or save a new IRole list to it where the updated role is replaced in the old list. A change in the cache responders for IGuildRoleDelete, IGuildRoleUpdate and for completion IGuildRoleCreate would be needed.

However I don't know the code good enough yet to know whether there are other cases where a critical case like a permission bypass can happen, other than this role update.

For other more unimportant cases like a channel delete and it still being in :Channels I think we can ignore (for now), since worst case we get a couple 404s by Discord, which isn't ideal, but definitely not as critical as the above can be.

Creating new ones (fusing old information with new information) and replacing values is definitely something that could be done, and I have considered it in the past, but it means adding a lot of fragile code that needs to be maintained and updated.

Or we do update the CacheResponders to implement what you had mentioned and what I would consider to probably be the best option, because then the cache expiration can be set very high without any issues as well and there is no need to query the REST API again, because state updates are sent to us over the gateway and we update it everywhere correctly in our store. It would also more closely mirror how Discords own client works, since they don't re-fetch a guild or channels from the API after getting them initially from the gateway and update them through gateway events. Though I don't think we would need to go as far as to update the whole IGuild cache for a single role change (but it would be correct), but at least update the IRole[] one.

AraHaan commented 2 years ago

I have been thinking of making a custom way to cache the thing without using MemoryCache by making a special collection type that will hold each type of item from the events then on the Create events store them, the update events invalidate what came from the Create events, then on the delete events evict them from the collection completely in an individual item basis (which for my bot could mean I could significantly reduce requests to Discord per second which would reduce the chance of getting temp api banned from hitting ratelimits constantly).

aPinat commented 2 years ago

I believe that is how other libraries handle it as well, and how I've gone ahead to cache/track VoiceStates and Presences. In theory I believe that is how the MemoryCache works as well, just without the wrapper around the ConcurrentDict or expiration. However the question is how to abstract that and make it work with the Redis caching package as well.

Nihlus commented 2 years ago

I have been thinking of making a custom way to cache the thing without using MemoryCache by making a special collection type that will hold each type of item from the events then on the Create events store them, the update events invalidate what came from the Create events, then on the delete events evict them from the collection completely in an individual item basis (which for my bot could mean I could significantly reduce requests to Discord per second which would reduce the chance of getting temp api banned from hitting ratelimits constantly).

This is exactly what Remora's caching does. The issue is that Discord doesn't always, for example, send a guild update immediately when you change a role, and then the cache is out of sync. The only way to get around this is by performing data fusing or mutation, or not caching compound objects (which I definitely think is out of the question).

I believe that is how other libraries handle it as well, and how I've gone ahead to cache/track VoiceStates and Presences. In theory I believe that is how the MemoryCache works as well, just without the wrapper around the ConcurrentDict or expiration. However the question is how to abstract that and make it work with the Redis caching package as well.

The main reason Remora doesn't cache voice states and presences by default is because they come in at astounding rates for large bots, and is generally not used by most bots. As for Redis, it works OOTB as it is at the moment.

aPinat commented 2 years ago

This is exactly what Remora's caching does. The issue is that Discord doesn't always, for example, send a guild update immediately when you change a role, and then the cache is out of sync. The only way to get around this is by performing data fusing or mutation, or not caching compound objects (which I definitely think is out of the question).

Not quite, because Remora doesn't fuse/update its cached compound objects (way better wording than how I tried to describe it). Thing is, Discord will never(!) send IGuildUpdate when you only change a role/channel/emoji etc. That is what IRoleUpdate, IChannelDelete etc are for. It is not only not sent immediately or delayed, it will never come. Only changes that actually need Discord to send the full guild, because there is no smaller payload defined, will send IGuildUpdate, such as a guild name change or icon change etc. Those are relatively rare, so I guess them sending a whole IGuild doesn't matter to them as much. I believe that we should be doing data fusion for the compound objects to keep all our cached objects up to date in addition to not having the cache expire, because that is what Discord expects/wants clients to do, instead of calling the REST API: https://discord.com/developers/docs/topics/gateway#tracking-state

The main reason Remora doesn't cache voice states and presences by default is because they come in at astounding rates for large bots, and is generally not used by most bots.

Yes, I think that is a good thing and was definitely not criticizing that. It was just an anecdote to @AraHaan's custom cache.

As for Redis, it works OOTB as it is at the moment.

Yeah, I was wondering how @AraHaan's caching solution would work with Redis. In our case of performing data fusion, it would cause a lot more roundtrips to Redis however.