Open aPinat opened 2 years ago
What would you think the desired fix could be?
That is a good question and I have no idea. If it were just the one instance I would have fixed and PR'd that, but I think this is a greater issue that needs discussion and am not really confident in proposing a fix.
Also is it just guild roles/permissions affected, or are channels, members, etc also affected too?
Well, it can affect anything. It always depends how you query for things.
An example for channels would be querying GetGuildChannelsAsync
, deleting a channel which evicts the cache for that specific channel, but GetGuildChannelsAsync
will still return including that channel.
Though this is not as problematic as the example above.
I guess what could be done is that it can also invalidate the results from the methods as well when:
That would be a possibility. However one would have to evict the cache all the way to the top, i.e. in the first example, a role update means evicting the "all roles" cache, and the whole guild itself. This would mean losing a lot of still valid data. And maybe more importantly data, that is not returned by the REST API can only be cached through the gateway.
I have thought about possible fixes, but none are quite ideal:
Option 1: Update all caches all the way to the top instead of evicting them, so it is consistent everywhere. Though quite inefficient, since in the example one would need to update 3 cache keys, of which 2 need to be queried to update the original value instead of just overwriting.
Option 2: Remove all duplicate cache locations and only store 1 version of each cached entity in the entire cache. In our example that could be keeping everything in the main guild object cache and update the changed role in there and have all methods look in the big guild object. But loading the huge guild object for every small info isn't great either, especially for the cache in Redis. Or going the other way around, meaning in our example to cache each role by itself and not have them in the main guild object at all. However now when making a request for a whole guild object, that method would need to fetch each role (and other additional properties) to then return a combined guild object from all "sub"-caches.
I thought that fetching IGuild meant that it won't include the roles, members and channels by default, so basically I was doing queries for the roles / channels when I could have only fetched the guild by id?
So, generally, this is by design since data objects in the cache are immutable - this is for a multitude of reasons, but mainly thread safety. Discord itself operates on "eventual consistency" (which, to be honest, doesn't really mean much), so this cache behaviour is more or less expected. Discord also usually exposes routes to either get something like roles individually or by batch on their own, instead of going through another object which may or may not contain the values you're looking for.
The primary reason I have a short cache duration by default is so this issue is minimized. You can request a non-caching implementation if you're in a critical path that must always know the latest information, but this does come with the usual cacheless caveats.
I don't have a good fix either, but directly mutating values in the cache is out of the question. Creating new ones (fusing old information with new information) and replacing values is definitely something that could be done, and I have considered it in the past, but it means adding a lot of fragile code that needs to be maintained and updated.
For the moment, I would strongly recommend not relying on retrieving nested objects from related ones (IUser
from IGuildMember
or IRole
from IGuild
, for example) on critical code paths where eventual consistency is problematic, and using the direct routes instead (GetGuildRolesAsync
, GetGuildMemberAsync
etc.).
Yeah I mostly agree. By updating the cache I didn't mean to mutate existing objects, but to fetch the old and save a new one as you point out. Also mutating existing objects wouldn't be compatible with the Redis cache either currently.
I thought that fetching IGuild meant that it won't include the roles, members and channels by default, so basically I was doing queries for the roles / channels when I could have only fetched the guild by id?
@AraHaan IGuild
should contain the roles, however channels or members only are included if it is still cached from the IGuildCreate
payload from the gateway.
For the moment, I would strongly recommend not relying on retrieving nested objects from related ones (
IUser
fromIGuildMember
orIRole
fromIGuild
, for example) on critical code paths where eventual consistency is problematic, and using the direct routes instead (GetGuildRolesAsync
,GetGuildMemberAsync
etc.).
@Nihlus Agreed, though then the implementation of RequireDiscordPermissionCondition
from the original example, by not querying the most direct way by using GetGuildRoleAsync
for every single role of a member still leads to an incorrect result until cache timeout because GetGuildRolesAsync
has an incorrect permission value in cache.
I think this is still an important bug, because it would allow a role that has a permission stripped to still execute a command they shouldn't be able to use.
But since I wasn't sure whether there are other instances where the cache would lead to something like this, I rather opened this general issue first.
Yeah - in that particular case, it's weighing a small time window of potential permission bypass from someone who did have the right permissions before against a large increase in API hits for bigger bots. I'm honestly not sure what the best way is right now :(
If it is only a permission update on a role that can cause such a permission bypass, I think best option would be either to evict the :Roles
cache or save a new IRole
list to it where the updated role is replaced in the old list. A change in the cache responders for IGuildRoleDelete
, IGuildRoleUpdate
and for completion IGuildRoleCreate
would be needed.
However I don't know the code good enough yet to know whether there are other cases where a critical case like a permission bypass can happen, other than this role update.
For other more unimportant cases like a channel delete and it still being in :Channels
I think we can ignore (for now), since worst case we get a couple 404
s by Discord, which isn't ideal, but definitely not as critical as the above can be.
Creating new ones (fusing old information with new information) and replacing values is definitely something that could be done, and I have considered it in the past, but it means adding a lot of fragile code that needs to be maintained and updated.
Or we do update the CacheResponder
s to implement what you had mentioned and what I would consider to probably be the best option, because then the cache expiration can be set very high without any issues as well and there is no need to query the REST API again, because state updates are sent to us over the gateway and we update it everywhere correctly in our store. It would also more closely mirror how Discords own client works, since they don't re-fetch a guild or channels from the API after getting them initially from the gateway and update them through gateway events.
Though I don't think we would need to go as far as to update the whole IGuild
cache for a single role change (but it would be correct), but at least update the IRole[]
one.
I have been thinking of making a custom way to cache the thing without using MemoryCache by making a special collection type that will hold each type of item from the events then on the Create events store them, the update events invalidate what came from the Create events, then on the delete events evict them from the collection completely in an individual item basis (which for my bot could mean I could significantly reduce requests to Discord per second which would reduce the chance of getting temp api banned from hitting ratelimits constantly).
I believe that is how other libraries handle it as well, and how I've gone ahead to cache/track VoiceState
s and Presence
s.
In theory I believe that is how the MemoryCache works as well, just without the wrapper around the ConcurrentDict or expiration.
However the question is how to abstract that and make it work with the Redis caching package as well.
I have been thinking of making a custom way to cache the thing without using MemoryCache by making a special collection type that will hold each type of item from the events then on the Create events store them, the update events invalidate what came from the Create events, then on the delete events evict them from the collection completely in an individual item basis (which for my bot could mean I could significantly reduce requests to Discord per second which would reduce the chance of getting temp api banned from hitting ratelimits constantly).
This is exactly what Remora's caching does. The issue is that Discord doesn't always, for example, send a guild update immediately when you change a role, and then the cache is out of sync. The only way to get around this is by performing data fusing or mutation, or not caching compound objects (which I definitely think is out of the question).
I believe that is how other libraries handle it as well, and how I've gone ahead to cache/track
VoiceState
s andPresence
s. In theory I believe that is how the MemoryCache works as well, just without the wrapper around the ConcurrentDict or expiration. However the question is how to abstract that and make it work with the Redis caching package as well.
The main reason Remora doesn't cache voice states and presences by default is because they come in at astounding rates for large bots, and is generally not used by most bots. As for Redis, it works OOTB as it is at the moment.
This is exactly what Remora's caching does. The issue is that Discord doesn't always, for example, send a guild update immediately when you change a role, and then the cache is out of sync. The only way to get around this is by performing data fusing or mutation, or not caching compound objects (which I definitely think is out of the question).
Not quite, because Remora doesn't fuse/update its cached compound objects (way better wording than how I tried to describe it).
Thing is, Discord will never(!) send IGuildUpdate
when you only change a role/channel/emoji etc. That is what IRoleUpdate
, IChannelDelete
etc are for. It is not only not sent immediately or delayed, it will never come.
Only changes that actually need Discord to send the full guild, because there is no smaller payload defined, will send IGuildUpdate
, such as a guild name change or icon change etc. Those are relatively rare, so I guess them sending a whole IGuild
doesn't matter to them as much.
I believe that we should be doing data fusion for the compound objects to keep all our cached objects up to date in addition to not having the cache expire, because that is what Discord expects/wants clients to do, instead of calling the REST API: https://discord.com/developers/docs/topics/gateway#tracking-state
The main reason Remora doesn't cache voice states and presences by default is because they come in at astounding rates for large bots, and is generally not used by most bots.
Yes, I think that is a good thing and was definitely not criticizing that. It was just an anecdote to @AraHaan's custom cache.
As for Redis, it works OOTB as it is at the moment.
Yeah, I was wondering how @AraHaan's caching solution would work with Redis. In our case of performing data fusion, it would cause a lot more roundtrips to Redis however.
Description In
Remora.Discord.Caching
there are instances where when a gateway event occurs or an API call is made, that a 'child' entity is updated in the cache, however a 'parent' entity that holds information about this child doesn't get updated. This creates inconsistencies in the cache, where the supposed same entities may be different depending on where you're looking. This causes a variety of issues downstream in the library, an example of which I will outline below, but I think there are various similar issues.Steps to Reproduce An example: The permissions for a role is updated. The responder below updates a specific guild role in the cache
Guild:...:Role:...
, however doesn't updateGuild:...:Roles
(or perhaps evenGuild:...
).https://github.com/Nihlus/Remora.Discord/blob/a117ef6c9b9b69ad058302b93f1a5b410a0c5eae/Backend/Remora.Discord.Caching/Responders/EarlyCacheResponder.cs#L228-L234
This in turn causes code that only checks for
Guild:...:Roles
to still have outdated and incorrect information: TheCachingDiscordRestGuildAPI
inGetGuildRolesAsync()
returns outdated information about the changed role and still gives us the old permissions.https://github.com/Nihlus/Remora.Discord/blob/a117ef6c9b9b69ad058302b93f1a5b410a0c5eae/Backend/Remora.Discord.Caching/API/CachingDiscordRestGuildAPI.cs#L584-L614
This again may be used downstream for example in
RequireDiscordPermissionCondition
where a check for a permission goes through even though the permission was removed from the role.https://github.com/Nihlus/Remora.Discord/blob/a117ef6c9b9b69ad058302b93f1a5b410a0c5eae/Remora.Discord.Commands/Conditions/RequireDiscordPermissionCondition.cs#L164-L192
Expected Behavior
In this example, when requiring a certain permission for a command and then removing it from a role without restarting the bot instance, the command will error due to missing permissions.
Current Behavior
In this example, when requiring a certain permission for a command and then removing it from a role without restarting the bot instance, the command will still execute despite the user/their role missing the required permissions.