Open brandonsturgeon opened 1 year ago
As an extension to this issue, it seems that Express isn't managing its cache correctly.
I notice on my long-running server, even with retries and send delay, sometimes data that is cached serverside isn't able to be downloaded. This means that the addon isn't expiring its cache as quickly as the items actually expire on the Express Service.
In my testing, the Retries+Send Delay seems to work as expected. I'll fix this caching bug and then tag a new release.
This problem appears to have subsided with the merge of #34 👍
This is a really annoying issue that, according to Cloudflare's docs, shouldn't be happening.
The Problem
Cloudflare KV docs state:
So what they're saying here is that KV achieves low latency by caching KV lookups. On the first read from a given location, it'll get a cache MISS and have to traverse the Cloudflare nodes to find the actual value, and then it'll cache that value to the location where the lookup occurred.
This is fine. Good, even - we can deal with that added latency for a single client.
A quick refresher on how Express works: Our current flow is something like this:
However, what we're actually seeing, is the Recipient asks Express Service for the UUID, and gets a 404!
How does that work? It's not possible for it to actually be a 404 at this point because the data definitely exists in KV. We wouldn't have the UUID unless it were already stored!
The worst part about this bug is that this 404 is actually cached for that location! Meaning anyone else trying to read the Data for that UUID just gets a cached 404 🤦♂️
The Solutions
Assuming this isn't some obscure bug with the Express Service (I suppose it's possible), there are a few different ways to tackle this.
1. Have clients retry for the data _(Proposed in https://github.com/CFC-Servers/gm_express/pull/34)_ We could just make the client ask for the data over and over until the cache busts? 😅
2. Have the sender wait a brief moment before sending the net message with the data's UUID _(Proposed in https://github.com/CFC-Servers/gm_express/pull/34)_ If it really was a timing issue, perhaps delaying the net message by even a few fractions of a second could increase the chance of a successful first GET. This would be a very minor delay, but it would increase the minimum time of every message by that much.
3. Create a comprehensive cross-region, bare-bones example demonstrating this as a Cloudflare bug and ask Cloudflare to fix it This should probably be done, if at least just to confirm that the bug is Cloudflare's like I'm describing it... But man, that's a lot of work.