Badge is not updating in a timely way

Fryguy commented 1 year ago

I recently achieved 100% for my project, but the badge continues to say 99%. I've tried clearing cache and cookies, so I think it's something server side? Perhaps it's run on a schedule?

Also see https://www.bestpractices.dev/en/projects/4282

Fryguy commented 1 year ago

Gah, I just noticed https://github.com/coreinfrastructure/best-practices-badge/issues/2070, so might be a duplicate of that one.

justinmclean commented 8 months ago

I'm also having the same issue: https://www.bestpractices.dev/projects/8358

david-a-wheeler commented 8 months ago

Drat. We made a number of changes that should have completely eliminated this problem. I notice that https://www.bestpractices.dev/en/projects/4282 is at 100%, including its listing in /projects, so it eventually updates, but it takes its time. Too much time.

We specifically tell our CDN (Fastly) to throw away the badge image. My best current hypothesis is that there's a race between where we tell the CDN to drop it and the retrieval of the image. We update the data before we tell the CDN to drop it, and take various steps to make sure it works 100% when tested locally. However, on the deployed system, if the two actions are not handled in the order we supply, then this could be the result. That would explain a number of problems.

If this hypothesis is correct, then I think there are two things we can do:

1. In the /projects list, if the change was made recently, force a specific image to be shown. We already do this on the specific badge entries, for the same reason.
2. After a badge level change, resend a "throw away the cache of this image" request after a period of time. That way, if the CDN receives the commands out-of-order, it'll eventually clear the cache and get the correct answer this time.
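
The second idea (purge right away, then purge again after a delay) could be sketched roughly as follows. This is only an illustration in Python, not the site's actual code: `purge_now_and_later` is a hypothetical helper, and the `purge` callable stands in for whatever really sends the "throw away this image" request to the CDN.

```python
import threading
from typing import Callable


def purge_now_and_later(
    purge: Callable[[str], None], url: str, delay_seconds: float
) -> threading.Timer:
    """Purge `url` immediately, then schedule one repeat purge after a delay.

    `purge` is a stand-in (an assumption) for the real CDN purge request.
    """
    purge(url)  # first purge, issued right after the data has been updated
    timer = threading.Timer(delay_seconds, purge, args=(url,))
    timer.daemon = True  # do not keep the process alive just for the re-purge
    timer.start()
    return timer
```

If the CDN happens to process the first purge before it sees the updated data, the delayed purge gives it a second chance to drop the stale image. Note that the timer lives only in memory, so the repeat purge is lost if the process restarts before it fires.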

Other ideas welcome.

david-a-wheeler commented 8 months ago

BTW, congrats on earning the badge!

david-a-wheeler commented 8 months ago

We've made changes (as I said), but I've left this issue open because it's notoriously hard to be sure that you fixed a race condition - it's not a bug that reliably reproduces.

Fryguy commented 8 months ago

I'm fine with closing this one, especially if #2070 is the same issue.

zapek commented 2 months ago

Still causing problems. https://www.bestpractices.dev/projects/9469/badge shows 88%. The following workaround works: https://www.bestpractices.dev/projects/9469/badge?foo
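
The `?foo` workaround presumably helps because the added query string makes the URL differ from the one the stale cache entry was stored under (an assumption; the thread does not confirm the mechanism). A minimal Python sketch of the same trick, useful for diagnosis rather than as a fix, where `with_cache_buster` and the `cachebust` parameter name are made up for illustration:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_cache_buster(url: str, token: str) -> str:
    """Return `url` with an extra query parameter appended.

    The parameter name "cachebust" is arbitrary; any query string the
    cache has not seen before should have the same effect.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("cachebust", token))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, `with_cache_buster("https://www.bestpractices.dev/projects/9469/badge", "1")` yields the badge URL followed by `?cachebust=1`.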

david-a-wheeler commented 2 months ago

Sadly, https://www.bestpractices.dev/projects/9469/badge shows "passing" to me. It's hard to fix it when I can't reproduce :-(.

I have a request: Can you do a force-reload on your web browser? Basically go to the page https://www.bestpractices.dev/projects/9469/badge , hold down the "shift" key, and while holding down the "shift" key press the reload button:

[Image omitted. Source: https://shaminospage.blogspot.com/2023/01/quick-tip-use-shift-refresh-in-your-web.html]

If that fixes it, GREAT. If not, I could purge all caches, but since it shows correctly for me I don't think that would solve anything.

zapek commented 2 months ago

Ok, I think I've got it. It was really puzzling to figure out. This is on the same computer, first with Windows' curl:

curl https://www.bestpractices.dev/projects/9469/badge
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="184" height="20" role="img" aria-label="openssf best practices: passing"><title>openssf best practices: passing</title><linearGradient id="s" x2="0" y2="100%"><stop offset="0" stop-color="#bbb" stop-opacity=".1"/><stop offset="1" stop-opacity=".1"/></linearGradient><clipPath id="r"><rect width="184" height="20" rx="3" fill="#fff"/></clipPath><g clip-path="url(#r)"><rect width="133" height="20" fill="#555"/><rect x="133" width="51" height="20" fill="#4c1"/><rect width="184" height="20" fill="url(#s)"/></g><g fill="#fff" text-anchor="middle" font-family="Verdana,Geneva,DejaVu Sans,sans-serif" text-rendering="geometricPrecision" font-size="110"><text aria-hidden="true" x="675" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="1230">openssf best practices</text><text x="675" y="140" transform="scale(.1)" fill="#fff" textLength="1230">openssf best practices</text><text aria-hidden="true" x="1575" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="410">passing</text><text x="1575" y="140" transform="scale(.1)" fill="#fff" textLength="410">passing</text></g></svg>

and now using Brave's (Chrome's, really) "Copy as PowerShell" option on the network request, which is supposed to reproduce the original browser request in PowerShell:

Invoke-WebRequest -UseBasicParsing -Uri "https://www.bestpractices.dev/projects/9469/badge" `
>> -WebSession $session `
>> -Headers @{
>> "authority"="www.bestpractices.dev"
>>   "method"="GET"
>>   "path"="/projects/9469/badge"
>>   "scheme"="https"
>>   "accept"="text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8"
>>   "accept-encoding"="gzip, deflate, br, zstd"
>>   "accept-language"="en-CH,en;q=0.9,fr;q=0.8,es;q=0.7"
>>   "cache-control"="no-cache"
>>   "pragma"="no-cache"
>>   "priority"="u=0, i"
>>   "sec-ch-ua"="`"Brave`";v=`"129`", `"Not=A?Brand`";v=`"8`", `"Chromium`";v=`"129`""
>>   "sec-ch-ua-mobile"="?0"
>>   "sec-ch-ua-platform"="`"Windows`""
>>   "sec-fetch-dest"="document"
>>   "sec-fetch-mode"="navigate"
>>   "sec-fetch-site"="none"
>>   "sec-fetch-user"="?1"
>>   "sec-gpc"="1"
>>   "upgrade-insecure-requests"="1"
>> }

StatusCode        : 200
StatusDescription : OK
Content           : <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="234"
                    height="20" role="img" aria-label="openssf best practices: in progress 88%"><title>openssf best
                    practices…
RawContent        : HTTP/1.1 200 OK
                    Connection: keep-alive
                    Server: Cowboy
                    Report-To: {"group":"heroku-nel","max_age":3600,"endpoints":[{"url":"https://nel.heroku.com/reports
                    ?ts=1726923354&sid=af571f24-03ee-46d1-9f90-a…
Headers           : {[Connection, System.String[]], [Server, System.String[]], [Report-To, System.String[]],
                    [Reporting-Endpoints, System.String[]]…}
Images            : {}
InputFields       : {}
Links             : {}
RawContentLength  : 1248
RelationLink      : {}

Notice how the first request is "passing" and the 2nd one is "88%". I believe Fastly caches by IP address and if some headers match as well. I also get 88% when using Edge or Firefox but "passing" when using plain cURL (also on Linux).
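
To repeat this comparison without reading raw SVG, a small Python sketch like the one below could fetch the badge with a chosen set of request headers and pull out the `aria-label` text. The helper names are hypothetical, and the header values passed in are whatever you want to test, not anything the site is known to require.

```python
import re
import urllib.request
from typing import Dict, Optional

ARIA_LABEL = re.compile(r'aria-label="([^"]*)"')


def badge_label(svg_text: str) -> Optional[str]:
    """Extract the aria-label text (e.g. "...: passing") from a badge SVG."""
    match = ARIA_LABEL.search(svg_text)
    return match.group(1) if match else None


def fetch_badge_label(url: str, headers: Dict[str, str]) -> Optional[str]:
    """Fetch `url` with the given request headers and return its badge label."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return badge_label(response.read().decode("utf-8", errors="replace"))
```

Calling `fetch_badge_label` once with no extra headers and once with a browser-style `accept` header, from the same machine, should show whether the two requests are served different cached copies.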

With the original Browser (Brave), using shift + enter to load the URL, it gets 88% (I see the 200 reply, and not 304 in the Network tab in the inspector so it's not local caching). But if I use the Tor feature in the same browser (so it uses a different IP address), I get the passing badge (incognito mode gets 88% so it's not caused by it (Tor mode enforces incognito as well)).

Last test with my phone (Brave):

- using wifi (same public IP as my computer): 88%
- using 4G network (different public IP): passing

david-a-wheeler commented 2 months ago

THANK YOU so much for that information! I'm still mystified about exactly what the problem is - and thus how to fix it - but this definitely provides much more specific information to help us track this down. Thank you for providing all that!

> Notice how the first request is "passing" and the 2nd one is "88%". I believe Fastly caches by IP address and if some headers match as well. I also get 88% when using Edge or Firefox but "passing" when using plain cURL (also on Linux).

I can confirm the header matching, we even specifically identify the headers to use. It caches my source address (really domain name not just IP address), but of course that makes sense, different web sites serve different data :-). As far as the requester IP address goes, that has an indirect effect as I understand it. Fastly has a large number of different intermediate caching systems; if a client has a different IP address, it's more likely to end up connecting to a different Fastly system for a cache.
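
As a rough illustration of header matching, here is a Python sketch of how a cache might build its lookup key from the host, the path, and the request headers the origin names. It is a simplified model, not Fastly's implementation, and the `Accept` header in the example is an assumption, since the thread does not say which headers the site identifies.

```python
from typing import Dict, Iterable, Tuple

CacheKey = Tuple[str, str, Tuple[Tuple[str, str], ...]]


def cache_key(
    host: str, path: str, request_headers: Dict[str, str], vary_on: Iterable[str]
) -> CacheKey:
    """Build a lookup key from host, path, and the selected request headers.

    The client IP address is deliberately not part of the key.
    """
    lowered = {name.lower(): value for name, value in request_headers.items()}
    varied = tuple(
        (name, lowered.get(name, ""))
        for name in sorted(h.lower() for h in vary_on)
    )
    return (host.lower(), path, varied)
```

Under this model, a curl request (`Accept: */*`) and a browser request (a long `Accept` list) map to different cache entries whenever `Accept` is one of the varied headers, so one entry can be stale while the other is fresh.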

> With the original Browser (Brave), using shift + enter to load the URL, it gets 88% (I see the 200 reply, and not 304 in the Network tab in the inspector so it's not local caching). But if I use the Tor feature in the same browser (so it uses a different IP address), I get the passing badge (incognito mode gets 88% so it's not caused by it (Tor mode enforces incognito as well)).

It sounds like different systems you're using are deterministically getting different answers. That is surprising, but I have a hypothesis. Those intermediate systems implemented by Fastly may have more than one computer, and it's possible that they deterministically choose which one to use based on the client IP address and port number (instead of using a round-robin to determine which one is used). I expect that your systems go through a NAT (e.g., on your router), so Fastly will see different port numbers on different applications. The result, under this hypothesis, is that they'd talk to different Fastly systems. If those different systems have inconsistent data, then we have an explanation.
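
To make the hypothesis concrete, here is a toy Python model of hash-based node selection. It illustrates only the idea described above; nothing in the thread confirms that Fastly selects nodes this way, and `pick_node` is a made-up name.

```python
import hashlib


def pick_node(client_ip: str, client_port: int, node_count: int) -> int:
    """Map a client (IP, port) pair to one of `node_count` cache nodes.

    The same pair always maps to the same node, unlike round-robin.
    """
    digest = hashlib.sha256(f"{client_ip}:{client_port}".encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % node_count
```

In such a scheme, two applications behind the same NAT share a public IP address but get different source ports, so they can land on different nodes consistently. If one of those nodes holds a stale badge, one application would keep seeing it while the other never does.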

Maybe we need to send several cache clears, over time, for a given data value, since the result may not "take". Fastly is very fast (that's what it's designed to do!), but this may suggest that the speed creates some subtle race conditions we must account for. Anyway, that's my current hypothesis.

david-a-wheeler commented 2 months ago

Okay, I think I have a set of hypotheses & solution if the hypotheses are correct.

On Sun May 12 18:35:02 2024 -0400 we merged commit 69afb7fb06e298951e4d5b579d795930dcf4e5af to fix a race condition on CDN caching (details there). The problem described here predates that fix, so you could argue that "well, the problem was before the fix".

The problem with that hypothesis is that caches expire. The bad cached data should have long since expired and been replaced with good data. But we now use shorter cache times; it's possible that these specific Fastly systems received data when we used longer cache times, and those entries haven't expired.

Okay, but when there's an update, we send out a request to Fastly to purge old data... that should have eliminated the old bad data. After reviewing things, though, there's a quirk that might explain this. We purge the Fastly data immediately and resend it on a delay; the delayed resend should reset everything. However, the delayed resend is in a job stored in RAM. The best practices site reboots once a day; it's possible the system was rebooted while the job was pending. If that is the problem, the solution would be to move the job queue to the database, making the database a little busier but persisting the job.
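
The difference between an in-RAM job and a persisted one can be sketched in Python with SQLite. This is an illustration only, not the site's code: the table name, the function names, and the `purge` callable are all hypothetical.

```python
import sqlite3
from typing import Callable


def open_queue(path: str) -> sqlite3.Connection:
    """Open (or create) the on-disk job table; it survives process restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purge_jobs ("
        "id INTEGER PRIMARY KEY, url TEXT NOT NULL, run_at REAL NOT NULL)"
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, url: str, run_at: float) -> None:
    """Record a purge of `url` to be performed at or after time `run_at`."""
    conn.execute("INSERT INTO purge_jobs (url, run_at) VALUES (?, ?)", (url, run_at))
    conn.commit()


def run_due(conn: sqlite3.Connection, purge: Callable[[str], None], now: float) -> int:
    """Run and delete every job whose time has come; return how many ran."""
    rows = conn.execute(
        "SELECT id, url FROM purge_jobs WHERE run_at <= ? ORDER BY run_at", (now,)
    ).fetchall()
    for job_id, url in rows:
        purge(url)  # stand-in for the real CDN purge request
        conn.execute("DELETE FROM purge_jobs WHERE id = ?", (job_id,))
        conn.commit()
    return len(rows)
```

Because the pending job is a row on disk rather than an object in memory, a reboot between `enqueue` and `run_due` does not lose it: the next process to call `run_due` picks it up. The cost is a few extra database writes per badge change.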

So I think I'll force a "purge all" on the production site. That will clear all caches, and that should fix the problem for all current badges. If that works, I'll look into moving the jobs into the database (that shouldn't be difficult, just take a little time).

david-a-wheeler commented 2 months ago

I've purged all cached data, so you should only get correct data now.

I'm a little suspicious of my hypothesis. Even in a race condition, once the badge expired it should have gotten a new one. I don't have a better idea for a hypothesis, though.

zapek commented 2 months ago

Still getting the old badge with shift + reload.

david-a-wheeler commented 2 months ago

> Still getting the old badge with shift + reload.

That... shouldn't be possible... {BOGGLE}. Sorry that it's still not working.

I'm running out of explanations. Maybe our API call to Fastly failed silently?! So, I've logged directly into Fastly and invoked a forced "purge all" of all cached values, through directly demanding it of them.

Please do shift + reload again, and let me know if all is okay now. That should DEFINITELY fix it. If that does NOT fix it, I'll be completely perplexed and annoyed at Murphy.

david-a-wheeler commented 2 months ago

If shift-reload still fails, I need to review the cache-control settings. This should not be happening with the values we send. So if it's happening, my best guess is that the settings we think we're using aren't the ones we're using.
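
One way to check which settings are actually in effect is to read the `Cache-Control` header from a live response and compare it with the intended values. A minimal Python parsing sketch follows; `parse_cache_control` is a hypothetical helper and the example value is made up, not the site's real setting.

```python
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control header value into {directive: argument or None}.

    Naive on purpose: it splits on commas, so a quoted argument that
    itself contains a comma would be parsed incorrectly.
    """
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives
```

For instance, `parse_cache_control("public, max-age=600")` returns `{"public": None, "max-age": "600"}`, which can then be compared against the cache lifetime the application believes it is sending.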

zapek commented 1 month ago

I just re-tested now and it finally works! passing badge.

coreinfrastructure / best-practices-badge

Badge is not updating in a timely way #2072