commercialhaskell / stackage-server

Server for stable, curated Haskell package sets
MIT License

Hoogle Search Fails #323

JavariaGhafoor closed 5 months ago

JavariaGhafoor commented 8 months ago

For whatever I search, I get this error:

[Screenshot of the error, 2023-12-29 at 5:48:20 PM]

chreekat commented 8 months ago

@JavariaGhafoor this is fixed now - sorry for the wait!

See https://github.com/commercialhaskell/stackage/issues/7258.

tomjaguarpaw commented 5 months ago

It's happening again. (I didn't find any snapshot that claims to have Hoogle database available.)

chreekat commented 5 months ago

Arg. I will investigate.

tomjaguarpaw commented 5 months ago

Because this has become a recurring complaint, maybe an automated check would be worthwhile?

chreekat commented 5 months ago

I think one lesson learned is that I should have put that stuff in first. As it is, it's high on the list of my "important + urgent" followup tasks for the handover.

chreekat commented 5 months ago

@tomjaguarpaw I've discovered how to manually repair a missing hoogle database by throwing files around on the server. Search should be working again. Unfortunately there is an underlying problem that will need to be fixed before the next LTS is released, or it will happen again.

chreekat commented 5 months ago

There are three underlying problems that may have conspired to create this outage.

  1. R2 is performing badly for PutObject and ListObjects commands.
  2. I had misconfigured the cache in front of the bucket such that it was caching 404s.
  3. Stackage decides a new snapshot is "ready" before the Hoogle database has finished being prepared.

I believe what happened is that Stackage decided lts-22.14 was available (3) before the database finished getting uploaded (1), leading to a window of time where Hoogle searches would come up empty. This time window was long enough that someone (or some thing) did, in fact, perform a Hoogle search. Stackage hit a 404 looking for the database, and it was dutifully cached by Cloudflare (2). Thus, even after the PutObject finished, Stackage kept getting a 404 for the database.
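
The sequence described above can be replayed with a small model. This is only a sketch under stated assumptions: the types `Status` and `CachePolicy` and the functions `request` and `replay` are hypothetical names invented for illustration, not part of Stackage or Cloudflare, and cache expiry is deliberately not modeled.

```haskell
-- What the origin bucket (or the cache) can answer for the database object.
data Status = Found | NotFound
    deriving (Eq, Show)

-- Hypothetical cache policies: the misconfiguration (2) and a corrected setup.
data CachePolicy = CacheEverything | CacheSuccessOnly
    deriving (Eq, Show)

-- One request through the cache. Returns the answer the client sees
-- together with the new cache contents.
request :: CachePolicy -> Maybe Status -> Status -> (Status, Maybe Status)
request _ (Just cached) _ = (cached, Just cached)
request policy Nothing origin = (origin, stored)
  where
    stored
        | policy == CacheSuccessOnly && origin == NotFound = Nothing
        | otherwise = Just origin

-- Replay a sequence of origin states, one request per state, starting
-- from an empty cache, and collect what the client sees each time.
replay :: CachePolicy -> [Status] -> [Status]
replay policy = go Nothing
  where
    go _ [] = []
    go cache (origin : rest) =
        let (seen, cache') = request policy cache origin
         in seen : go cache' rest

-- The first request arrives before the upload finishes; the rest arrive after.
timeline :: [Status]
timeline = [NotFound, Found, Found]
```

In this model, `replay CacheEverything timeline` keeps answering `NotFound` even after the object exists at the origin, while `replay CacheSuccessOnly timeline` recovers as soon as the upload completes.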

I have fixed (2). I was already working with Cloudflare staff to resolve (1). I'm still digging into (3), because it's honestly a problem even on its own. I would like to ensure that a snapshot doesn't "go live" on Stackage until the Hoogle database is already in place.
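
One way to picture the goal for (3) is a readiness gate. The sketch below is an assumption about what such a gate could look like, not how Stackage decides readiness today; the `Snapshot` record, its fields, and `liveSnapshots` are hypothetical names.

```haskell
-- Hypothetical view of a snapshot's publication state.
data Snapshot = Snapshot
    { snapshotName :: String
    , packagesIndexed :: Bool -- the snapshot itself has been imported
    , hoogleDbInPlace :: Bool -- the Hoogle database upload has completed
    }

-- A snapshot goes live only when both conditions hold.
isLive :: Snapshot -> Bool
isLive s = packagesIndexed s && hoogleDbInPlace s

-- Names of the snapshots that may be shown to users.
liveSnapshots :: [Snapshot] -> [String]
liveSnapshots = map snapshotName . filter isLive
```

With a gate like this, a snapshot whose database is still uploading, for example `Snapshot "lts-22.14" True False`, would stay hidden until the second flag flips.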

Note that I have not definitively identified this as the actual root cause. But all three problems do definitely exist, and they could cause an outage in the way outlined. Occam's razor etc.

chreekat commented 5 months ago

I also discovered a mystery: Stackage wasn't getting a 404 for the database, but it wasn't finding it, either. The logs said nothing. But every non-200 status code had logging associated with it! Where was the trail of execution going? Was some exception happening before the log lines could fire?

But I think I figured it out.

case responseStatus res of
    status
        | status == status200 -> do
            createDirectoryIfMissing True $ takeDirectory fp
            withBinaryFileDurableAtomic fp WriteMode $ \h ->
                runConduitRes $
                bodyReaderSource (responseBody res) .| ungzip .|
                sinkHandle h
            return $ Just fp
        | status == status404 -> do
            logWarn $ "NotFound: " <> display (hoogleUrl name bucketUrl)
            return Nothing
        | otherwise -> do
            body <- liftIO $ brConsume $ responseBody res
            mapM_ (logWarn . displayBytesUtf8) body
-- here     ^^^^^^
            return Nothing

If the response body is empty, the mapM_ results in no messages sent! :D
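
A possible remedy is to always log the status code, and to treat the body as optional detail. The following is a minimal sketch, not the actual fix: `unexpectedResponseLog` is a hypothetical pure helper, the message wording is invented, and body chunks are modeled as `String` rather than `ByteString` to keep the example self-contained.

```haskell
-- Hypothetical helper: build the warning lines for an unexpected response.
-- The first line is always produced, so an empty body still leaves a trace.
unexpectedResponseLog :: Int -> [String] -> [String]
unexpectedResponseLog statusCode bodyChunks =
    ("Unexpected status " <> show statusCode <> bodyNote) : nonEmptyChunks
  where
    -- Drop empty chunks so they do not produce blank log lines.
    nonEmptyChunks = filter (not . null) bodyChunks
    bodyNote
        | null nonEmptyChunks = " (empty response body)"
        | otherwise = ""
```

In the `otherwise` branch, each returned line could then be passed to `logWarn`, so the branch produces at least one message regardless of what the server sent.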