gamalan / caddy-tlsredis

Redis Storage for Caddy TLS Data
Apache License 2.0

Fix concurrent map writes on rd.locks #25

Closed bryantraywick closed 3 years ago

bryantraywick commented 3 years ago

We rolled out Caddy about two months ago and are using this module to store thousands of on-demand SSL certificates. Those certificates are coming up for renewal today, and at 12:36 PM EST Caddy crashed with a concurrent map writes error on this line. There is no lock protection around the rd.locks map access.
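For illustration, here is a minimal sketch of the mutex-guard idea behind this PR; the field and type names (locksMu, lockHandle, storeLock, deleteLock) are placeholders rather than the module's actual identifiers:

```go
package storagesketch

import "sync"

// lockHandle stands in for whatever per-key lock value the module keeps in rd.locks.
type lockHandle struct{}

// RedisStorage here is only a sketch; the real struct has more fields.
type RedisStorage struct {
	locksMu sync.Mutex // guards all access to locks
	locks   map[string]*lockHandle
}

// storeLock records a held lock; the mutex prevents concurrent map writes.
func (rd *RedisStorage) storeLock(key string, lh *lockHandle) {
	rd.locksMu.Lock()
	defer rd.locksMu.Unlock()
	rd.locks[key] = lh
}

// deleteLock removes a released lock under the same mutex.
func (rd *RedisStorage) deleteLock(key string) {
	rd.locksMu.Lock()
	defer rd.locksMu.Unlock()
	delete(rd.locks, key)
}
```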

I am not an expert Go programmer, so I wasn't sure whether the Lock() and Unlock() calls should be more granular, or whether they would interact with the Redis locks in a way that could cause a deadlock. I am going to deploy this change to production today because I know it is only a matter of time and luck before Caddy crashes again. I would appreciate it if you or @mholt could review this PR to ensure that it is the correct approach. I'd also be willing to test any other fixes you come up with for this bug.

gamalan commented 3 years ago

@bryantraywick can you provide a more thorough scenario, and which version of Caddy are you using?

bryantraywick commented 3 years ago

We are using Caddy v2.3.0. Here is the full build-info. We rolled out Caddy in mid-December, and over the first 2-3 days we generated >3000 on-demand SSL certificates. Those certificates now expire in 30 days and are being renewed, and during those renewals Caddy crashed with fatal error: concurrent map writes on line 503 of storageredis.go in v0.2.5 of this module.

The issue is multiple goroutines accessing/modifying rd.locks at the same time. My initial PR crashed as well this weekend with a slightly different error: fatal error: concurrent map read and map write. I updated my PR to use sync.Map and it has been running smoothly since then.
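Roughly, the sync.Map version looks like this (key and value types are illustrative placeholders); sync.Map is safe for concurrent use without an explicit mutex:

```go
package main

import (
	"fmt"
	"sync"
)

// lockHandle is a placeholder for the per-key lock value.
type lockHandle struct{}

func main() {
	var locks sync.Map // behaves like a concurrency-safe map[string]*lockHandle

	key := "example.com.crt.lock"
	locks.Store(key, &lockHandle{}) // record the held lock

	if v, ok := locks.Load(key); ok { // safe to call from other goroutines
		fmt.Printf("lock held: %T\n", v)
	}

	locks.Delete(key) // remove it on unlock
}
```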

mholt commented 3 years ago

I forgot about the build-info command! It's not even documented. :upside_down_face:

Thanks for the patch! It most likely solves the actual data races, but @gamalan will have to check it for logical races. :+1:

gamalan commented 3 years ago

The problem with this PR is that it could introduce a new issue for anyone running distributed Caddy, because it doesn't use a shared lock. The lock is only available on a single Caddy instance, so the data itself is not locked while it is being updated. It might work for @bryantraywick's case, but it might pose an issue for others. I need more time to think.

gamalan commented 3 years ago

Ah, sorry, it seems I didn't read the code properly. It looks good to me, but I need to double-check.

bryantraywick commented 3 years ago

I wanted to follow up on this because last night my custom Caddy build suddenly spiked in memory and CPU usage until it exhausted all resources. Caddy's memory usage spiked from 1.96GB to 7.32GB and the number of goroutines spiked from 25,000 to 72,500.

[Screenshot from 2021-02-17 15-02-25: graphs of the memory and goroutine spikes]

I suspect this is potentially related to my changes, or to the semantics of the storage Lock and Unlock method calls in general. We may want to move this discussion to an issue on the main Caddy repo, because it appears that this race condition affects several other storage providers. The memory/CPU usage bug likewise appears to be related to this issue. I see that there have been recent commits to certmagic related to how certificate renewals are handled, so this may be fixed already, but I suspect that an unbounded number of handshake renewals were kicked off last night, possibly recursively, causing the high CPU and memory usage.

mholt commented 3 years ago

@bryantraywick What was in the heap / goroutine dump? Caddy exposes it through the admin API at http://localhost:2019/debug/pprof -- no need for guesswork; it would tell us exactly where all the resource usage is coming from.

The memory/CPU usage bug likewise appears to be related to this issue.

That proposal was never implemented because it has the potential for unbounded resource use. The current implementation bounds it. You still need enough resources, but it'd be worth looking into what happened in your case.

Edit: We still need the goroutine dump, but does it happen with the standard file system storage module? If not, it's probably related to this plugin. Implementing locking is tricky if you're writing the primitives yourself. Our file system module does this, I believe correctly, and it could be used as a reference for how to do something similar.

bryantraywick commented 3 years ago

@mholt Thanks for the additional insight and the tip to check http://localhost:2019/debug/pprof. Unfortunately, the kernel OOM killer took out Caddy shortly after it became unresponsive and memory started ballooning, but I will be sure to capture that output if it happens again. I haven't used the file system storage module except for testing, because in production we use a pair of load balancer servers with a floating VIP in an active/standby architecture.

I'll look over the file system storage module implementation tomorrow and see if I can come up with a better PR that is less prone to bugs. I mainly wanted to mention the high CPU and memory usage in case it was related to my PR. Unfortunately, Caddy was killed before I could debug further, so I don't know whether it was my changes or something completely unrelated.

bryantraywick commented 3 years ago

Hey @mholt and @gamalan, I have refactored the Lock() and Unlock() methods to follow a structure more similar to the file system module's. The Lock() method also now blocks until it can acquire the lock, as it should. I believe the semantics are correct now. The MultipleLocks test now exercises concurrent goroutines trying to acquire the same lock. The previous TwoLocks test was not semantically correct, because Lock() should block rather than return an error if it can't acquire the lock. I would appreciate a review of these changes for semantic and logical correctness and any other issues I may have missed.
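For reviewers, here is a rough outline of the blocking behavior I mean. This is a simplified sketch using go-redis directly, with placeholder TTL, polling interval, and key-suffix choices; it is not the actual diff:

```go
package storagesketch

import (
	"context"
	"time"

	"github.com/go-redis/redis/v8"
)

const (
	lockTTL      = 5 * time.Second        // how long a lock stays valid if never refreshed
	pollInterval = 500 * time.Millisecond // how often to retry while waiting
)

// Lock blocks until the lock for key is acquired or ctx is cancelled,
// instead of returning an error when the lock is already held.
func Lock(ctx context.Context, client *redis.Client, key string) error {
	for {
		// SET key value NX PX ttl: only succeeds if the key does not exist yet.
		acquired, err := client.SetNX(ctx, key+".lock", "locked", lockTTL).Result()
		if err != nil {
			return err
		}
		if acquired {
			return nil
		}
		// Another instance holds the lock; wait and retry.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(pollInterval):
		}
	}
}

// Unlock releases the lock by deleting its key.
func Unlock(ctx context.Context, client *redis.Client, key string) error {
	return client.Del(ctx, key+".lock").Err()
}
```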

I also had a question for @mholt regarding the background freshness goroutine in the file system module. The updateLockfileFreshness method keeps the lock's metadata updated until the lock has been removed, but what happens if the thread that acquires the lock dies and never unlocks it? If I am understanding correctly, the keepLockfileFresh goroutine will keep refreshing the lock so that it never goes stale, and therefore no other thread will ever be able to acquire the lock until Caddy is restarted and the lock eventually expires. Is that assessment correct, or is there some other mechanism that will cause the keepLockfileFresh goroutine to exit and stop refreshing the lock? I just want to make sure my changes match the behavior Caddy expects.

mholt commented 3 years ago

but what happens if thread that acquires the lock dies and never unlocks it

Yeah, so, that's bad: the code that uses locks must be correct. Just like when using any of the features of the sync package. This is why, typically, unlocks happen in a defer, if there's any possibility of panics or errors.
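Something like this is the pattern being described; the interface here is a minimal stand-in, and exact storage signatures vary between certmagic versions:

```go
package lockingsketch

import "context"

// locker is a minimal stand-in for a storage backend's locking methods;
// exact signatures differ between certmagic versions.
type locker interface {
	Lock(ctx context.Context, key string) error
	Unlock(key string) error
}

// doWithLock releases the lock on every return path, including panics,
// because the Unlock call is deferred right after a successful Lock.
func doWithLock(ctx context.Context, l locker, key string, work func() error) error {
	if err := l.Lock(ctx, key); err != nil {
		return err
	}
	defer l.Unlock(key)
	return work()
}
```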

bryantraywick commented 3 years ago

Yeah, so, that's bad: the code that uses locks must be correct. Just like when using any of the features of the sync package. This is why, typically, unlocks happen in a defer, if there's any possibility of panics or errors.

Thanks, that's what I thought; I just wanted to confirm. The caller must unlock, and the freshness background thread is there to handle cases where Caddy crashes without releasing the lock (i.e., the background thread is no longer running, so the lock key eventually expires, allowing another Caddy process to acquire the lock).
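To make that concrete, the refresh pattern I have in mind looks roughly like this (a sketch using go-redis with a placeholder TTL, not the actual implementation): the goroutine stops when Unlock closes done, and if the whole process dies the key simply expires on its own.

```go
package storagesketch

import (
	"context"
	"time"

	"github.com/go-redis/redis/v8"
)

const refreshTTL = 5 * time.Second // placeholder lock TTL

// keepFresh extends the lock's TTL until Unlock closes done.
// If the process crashes, nothing refreshes the key and it expires on its own,
// letting another Caddy instance acquire the lock later.
func keepFresh(ctx context.Context, client *redis.Client, lockKey string, done <-chan struct{}) {
	ticker := time.NewTicker(refreshTTL / 2)
	defer ticker.Stop()
	for {
		select {
		case <-done: // Unlock was called; stop refreshing
			return
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Push the expiration out so other instances still see the lock as held.
			client.Expire(ctx, lockKey, refreshTTL)
		}
	}
}
```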

mholt commented 3 years ago

Yep! Exactly.