caddyserver / certmagic

Automatic HTTPS for any Go program: fully-managed TLS certificate issuance and renewal
https://pkg.go.dev/github.com/caddyserver/certmagic?tab=doc
Apache License 2.0
5k stars 287 forks source link

Storage check -- can it be removed? #201

Closed ankon closed 1 year ago

ankon commented 2 years ago

https://github.com/caddyserver/certmagic/blob/76f61c2947a20d86ca37669dbdc0ed7a96fc6c5f/config.go#L456-L461

Depending on the storage this check can be quite expensive (in our case a remote KV store), and it happens in a rather "hot" area: Right when setting up a TLS connection. If we need to get a certificate from a CA this doesn't matter too much in the grand scheme of things, but if our storage has one we triple our roundtrips (1 to get the cert information from storage, and 2 for a write/read from storage to check it).

I think this check might have some value, but not here:

So, I'd propose to just drop these lines here, and possibly move the implementation of checkStorage as an example into documentation. WDYT?

mholt commented 2 years ago

I'm open to optimizations, but first:

It should already only do the check if a certificate is needed, which should only be "hot" once every 60 days or so -- and even that runs in the background except the very first time. Shouldn't be blocking a TLS handshake after the first time it gets a certificate.

What kinds of performance impacts are you experiencing? How have you verified that the storage checks are causing performance problems?

ankon commented 2 years ago

When the server using on-demand TLS with certmagic starts up, its in-memory caches are empty, and that's what I'm trying to optimize right now. Our Storage is using AWS DynamoDB (based on https://github.com/silinternational/certmagic-storage-dynamodb), and as much that scales, it does see quite some load I'd like to get rid off.

But, you're right indeed ... before it would do the storage check, obtainCert first checks with storageHasCertResourcesAnyIssuer for any existing resources. This call should return true (our implementation of Exists internally does a Load and checks for whether that worked), and following the code from there I think it will pick up the certs through an actual Storage.Load in Config.CacheManagedCertificate/Config.loadManagedCertificate.

Leaves me still with some not-so-nice performance if the above is true: I should end up with 3 distinct read requests in storageHasCertResourcesAnyIssuer, and another 3 in loadManagedCertificate. I'll have to measure this more closely to confirm this theory. Ideally I'd like this number to be 1, of course :)

I guess I could get there by a) Have a tiny internal cache with a short expiration for Load, so that the second load just grabs from that cache rather than roundtrip b) recognize the storageHasCertResourcesAnyIssuer and pre-load all keys into that cache (in DynamoDB through BatchGetItem or possibly by using a smart query/index operation)

It also IMHO still leaves the question open whether the storage check would not be better in the storage itself, so that an implementation could decide how to check?

mholt commented 2 years ago

The point of the storage check is to ensure the storage is properly configured, it's not really dependent upon how "trustworthy" the storage mechanism is.

It's tempting to say that just the file system needs such a check to ensure there's enough disk space, but actually most problems I've heard of in production stem from getting its CertMagic config wrong, or the actual storage system itself is misconfigured.

Maybe we could make storage checks optional, but disabling them poses a risk too.

(I suppose ideally we'd optimize them. I have a TODO that they don't need to happen so frequently, but haven't gotten around to implementing that. Even if it's just a simple memory of last time the storage was checked.)

ankon commented 2 years ago

What about sth like this:

diff --git a/config.go b/config.go
index ed155d5..266415d 100644
--- a/config.go
+++ b/config.go
@@ -976,11 +976,19 @@ func (cfg *Config) getChallengeInfo(ctx context.Context, identifier string) (Cha
    return Challenge{Challenge: chalInfo}, true, nil
 }

-// checkStorage tests the storage by writing random bytes
-// to a random key, and then loading those bytes and
+// checkStorage tests the storage
+//
+// If the storage implements `StorageChecker` its StorageChecker.Check
+// method will be called, otherwise the storage is checked by writing
+// random bytes to a random key, and then loading those bytes and
 // comparing the loaded value. If this fails, the provided
 // cfg.Storage mechanism should not be used.
 func (cfg *Config) checkStorage(ctx context.Context) error {
+   if sc, ok := cfg.Storage.(StorageChecker); ok {
+       // Storage knows how to check itself, so let it do that.
+       return sc.Check(ctx)
+   }
+
    key := fmt.Sprintf("rw_test_%d", weakrand.Int())
    contents := make([]byte, 1024*10) // size sufficient for one or two ACME resources
    _, err := weakrand.Read(contents)
diff --git a/storage.go b/storage.go
index 0bb34cd..4fc0cfe 100644
--- a/storage.go
+++ b/storage.go
@@ -81,6 +81,12 @@ type Storage interface {
    Stat(ctx context.Context, key string) (KeyInfo, error)
 }

+type StorageChecker interface {
+   // Check returns whether the storage is "good to go" or
+   // whether it cannot be used right now
+   Check(ctx context.Context) error
+}
+
 // Locker facilitates synchronization across machines and networks.
 // It essentially provides a distributed named-mutex service so
 // that multiple consumers can coordinate tasks and share resources.

For me it would allow me to implement the Check method (for instance by only checking once), for existing code nothing would change, and other more advanced storages could indeed implement some "once in while" behavior without needing a change in certmagic itself?

mholt commented 2 years ago

Maybe, I still feel like storage problems are more related to configuration than actual implementation (usually -- assuming a proper implementation and a capable storage backend).

I feel like a switch on or off in the config is probably a better solution to this. Would that be OK with you?

ankon commented 2 years ago

I feel like a switch on or off in the config is probably a better solution to this. Would that be OK with you?

Sure, that would work for us! Shall I make a PR to that effect?

mholt commented 2 years ago

@ankon Sorry for my delayed reply. Yes, I'd be happy to review that. Probably have the checks enabled by default, with the option to disable them.

mbardelmeijer commented 2 years ago

I would also love to see an option to disable this. Our use-case would be the same as @ankon but for a S3 storage driver :)

mholt commented 2 years ago

@mbardelmeijer That's good to know. But be aware that S3 is not safe to use as a storage backend by itself, since it cannot support atomic operations.

mholt commented 1 year ago

@mbardelmeijer @ankon I've implemented this in 5357574. Please feel free to try it out!

ankon commented 1 year ago

Thanks a lot!

We just rolled this out for a spin, and according to our database statistics this looks great -- no more rw_test_ keys in the "top contributors" diagrams at all. We're going to leave it running for a while now, and see whether overall statistics also show improvements.

mholt commented 1 year ago

Awesome, thanks for giving it a spin!