caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0
58.69k stars 4.05k forks source link

v2.7.x: panic when TLS-SNI is received for an unknown domain #5680

Closed otbutz closed 1 year ago

otbutz commented 1 year ago

caddy v2.7.2 installed via the apt repository:

Aug 03 08:55:31 proxy caddy[1163492]: panic: runtime error: invalid memory address or nil pointer dereference
Aug 03 08:55:31 proxy caddy[1163492]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x903750]
Aug 03 08:55:31 proxy caddy[1163492]: goroutine 57 [running]:
Aug 03 08:55:31 proxy caddy[1163492]: github.com/caddyserver/certmagic.(*Config).getCertDuringHandshake(0xc000543520, {0x1f09a88, 0xc000194008}, _, _)
Aug 03 08:55:31 proxy caddy[1163492]:         github.com/caddyserver/certmagic@v0.19.1/handshake.go:378 +0x1390
Aug 03 08:55:31 proxy caddy[1163492]: github.com/caddyserver/certmagic.(*Config).GetCertificateWithContext(0xc000543520, {0x1f09a88, 0xc000194008}, 0xc000543450)
Aug 03 08:55:31 proxy caddy[1163492]:         github.com/caddyserver/certmagic@v0.19.1/handshake.go:84 +0xbff
Aug 03 08:55:31 proxy caddy[1163492]: github.com/caddyserver/certmagic.(*Config).GetCertificate(0xc000138ee0?, 0xc0001dc1b0?)
Aug 03 08:55:31 proxy caddy[1163492]:         github.com/caddyserver/certmagic@v0.19.1/handshake.go:50 +0x2a
Aug 03 08:55:31 proxy caddy[1163492]: github.com/caddyserver/caddy/v2/modules/caddytls.(*ConnectionPolicy).buildStandardTLSConfig.func1(0xc000543450)
Aug 03 08:55:31 proxy caddy[1163492]:         github.com/caddyserver/caddy/v2@v2.7.2/modules/caddytls/connpolicy.go:232 +0x14f
Aug 03 08:55:31 proxy caddy[1163492]: github.com/quic-go/qtls-go1-20.(*config).getCertificate(0xc0008d7800, 0xc000543450)
Aug 03 08:55:31 proxy caddy[1163492]:         github.com/quic-go/qtls-go1-20@v0.3.0/common.go:1086 +0x42
Aug 03 08:55:31 proxy caddy[1163492]: github.com/quic-go/qtls-go1-20.(*serverHandshakeStateTLS13).pickCertificate(0xc000631be8)
Aug 03 08:55:31 proxy caddy[1163492]:         github.com/quic-go/qtls-go1-20@v0.3.0/handshake_server_tls13.go:415 +0x66
Aug 03 08:55:31 proxy caddy[1163492]: github.com/quic-go/qtls-go1-20.(*serverHandshakeStateTLS13).handshake(0xc000631be8)
Aug 03 08:55:31 proxy caddy[1163492]:         github.com/quic-go/qtls-go1-20@v0.3.0/handshake_server_tls13.go:60 +0x53
Aug 03 08:55:31 proxy caddy[1163492]: github.com/quic-go/qtls-go1-20.(*Conn).serverHandshake(0xc00053d180, {0x1f09a50, 0xc000562fa0})
Aug 03 08:55:31 proxy caddy[1163492]:         github.com/quic-go/qtls-go1-20@v0.3.0/handshake_server.go:53 +0x188
Aug 03 08:55:31 proxy caddy[1163492]: github.com/quic-go/qtls-go1-20.(*Conn).handshakeContext(0xc00053d180, {0x1f09af8, 0xc0001753e0})
Aug 03 08:55:31 proxy caddy[1163492]:         github.com/quic-go/qtls-go1-20@v0.3.0/conn.go:1540 +0x3ce
Aug 03 08:55:31 proxy caddy[1163492]: github.com/quic-go/qtls-go1-20.(*Conn).HandshakeContext(0xc00005dfd0?, {0x1f09af8?, 0xc0001753e0?})
Aug 03 08:55:31 proxy caddy[1163492]:         github.com/quic-go/qtls-go1-20@v0.3.0/conn.go:1480 +0x25
Aug 03 08:55:31 proxy caddy[1163492]: created by github.com/quic-go/qtls-go1-20.(*QUICConn).Start
Aug 03 08:55:31 proxy caddy[1163492]:         github.com/quic-go/qtls-go1-20@v0.3.0/quic.go:179 +0xcf

Might be related to https://github.com/golang/go/issues/61639 ?

otbutz commented 1 year ago

@marten-seemann https://github.com/quic-go/quic-go/pull/4001

Animosity022 commented 1 year ago

I updated and tested 2.7.2 this morning had to roll back for this. I use a built version with CloudFlareDNS and the Security plugins only.

Aug 03 07:47:24 gemini systemd[1]: Started caddy.service - Caddy.
Aug 03 07:49:28 gemini caddy[27478]: panic: runtime error: invalid memory address or nil pointer dereference
Aug 03 07:49:28 gemini caddy[27478]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x907cf0]
Aug 03 07:49:28 gemini caddy[27478]: goroutine 842 [running]:
Aug 03 07:49:28 gemini caddy[27478]: github.com/caddyserver/certmagic.(*Config).getCertDuringHandshake(0xc0010ee410, {0x29e8728, 0xc000138008}, _, _)
Aug 03 07:49:28 gemini caddy[27478]:         github.com/caddyserver/certmagic@v0.19.1/handshake.go:378 +0x1390
Aug 03 07:49:28 gemini caddy[27478]: github.com/caddyserver/certmagic.(*Config).GetCertificateWithContext(0xc0010ee410, {0x29e8728, 0xc000138008}, 0xc0010ee270)
Aug 03 07:49:28 gemini caddy[27478]:         github.com/caddyserver/certmagic@v0.19.1/handshake.go:84 +0xbff
Aug 03 07:49:28 gemini caddy[27478]: github.com/caddyserver/certmagic.(*Config).GetCertificate(0xc0000f0ee0?, 0xc000578360?)
Aug 03 07:49:28 gemini caddy[27478]:         github.com/caddyserver/certmagic@v0.19.1/handshake.go:50 +0x2a
Aug 03 07:49:28 gemini caddy[27478]: github.com/caddyserver/caddy/v2/modules/caddytls.(*ConnectionPolicy).buildStandardTLSConfig.func1(0xc0010ee270)
Aug 03 07:49:28 gemini caddy[27478]:         github.com/caddyserver/caddy/v2@v2.7.2/modules/caddytls/connpolicy.go:232 +0x14f
Aug 03 07:49:28 gemini caddy[27478]: github.com/quic-go/qtls-go1-20.(*config).getCertificate(0xc001d9e480, 0xc0010ee270)
Aug 03 07:49:28 gemini caddy[27478]:         github.com/quic-go/qtls-go1-20@v0.3.0/common.go:1086 +0x42
Aug 03 07:49:28 gemini caddy[27478]: github.com/quic-go/qtls-go1-20.(*serverHandshakeStateTLS13).pickCertificate(0xc0005fdbe8)
Aug 03 07:49:28 gemini caddy[27478]:         github.com/quic-go/qtls-go1-20@v0.3.0/handshake_server_tls13.go:415 +0x66
Aug 03 07:49:28 gemini caddy[27478]: github.com/quic-go/qtls-go1-20.(*serverHandshakeStateTLS13).handshake(0xc0005fdbe8)
Aug 03 07:49:28 gemini caddy[27478]:         github.com/quic-go/qtls-go1-20@v0.3.0/handshake_server_tls13.go:60 +0x53
Aug 03 07:49:28 gemini caddy[27478]: github.com/quic-go/qtls-go1-20.(*Conn).serverHandshake(0xc000977500, {0x29e86f0, 0xc0005b24b0})
Aug 03 07:49:28 gemini caddy[27478]:         github.com/quic-go/qtls-go1-20@v0.3.0/handshake_server.go:53 +0x188
Aug 03 07:49:28 gemini caddy[27478]: github.com/quic-go/qtls-go1-20.(*Conn).handshakeContext(0xc000977500, {0x29e8798, 0xc000c90b40})
Aug 03 07:49:28 gemini caddy[27478]:         github.com/quic-go/qtls-go1-20@v0.3.0/conn.go:1540 +0x3ce
Aug 03 07:49:28 gemini caddy[27478]: github.com/quic-go/qtls-go1-20.(*Conn).HandshakeContext(0xbe592a?, {0x29e8798?, 0xc000c90b40?})
Aug 03 07:49:28 gemini caddy[27478]:         github.com/quic-go/qtls-go1-20@v0.3.0/conn.go:1480 +0x25
Aug 03 07:49:28 gemini caddy[27478]: created by github.com/quic-go/qtls-go1-20.(*QUICConn).Start
Aug 03 07:49:28 gemini caddy[27478]:         github.com/quic-go/qtls-go1-20@v0.3.0/quic.go:179 +0xcf
Aug 03 07:49:28 gemini systemd[1]: caddy.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 03 07:49:28 gemini systemd[1]: caddy.service: Failed with result 'exit-code'.
Aug 03 07:51:03 gemini systemd[1]: Starting caddy.service - Caddy...
marten-seemann commented 1 year ago

This is where the nil pointer dereference happens: https://github.com/caddyserver/certmagic/blob/v0.19.1/handshake.go#L378

This should have been fixed by https://github.com/quic-go/quic-go/pull/4001. Did you make sure that this patch is included here (released as v0.37.1)?

francislavoie commented 1 year ago

Yes @marten-seemann we built and released with 0.37.1 https://github.com/caddyserver/caddy/blob/e2fc08bd34cc17b8cbb6ac364fa1ec41c4c643b9/go.mod#L20

$ caddy build-info | grep quic                                                                                     
dep github.com/quic-go/qpack    v0.4.0  h1:Cr9BXA1sQS2SmDUWjSofMPNKmvF6IiIfDRmgU0w1ZCo=
dep github.com/quic-go/qtls-go1-20  v0.3.0  h1:NrCXmDl8BddZwO67vlvEpBTwT89bJfKYygxv4HQvuDk=
dep github.com/quic-go/quic-go  v0.37.1 h1:M+mcsFq9KoxVjCetIwH65TvusW1UdRBc6zmxI6pkeD0=
francislavoie commented 1 year ago

Oh, we built with Go 1.20.6 though :thinking: did we need to build with 1.20.7?

Edit: Nevermind that shouldn't have mattered according to the changes in https://github.com/golang/go/issues?q=milestone%3AGo1.20.7

marten-seemann commented 1 year ago

Oh, we built with Go 1.20.6 though πŸ€” did we need to build with 1.20.7?

Edit: Nevermind that shouldn't have mattered according to the changes in https://github.com/golang/go/issues?q=milestone%3AGo1.20.7

That's a different issue for which I'm about to cut a quic-go patch release today (the RSA key size DoS vulnerability fixed in Go 1.20.7, which I need to backport on my crypto/tls fork). I suggested including this in the Caddy patch release in https://github.com/caddyserver/caddy/issues/5671#issuecomment-1663111895.

marten-seemann commented 1 year ago

We should probably figure out why this panic is occurring before I cut v0.37.2. Any hints?

Animosity022 commented 1 year ago

I installed my first version from a download from the caddy website with the extra modules selected:

[felix@phoenix caddy]$ ./caddy build-info | head -5
go  go1.20.1
path    caddy
mod caddy   (devel)
dep filippo.io/edwards25519 v1.0.0  h1:0wAIcmJUqRdI8IJ/3eGi5/HwXZWPujYXXlkrQogz0Ek=
dep github.com/AndreasBriese/bbloom v0.0.0-20190825152654-46b345b51c96  h1:cTp8I5+VIoKjsnZuH8vjyaysT/ses3EvZeaV/1UkF2M=

My other box has it built from xcaddy with the latest GO version installed, which was 1.20.7.

[felix@gemini caddy]$ ./caddy build-info  | head -5
go  go1.20.7
path    caddy
mod caddy   (devel)
dep filippo.io/edwards25519 v1.0.0  h1:0wAIcmJUqRdI8IJ/3eGi5/HwXZWPujYXXlkrQogz0Ek=
dep github.com/AndreasBriese/bbloom v0.0.0-20190825152654-46b345b51c96  h1:cTp8I5+VIoKjsnZuH8vjyaysT/ses3EvZeaV/1UkF2M=

I've been up for ~10 minutes now without issues and it would panic after 2-3 minutes for me anyway.

mholt commented 1 year ago

@animosity22 In your build-info output, what is the version of quic-go?

mholt commented 1 year ago

@otbutz @animosity22 Additionally, can you provide a minimal reproducer for this? At least your config and possibly a request that triggers it?

Animosity022 commented 1 year ago

Here are the two:

[felix@gemini caddy]$ ./caddy build-info | grep quic
dep github.com/quic-go/qpack    v0.4.0  h1:Cr9BXA1sQS2SmDUWjSofMPNKmvF6IiIfDRmgU0w1ZCo=
dep github.com/quic-go/qtls-go1-20  v0.3.0  h1:NrCXmDl8BddZwO67vlvEpBTwT89bJfKYygxv4HQvuDk=
dep github.com/quic-go/quic-go  v0.37.1 h1:M+mcsFq9KoxVjCetIwH65TvusW1UdRBc6zmxI6pkeD0=
[felix@gemini caddy]$ ./caddy version
v2.7.2 h1:QqThyoyUFAv1B7A2NMeaWlz7xmgKqU49PXBX08A+6xg=
[felix@gemini caddy]$

and

[felix@phoenix caddy]$ ./caddy build-info | head -5
go  go1.20.1
path    caddy
mod caddy   (devel)
dep filippo.io/edwards25519 v1.0.0  h1:0wAIcmJUqRdI8IJ/3eGi5/HwXZWPujYXXlkrQogz0Ek=
dep github.com/AndreasBriese/bbloom v0.0.0-20190825152654-46b345b51c96  h1:cTp8I5+VIoKjsnZuH8vjyaysT/ses3EvZeaV/1UkF2M=
[felix@phoenix caddy]$ ./caddy build-info | grep quic
dep github.com/quic-go/qpack    v0.4.0  h1:Cr9BXA1sQS2SmDUWjSofMPNKmvF6IiIfDRmgU0w1ZCo=
dep github.com/quic-go/qtls-go1-20  v0.1.0  h1:d1PK3ErFy9t7zxKsG3NXBJXZjp/kMLoIb3y/kV54oAI=
dep github.com/quic-go/quic-go  v0.32.0 h1:lY02md31s1JgPiiyfqJijpu/UX/Iun304FI3yUqX7tA=
[felix@phoenix caddy]$ ./caddy version
v2.6.4 h1:2hwYqiRwk1tf3VruhMpLcYTg+11fCdr8S3jhNAdnPy8=

My config is small I think as I only use the security plugin to do SSO auth and that protects my site.

[felix@phoenix caddy]$ cat Caddyfile
{
    email {env.EMAIL}
    storage file_system {
        root /opt/caddy/ssl
    }
    order authenticate before respond
    order authorize before basicauth

    security {
        oauth identity provider github {env.GITHUB_CLIENT_ID} {env.GITHUB_CLIENT_SECRET}

        authentication portal myportal {
            crypto default token lifetime 604800
            crypto key sign-verify {env.JWT_SHARED_KEY}
            cookie domain blah.us
            cookie lifetime 604800
            enable identity provider github
            ui {
                links {
                    "My Identity" "/whoami" icon "las la-user"
                }
            }

            transform user {
                match realm github
                match sub github.com/user1234
                action add role authp/admin
            }
        }

        authorization policy mypolicy {
            set auth url https://auth.blah.us/oauth2/github
            crypto key verify {env.JWT_SHARED_KEY}
            allow roles authp/admin authp/user
            validate bearer header
            inject headers with claims
        }
    }
}

auth.blah.us {
    tls {
        dns cloudflare {env.CLOUDFLARE_API_TOKEN}
        resolvers 1.1.1.1
    }
    authenticate with myportal
}

I wasn't doing anything specific to generate as the automatic SSL renewal would kick in and that would produce the error as it happened right after startup.

mholt commented 1 year ago

@animosity22 Thanks. The phoenix instance is clearly using too-old of a quic-go version for some reason: dep github.com/quic-go/quic-go v0.32.0. It needs to be on 0.37.1 or newer. (EDIT: Oh, but it's also running Caddy 2.6.4, not 2.7. So not relevant? Unless are you seeing this bug in 2.6.4 as well?)

Do they both have the same config? I wonder if a plugin is causing the downgrade.

But then again, @otbutz says he gets the error using Caddy from the apt repository... (meaning without plugins) :thinking: We are also deploying that Caddy on our website and I can't find this error in our logs at all.

Animosity022 commented 1 year ago

Oh sorry, one second, I forgot I downgraded my phoenix back back to the older version, let me update it and reproduce.

m4r-v1n commented 1 year ago

Just installed 2.7.2 on Debian 12 via this repo: https://dl.cloudsmith.io/public/caddy/stable/deb/debian

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x903750]
goroutine 28 [running]:
github.com/caddyserver/certmagic.(*Config).getCertDuringHandshake(0xc0003cf6c0, {0x1f09a88, 0xc000042058}, _, _)
        github.com/caddyserver/certmagic@v0.19.1/handshake.go:378 +0x1390
github.com/caddyserver/certmagic.(*Config).GetCertificateWithContext(0xc0003cf6c0, {0x1f09a88, 0xc000042058}, 0xc0003cf5f0)
        github.com/caddyserver/certmagic@v0.19.1/handshake.go:84 +0xbff
github.com/caddyserver/certmagic.(*Config).GetCertificate(0xc0007f2ee0?, 0xc000855620?)
        github.com/caddyserver/certmagic@v0.19.1/handshake.go:50 +0x2a
github.com/caddyserver/caddy/v2/modules/caddytls.(*ConnectionPolicy).buildStandardTLSConfig.func1(0xc0003cf5f0)
        github.com/caddyserver/caddy/v2@v2.7.2/modules/caddytls/connpolicy.go:232 +0x14f
github.com/quic-go/qtls-go1-20.(*config).getCertificate(0xc000003680, 0xc0003cf5f0)
        github.com/quic-go/qtls-go1-20@v0.3.0/common.go:1086 +0x42
github.com/quic-go/qtls-go1-20.(*serverHandshakeStateTLS13).pickCertificate(0xc000867be8)
        github.com/quic-go/qtls-go1-20@v0.3.0/handshake_server_tls13.go:415 +0x66
github.com/quic-go/qtls-go1-20.(*serverHandshakeStateTLS13).handshake(0xc000867be8)
        github.com/quic-go/qtls-go1-20@v0.3.0/handshake_server_tls13.go:60 +0x53
github.com/quic-go/qtls-go1-20.(*Conn).serverHandshake(0xc0003e7180, {0x1f09a50, 0xc00003f130})
        github.com/quic-go/qtls-go1-20@v0.3.0/handshake_server.go:53 +0x188
github.com/quic-go/qtls-go1-20.(*Conn).handshakeContext(0xc0003e7180, {0x1f09af8, 0xc0005c08d0})
        github.com/quic-go/qtls-go1-20@v0.3.0/conn.go:1540 +0x3ce
github.com/quic-go/qtls-go1-20.(*Conn).HandshakeContext(0xc00041f7d0?, {0x1f09af8?, 0xc0005c08d0?})
        github.com/quic-go/qtls-go1-20@v0.3.0/conn.go:1480 +0x25
created by github.com/quic-go/qtls-go1-20.(*QUICConn).Start
        github.com/quic-go/qtls-go1-20@v0.3.0/quic.go:179 +0xcf

my Caddyfile:

{
    email redacted@example.com
}

(cors) {
    @cors_preflight method OPTIONS
    @cors header Origin {args.0}

    handle @cors_preflight {
        header Access-Control-Allow-Origin {args.0}
        header Access-Control-Allow-Methods "GET"
        header Access-Control-Allow-Headers "X-Requested-With"
        header Access-Control-Max-Age 3600
        respond "" 204
    }

    handle @cors {
        header Access-Control-Allow-Origin {args.0}
    }
}

(global-headers) {
    header {
        -Server
        Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
        X-Content-Type-Options nosniff
        X-Frame-Options SAMEORIGIN
    }
}

(php) {
    php_fastcgi unix//run/php/php8.1-fpm.sock
}

import sites/*

and an example sites/site:

example.com {

    root * /srv/www/example/public
    file_server
    encode zstd gzip

    import global-headers

}
Animosity022 commented 1 year ago

Hmm, I've tried a bit more on the reproducing the issue side with a few scenarios.

@raregems-io - if you restart a few times, does it clear up?

mholt commented 1 year ago

Thanks for the info. I'll try to reproduce it.

I should point out that our production website and the forums are both using 2.7.2 with similar configs without issues.

marten-seemann commented 1 year ago

lmk if there’s anything you need from the quic-go side. I’ll be holding off on cutting the RSA patch release until we’ve reached a conclusion here.

Animosity022 commented 1 year ago

Yeah, I hate bugs like this as it happened to me twice on one box and zero times on my other box. Same build process, identical config minus 2 different site names.

Aug  3 07:45:49 gemini caddy[27228]: panic: runtime error: invalid memory address or nil pointer dereference
Aug  3 07:49:28 gemini caddy[27478]: panic: runtime error: invalid memory address or nil pointer dereference

and since those 2, I've tried restarting ~30 times and I can't get it to reproduce.

mholt commented 1 year ago

@marten-seemann Thanks. Really appreciate your collaboration and patience. Are there any other code paths that could allow that Conn() or RemoteAddr() to be nil? Given the spurious nature of the reports, it seems like there's either some nondeterminism here or some other code path that we didn't think of that causes them to not be populated.

bt90 commented 1 year ago

Is that codepath always used or only if a on-demand certificate is created? That would explain why the error isn't that common as the mechanism for HTTP2 and 1.1 isn't affected.

mholt commented 1 year ago

@raregems-io I am unable to reproduce the issue with that config, using HTTP/3, and trying in a loop of about a thousand requests... Tried restarting the server several times and still can't reproduce it.

RainmakerRaw commented 1 year ago

I seem to have this issue also, but missed this existing one (sorry!). See #5681 and merge if desired. Apologies!

RainmakerRaw commented 1 year ago

@mholt How can I help you reproduce this? It happens every single time I visit the URL without fail. Restart the caddy.service and it's fine (and serving the other sites without issue), but visit Plex or Jellyfin and boom, issue recurs again and again. I'm happy to liaise with you if you want to let me know how I can help?

e: I notice the site loads OK in Firefox, no issues. Dev tools shows some items using http 1.1 and others (over half the resources) are loaded with http3. Using Brave, the site refuses to load at all and the Caddy server crashes. I wonder if the agent you're testing from is having an impact (or lack of)?

mholt commented 1 year ago

@RainmakerRaw I too have a Jellyfin installation. I can try proxying to it? But I feel like it shouldn't matter what kind of content we're serving, since the bug occurs before connections are even established.

e: I notice the site loads OK in Firefox, no issues. Dev tools shows some items using http 1.1 and others (over half the resources) are loaded with http3. Using Brave, the site refuses to load at all and the Caddy server crashes. I wonder if the agent you're testing from is having an impact (or lack of)?

If that's the case, then why aren't we seeing this on our production caddyserver.com and caddy.community sites?

The bug does seem sporadic though. @marten-seemann I dunno if any of this info is helpful, but I'm starting to wonder if there's another code path we missed. I'm not too familiar with the quic-go library though to know.

RainmakerRaw commented 1 year ago

@mholt Aha! I don't know what effect this has, but I can bypass the issue and make Brave work again. On my test desktop client, Firefox was using DoH by default (my own AdGuard Home installation on a VPS). Brave was set to use the local DNS server, and causes Caddy to crash. When I set Brave to also use offsite DoH the Jellyfin server loads fine again and Caddy doesn't crash. I'll be honest, I don't know what this means, but it's progress maybe?

RainmakerRaw commented 1 year ago

Further digging: Cloudflare has enabled ECH (encrypted client hello) on my domain. The Caddy server (and Jellyfin server) are running on the same LAN as my test desktop device - i.e. I'm hairpinning back in through the router to connect.

The local DNS server and the remote DoH server are both AdGuard Home using the same encrypted upstreams, but I wonder if something's getting mangled and causing the crash? I can connect to Plex and Jellyfin (using their domain not the local address) perfectly now I have enabled DoH in the browser. Both browsers are set to use ECH where available - is yours?

In summary, ECH is enabled on both browsers. With DoH and ECH enabled, Firefox connects to Jellyfin fine and Caddy doesn't crash. With Brave, ECH being enabled but DoH not causes Caddy to crash and no page loads. Adding DoH to the ECH on Brave makes everything work OK again.

Edit: Disabling ECH in the browser allows the Jellyfin site to load over QUIC with or without DoH enabled. I reloaded dozens of times without incident and Caddy no longer crashes. This seems to be ECH related.

marten-seemann commented 1 year ago

@marten-seemann I dunno if any of this info is helpful, but I'm starting to wonder if there's another code path we missed. I'm not too familiar with the quic-go library though to know.

Good thinking! In fact there is: tls.Config.GetCertificate also passes in a ClientHelloInfo. I feel really stupid for missing this πŸ˜”. For some reason I thought that tls.Config.GetConfigForClient would be the only place, and didn't even bother checking. This means we'll need to wrap GetCertificate as well, equivalent to what we do in https://github.com/quic-go/quic-go/pull/4001/files#diff-c1eebd1c52f66da524b99c57f22669500f22784ae80475fbf2b786bc0d1ba278.

Fix coming later today. This will be included in the v0.37.2 patch release.

RainmakerRaw commented 1 year ago

@marten-seemann This ties in with my last edit above. ECH was enabled in Brave for me, and I needed to also enable DoH to not have Caddy crash. Enabling ECH + DoH meant no crash and the page loads. Disabling ECH in the browser means the page loads (and Caddy doesn't crash) regardless of DoH status.

mholt commented 1 year ago

@marten-seemann Oh, you are awesome. Thank you so much. I don't know if I would have been able to figure that out. We're really lucky to have you building and maintaining that lib!

Once we merge the fix I would like as many people involved here to confirm that it works if you could please. :sweat_smile:

marten-seemann commented 1 year ago

Fix is ready at https://github.com/quic-go/quic-go/pull/4014.

mholt commented 1 year ago

You're a wizard Marten. Thanks so much for the patch!

mholt commented 1 year ago

@otbutz @bt90 @animosity22 @raregems-io @RainmakerRaw Since I was never able to reproduce this, could you all try it out? I've pushed a commit that upgrades quic-go with the patch to master, so if you build caddy from source or use xcaddy like xcaddy build master ... you should be able to verify whether this works. There are also CI artifacts you can download and try.

As soon as we get some confirmation, I'll tag One. More. Release.

Thank you!

RainmakerRaw commented 1 year ago

@mholt Sorry Matt it's 2am here, I only just saw this by chance. Compiling now...

RainmakerRaw commented 1 year ago

@mholt bad news:

caddy version
v2.7.3-0.20230804004403-51b1bfb12505 h1:bOvmG2P3zkUCfK+uPZUFTnTYmdI3MZ3fDSKms0jgWtc=
sudo journalctl -xeu caddy
β–‘β–‘
β–‘β–‘ The job identifier is 6612.
Aug 04 02:32:54 rock5b caddy[11574]: panic: runtime error: invalid memory address or nil pointer dereference
Aug 04 02:32:54 rock5b caddy[11574]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x47b1dc]
Aug 04 02:32:54 rock5b caddy[11574]: goroutine 167 [running]:
Aug 04 02:32:54 rock5b caddy[11574]: github.com/caddyserver/certmagic.(*Config).getCertDuringHandshake(0x400086c820, {0x1963238, 0x40001a6008}, _, _)
Aug 04 02:32:54 rock5b caddy[11574]:         github.com/caddyserver/certmagic@v0.19.1/handshake.go:378 +0xf8c
Aug 04 02:32:54 rock5b caddy[11574]: github.com/caddyserver/certmagic.(*Config).GetCertificateWithContext(0x400086c820, {0x1963238, 0x40001a6008}, 0x400086c7>
Aug 04 02:32:54 rock5b caddy[11574]:         github.com/caddyserver/certmagic@v0.19.1/handshake.go:84 +0x9c4
Aug 04 02:32:54 rock5b caddy[11574]: github.com/caddyserver/certmagic.(*Config).GetCertificate(0x40009fe380?, 0x4000734738?)
Aug 04 02:32:54 rock5b caddy[11574]:         github.com/caddyserver/certmagic@v0.19.1/handshake.go:50 +0x30
Aug 04 02:32:54 rock5b caddy[11574]: github.com/caddyserver/caddy/v2/modules/caddytls.(*ConnectionPolicy).buildStandardTLSConfig.func1(0x400086c750)
Aug 04 02:32:54 rock5b caddy[11574]:         github.com/caddyserver/caddy/v2@v2.7.3-0.20230804004403-51b1bfb12505/modules/caddytls/connpolicy.go:232 +0x144
Aug 04 02:32:54 rock5b caddy[11574]: github.com/quic-go/qtls-go1-20.(*config).getCertificate(0x4000780600, 0x400086c750)
Aug 04 02:32:54 rock5b caddy[11574]:         github.com/quic-go/qtls-go1-20@v0.3.1/common.go:1086 +0x44
Aug 04 02:32:54 rock5b caddy[11574]: github.com/quic-go/qtls-go1-20.(*serverHandshakeStateTLS13).pickCertificate(0x4000621bb0)
Aug 04 02:32:54 rock5b caddy[11574]:         github.com/quic-go/qtls-go1-20@v0.3.1/handshake_server_tls13.go:415 +0x5c
Aug 04 02:32:54 rock5b caddy[11574]: github.com/quic-go/qtls-go1-20.(*serverHandshakeStateTLS13).handshake(0x4000621bb0)
Aug 04 02:32:54 rock5b caddy[11574]:         github.com/quic-go/qtls-go1-20@v0.3.1/handshake_server_tls13.go:60 +0x40
Aug 04 02:32:54 rock5b caddy[11574]: github.com/quic-go/qtls-go1-20.(*Conn).serverHandshake(0x40007a1180, {0x1963200, 0x40002ae460})
Aug 04 02:32:54 rock5b caddy[11574]:         github.com/quic-go/qtls-go1-20@v0.3.1/handshake_server.go:53 +0x138
Aug 04 02:32:54 rock5b caddy[11574]: github.com/quic-go/qtls-go1-20.(*Conn).handshakeContext(0x40007a1180, {0x19632a8, 0x400001bcb0})
Aug 04 02:32:54 rock5b caddy[11574]:         github.com/quic-go/qtls-go1-20@v0.3.1/conn.go:1540 +0x348
Aug 04 02:32:54 rock5b caddy[11574]: github.com/quic-go/qtls-go1-20.(*Conn).HandshakeContext(0x0?, {0x19632a8?, 0x400001bcb0?})
Aug 04 02:32:54 rock5b caddy[11574]:         github.com/quic-go/qtls-go1-20@v0.3.1/conn.go:1480 +0x24
Aug 04 02:32:54 rock5b caddy[11574]: created by github.com/quic-go/qtls-go1-20.(*QUICConn).Start
Aug 04 02:32:54 rock5b caddy[11574]:         github.com/quic-go/qtls-go1-20@v0.3.1/quic.go:179 +0xc8
Aug 04 02:32:54 rock5b systemd[1]: caddy.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
β–‘β–‘ Subject: Unit process exited
β–‘β–‘ Defined-By: systemd
β–‘β–‘ Support: https://www.debian.org/support
β–‘β–‘
β–‘β–‘ An ExecStart= process belonging to unit caddy.service has exited.
β–‘β–‘
β–‘β–‘ The process' exit code is 'exited' and its exit status is 2.
Aug 04 02:32:54 rock5b systemd[1]: caddy.service: Failed with result 'exit-code'.
β–‘β–‘ Subject: Unit failed
β–‘β–‘ Defined-By: systemd
β–‘β–‘ Support: https://www.debian.org/support
β–‘β–‘
β–‘β–‘ The unit caddy.service has entered the 'failed' state with result 'exit-code'.
mholt commented 1 year ago

@marten-seemann Dang, there's still a code path we're missing.

@RainmakerRaw Thank you so much for trying it out. Sorry to keep you up :sweat_smile: Get some good sleep, hopefully we'll have it figured out soon.

RainmakerRaw commented 1 year ago

Reverting to v2.6.4 in the interim. You'll get there guys, thanks for all your hard work!

marten-seemann commented 1 year ago

@marten-seemann Dang, there's still a code path we're missing.

Where? I don't see it. There's only the two code paths that I fixed: https://pkg.go.dev/crypto/tls (search for ClientHelloInfo).

RainmakerRaw commented 1 year ago

@mholt @marten-seemann Guys I didn't go to sleep (yet)... I got 2.7.3 working by changing a setting in Cloudflare. Just about to test 2.7.2 also just in case. I clearly am missing something here. To stop this being more of an XY problem:

My domain is registered at Cloudflare. I have a free CF account and so can't proxy Plex, Jellyfin etc per ToS. DNS at Cloudflare is set like this:

A domain.com 1.2.3.4 # My home WAN IP, proxied by Cloudflare)
CNAME jellyfin.domain.com domain.com (**not** proxied by Cloudflare as it's video)

Cloudflare has enabled ECH on my domain. Firefox and Brave are both set up to use ECH where available, and both are using DoH to my own off-prem server (AdGuard Home on a VPS). When connecting to jellyfin.domain.com in Firefox, everything works. When connecting using Brave, Caddy crashes.

However, I have now discovered that setting the CNAME for Jellyfin (and Plex) to temporarily also be proxied by Cloudflare (no video served, just to test the home page loading) then everything works, even in Brave! No crash on Caddy.

Is this helpful?

marten-seemann commented 1 year ago

The Go standard library doesn't even support ECH. It seems like debugging thing 2 layers higher than is useful, I think it would be more helpful to figure out which code path is taken that leads to the panic.

RainmakerRaw commented 1 year ago

@marten-seemann I see. My sincere apologies then. I just noticed that I can make Caddy crash or work by enabling and disabling ECH in the browser, or by toggling proxying in CF. As it turns out, changing Jellyfin's and Plex's CNAME for an A record in CF has now stopped Caddy crashing at all, so far. Everything works OK in all tested browsers and v2.7.3 doesn't crash any more. I guess it's a coincidence, you certainly would know more than me, I'm just a (new) user with a enough *nix knowledge to be dangerous haha.

I think it would be more helpful to figure out which code path is taken that leads to the panic.

In that case I'm definitely of no use to you. Sorry for cluttering up the thread (and pinging you unnecessarily). Best of luck solving it, I'll keep an eye out for updates. Best wishes.

mholt commented 1 year ago

It's weird that some deployments don't see it at all (like ours), but others ONLY see this. I can't figure it out.

francislavoie commented 1 year ago

Okay! I have a way to replicate it.

Simple config:

{
    debug
    admin off
    http_port 8881
}

https://localhost:8882 {
    respond "foo"
}

Run with ./caddy run --config Caddyfile as normal.

(I'm using non-standard ports and turning off admin because I have another Caddy instance running already)

Making a request for the wrong SNI (i.e. foo.localhost when Caddy only has localhost) with this command (container with a custom curl with http3 support, because it's the easiest way for me to try http3 reliably):

$ docker run -it --rm --net=host ymuski/curl-http3 curl -vsk -o/dev/null --http3-only https://foo.localhost:8882

@marten-seemann hopefully that should be enough for you to be able to do some fmt.Printf() tracing and such?

Stacktrace with Go 1.21 if it helps at all (i.e. using stdlib TLS stack instead of qtls):

goroutine 82 [running]:
github.com/caddyserver/certmagic.(*Config).getCertDuringHandshake(0xc0001db380, {0x1e4e970, 0x2b323a0}, _, _)
    /home/francis/go/pkg/mod/github.com/caddyserver/certmagic@v0.19.1/handshake.go:378 +0x1340
github.com/caddyserver/certmagic.(*Config).GetCertificateWithContext(0xc0001db380, {0x1e4e970, 0x2b323a0}, 0xc0001db110)
    /home/francis/go/pkg/mod/github.com/caddyserver/certmagic@v0.19.1/handshake.go:84 +0xbc5
github.com/caddyserver/certmagic.(*Config).GetCertificate(...)
    /home/francis/go/pkg/mod/github.com/caddyserver/certmagic@v0.19.1/handshake.go:50
github.com/caddyserver/caddy/v2/modules/caddytls.(*ConnectionPolicy).buildStandardTLSConfig.func1(0xc0001db110)
    /home/francis/repos/caddy/modules/caddytls/connpolicy.go:232 +0x170
crypto/tls.(*Config).getCertificate(0xc0004e2680, 0xc0001db110)
    /home/francis/sdk/gotip/src/crypto/tls/common.go:1116 +0x3b
crypto/tls.(*serverHandshakeStateTLS13).pickCertificate(0xc000735bf8)
    /home/francis/sdk/gotip/src/crypto/tls/handshake_server_tls13.go:435 +0x314
crypto/tls.(*serverHandshakeStateTLS13).handshake(0xc000735bf8)
    /home/francis/sdk/gotip/src/crypto/tls/handshake_server_tls13.go:59 +0x53
crypto/tls.(*Conn).serverHandshake(0xc000216e00, {0x1e4e8c8, 0xc000a04190})
    /home/francis/sdk/gotip/src/crypto/tls/handshake_server.go:53 +0x185
crypto/tls.(*Conn).handshakeContext(0xc000216e00, {0x1e4e890, 0xc000130210})
    /home/francis/sdk/gotip/src/crypto/tls/conn.go:1547 +0x3d3
crypto/tls.(*Conn).HandshakeContext(0xc0005077d0?, {0x1e4e890?, 0xc000130210?})
    /home/francis/sdk/gotip/src/crypto/tls/conn.go:1487 +0x1d
created by crypto/tls.(*QUICConn).Start in goroutine 69
    /home/francis/sdk/gotip/src/crypto/tls/quic.go:177 +0xc9
francislavoie commented 1 year ago

I have a possible patch (seems to work!) but it doesn't feel correct to me and I don't know if there's terrible side-effects, because this affects not only H3 but also H1/H2 when the change is made here:

diff --git a/modules/caddytls/connpolicy.go b/modules/caddytls/connpolicy.go
index a4dc4119..524fa36e 100644
--- a/modules/caddytls/connpolicy.go
+++ b/modules/caddytls/connpolicy.go
@@ -115,7 +115,13 @@ func (cp ConnectionPolicies) TLSConfig(_ caddy.Context) *tls.Config {
                                                continue policyLoop
                                        }
                                }
-                               return pol.TLSConfig, nil
+                               cfg := pol.TLSConfig.Clone()
+                               gc := cfg.GetCertificate
+                               cfg.GetCertificate = func(clientHello *tls.ClientHelloInfo) (*tls.Certificate, error) {
+                                       clientHello.Conn = hello.Conn
+                                       return gc(clientHello)
+                               }
+                               return cfg, nil
                        }

                        return nil, fmt.Errorf("no server TLS configuration available for ClientHello: %+v", hello)
marten-seemann commented 1 year ago

@francislavoie Have you figured out why it's panicking without this change?

francislavoie commented 1 year ago

Not quite.

Basically, we have the concept of "connection policies" (i.e. Caddy config which chooses a TLS config to use with "matchers", e.g. depending on SNI or client certs, etc) and quic-go doesn't see those underlying TLS configs. Our GetConfigForClient returns one of those configs.

The GetConfigForClient does have the correct hello.Conn, so we can wrap it ourselves and pass it down and that seems to fix it.

But what I don't know is why the underlying GetCertificate gets a "wrong" hello (with the conn missing). What else is creating/managing/cloning a hello? I don't think we do anything like that in Caddy, so I feel like that must be something in quic-go or stdlib.

Edit: Anyway you should be able to trace this pretty easily with a replace github.com/quic-go/quic-go => ../quic-go in Caddy's go.mod and just cd cmd/caddy && go build. I don't know how much else I can figure out, I'm feeling out of my depth here.

marten-seemann commented 1 year ago

Are you saying that GetConfigForClient might return a tls.Config, and then the GetCertificate callback is called on that config? Does quic-go need to handle this recursive case?

francislavoie commented 1 year ago

Are you saying that GetConfigForClient might return a tls.Config

Yes; that is the point of that callback! :sweat_smile: It returns a config that is appropriate given the hello (i.e. different config per SNI)

and then the GetCertificate callback is called on that config?

Yeah, but I don't know who calls this. I think stdlib does, according to the stacktrace, with crypto/tls.(*serverHandshakeStateTLS13).pickCertificate (in Go 1.21; qtls probably does something similar?)

Does quic-go need to handle this recursive case?

I really don't know whose job it is to handle this. I don't know where the hello struct is coming from.

mholt commented 1 year ago

I'm off to sleep for the night, but in the morning I may try a git bisect to see where this started happening and see if that helps us pinpoint it some more.

But yeah, GetConfigForClient is a case where the initial TLS config is only used to get a different config based on the properties of the handshake.

However, I don't believe it would be called more than once per handshake :thinking:

Thanks everyone for continuing to help me with this :pray: Definitely can't solve this one by myself.

WeidiDeng commented 1 year ago

quic-go can't handle this recursive call without forking stdlib. I can modify caddy to handle this situation, but the problem is the nil net.Conn.

WeidiDeng commented 1 year ago

@otbutz @animosity22 @raregems-io Can you try xcaddy build fix-quic-nil-conn to see if it's fixed?