fabiolb / fabio

Consul Load-Balancing made simple
https://fabiolb.net
MIT License
7.27k stars 616 forks source link

TCP proxy broken since v1.5.8 #524

Closed sugdyzhekov closed 5 years ago

sugdyzhekov commented 6 years ago

Impact It's not possible to connect to MySQL since v1.5.8

STR

Expected result mysql client can establish connection to MySQL service

Actual result mysql client hangs during connection establishment.

Additional info v1.5.7

$ nc 10.0.0.10 3306
J
5.6.40G!
B\'J}[0z��9KM"^612FK%umysql_native_password

v1.5.8

$ nc 10.0.0.10 3306
<empty output, process hangs until enter press>
J
5.6.40G!
B\'J}[0z��9KM"^612FK%umysql_native_password

Configuration

proxy_addr

:3306;proto=tcp,:80,:22;proto=tcp,:23;proto=tcp,:8080,:1025;proto=tcp,:5672;proto=tcp,:5673;proto=tcp;cs=certstore,:443;cs=certstore,:8125;proto=tcp,:9000;cs=certstore,:27017;proto=tcp

mysql service register at Consul with following tag: urlprefix-:3306 proto=tcp

aaronhurt commented 6 years ago

@sugdyzhekov Interesting, could you attach a tcpdump in addition to the nc output?

magiconair commented 6 years ago

Could be this line: https://github.com/fabiolb/fabio/pull/423/files#diff-7c73f6a0cf2ff2671c80a43d0aad92a3R46

this is the sni proxy. sorry

pschultz commented 5 years ago

Just discovered this issue too (I assume it's the same). We noticed because our SMTP server seemed to be unresponsive.

If the server starts the conversation after a client connects (as opposed to the client sending the first message, which works fine), fabio doesn't send data to the client.

Both SMTP and MySQL servers start a connection by sending a banner first.

Here is my reproducer:

// smtp_mock.go
package main

import (
    "fmt"
    "log"
    "net"
)

func main() {
    log.Println("Listening on :8025")
    l, err := net.Listen("tcp", ":8025")
    if err != nil {
        log.Fatal(err)
    }

    for {
        conn, err := l.Accept()
        if err != nil {
            log.Fatal(err)
        }

        log.Println("serving", conn.RemoteAddr())
        fmt.Fprintf(conn, "220 localhost ESMTP Go\r\n")
        conn.Close()
    }
}
$ go run smtp_mock.go
2018/12/06 14:48:42 Listening on :8025

Next shell:

$ curl -XPUT localhost:8500/v1/agent/service/register -d@-<<<'
{
    "Address": "127.0.0.1",
    "Check": {
        "Interval": "5s",
        "TCP": "localhost:8025"
    },
    "ID": "09bf6151-243b-487e-a062-ea9741839818",
    "Name": "mock-smtpd",
    "Port": 8025,
    "Tags": [
        "dev-fabio-:8026"
    ]
}'
$ fabio \
    -registry.consul.tagprefix=dev-fabio- \
    -proxy.addr=':8026;proto=tcp' \
    -log.level=TRACE

Next shell:

# Connect directly
$ telnet localhost 8025
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
220 localhost ESMTP Go go1.11.2
Connection closed by foreign host.

# Connect via fabio
$ telnet localhost 8026
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
<hangs indefinitely>
asdf<enter>
220 localhost ESMTP Go go1.11.2
Connection closed by foreign host.

There's nothing interesting in fabio's logs, not even at TRACE level. Git bisect points fingers at a28d24440a78c441e189e90d8671332122fea90f:

a28d24440a78c441e189e90d8671332122fea90f is the first bad commit
commit a28d24440a78c441e189e90d8671332122fea90f
Author: Aaron Hurt
Date:   Mon Feb 12 21:24:00 2018 -0600

    add basic ip centric access control on routes

:040000 040000 8c2e3c4fd614bac972c3e5cf7f3c5d4403e4b462 3cdbf0b8f4e80bf5a4a0a6567227c936cb0cc77a M      docs
:040000 040000 e046f4421d2a958f63712cc59a843509ae486d86 5681ab0cef7610c12c29a35a6fb262d856f7a952 M      proxy
:040000 040000 34db75362cd6f9ab3a9125f8802e32f892bdc678 491cb45fa2072e457fec37e32cdb49cb1ae383b5 M      route
pschultz commented 5 years ago

Here's a tcpdump. Note the two second gap after the third packet. That's where telnet was connected to fabio, but I didn't do anything. Then I hit enter, and then fabio established a connection to the backend.

15:15:26.826545 IP 127.0.0.1.53618 > 127.0.0.1.8026: Flags [S], seq 2552979650, win 43690, options [mss 65495,sackOK,TS val 3127993891 ecr 0,nop,wscale 7], length 0
15:15:26.826565 IP 127.0.0.1.8026 > 127.0.0.1.53618: Flags [S.], seq 193147615, ack 2552979651, win 43690, options [mss 65495,sackOK,TS val 3127993891 ecr 3127993891,nop,wscale 7], length 0
15:15:26.826586 IP 127.0.0.1.53618 > 127.0.0.1.8026: Flags [.], ack 1, win 342, options [nop,nop,TS val 3127993891 ecr 3127993891], length 0
15:15:28.529959 IP 127.0.0.1.53618 > 127.0.0.1.8026: Flags [P.], seq 1:3, ack 1, win 342, options [nop,nop,TS val 3127995594 ecr 3127993891], length 2
15:15:28.529988 IP 127.0.0.1.8026 > 127.0.0.1.53618: Flags [.], ack 3, win 342, options [nop,nop,TS val 3127995594 ecr 3127995594], length 0
15:15:28.530519 IP 127.0.0.1.54084 > 127.0.0.1.8025: Flags [S], seq 2746406685, win 43690, options [mss 65495,sackOK,TS val 3127995595 ecr 0,nop,wscale 7], length 0
15:15:28.530559 IP 127.0.0.1.8025 > 127.0.0.1.54084: Flags [S.], seq 2444991286, ack 2746406686, win 43690, options [mss 65495,sackOK,TS val 3127995595 ecr 3127995595,nop,wscale 7], length 0
15:15:28.530587 IP 127.0.0.1.54084 > 127.0.0.1.8025: Flags [.], ack 1, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 0
15:15:28.530775 IP 127.0.0.1.8025 > 127.0.0.1.54084: Flags [P.], seq 1:25, ack 1, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 24
15:15:28.530805 IP 127.0.0.1.54084 > 127.0.0.1.8025: Flags [.], ack 25, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 0
15:15:28.530844 IP 127.0.0.1.8025 > 127.0.0.1.54084: Flags [F.], seq 25, ack 1, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 0
15:15:28.530846 IP 127.0.0.1.54084 > 127.0.0.1.8025: Flags [P.], seq 1:3, ack 25, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 2
15:15:28.530898 IP 127.0.0.1.8025 > 127.0.0.1.54084: Flags [R], seq 2444991311, win 0, length 0
15:15:28.531008 IP 127.0.0.1.8026 > 127.0.0.1.53618: Flags [P.], seq 1:25, ack 3, win 342, options [nop,nop,TS val 3127995595 ecr 3127995594], length 24
15:15:28.531030 IP 127.0.0.1.53618 > 127.0.0.1.8026: Flags [.], ack 25, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 0
15:15:28.531097 IP 127.0.0.1.8026 > 127.0.0.1.53618: Flags [F.], seq 25, ack 3, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 0
15:15:28.531151 IP 127.0.0.1.53618 > 127.0.0.1.8026: Flags [F.], seq 3, ack 26, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 0
15:15:28.531178 IP 127.0.0.1.8026 > 127.0.0.1.53618: Flags [.], ack 4, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 0
aaronhurt commented 5 years ago

Interesting and thank you for detailed troubleshooting. That makes a lot of sense. We use TCP proxy but haven't noticed any breakage but in our case it is all client initiated.

aaronhurt commented 5 years ago

... and it seems I may be to blame :)

@pschultz could you try reproducing this with master? I just made some changes to the ip access control that might be the solution. In short that section wasn't properly getting the remote connection address previously.

aaronhurt commented 5 years ago

digging into this now ... using your steps I can reproduce it on master

aaronhurt commented 5 years ago

It is indeed the AccessDenied check but I'm not yet sure why. Will continue to investigate but commenting out the relevant lines in proxy/tcp/tcp_proxy.go does make both test scenarios work.

aaronhurt commented 5 years ago

I found the issue but I'm going to need to do a bit of research ... steps so far:

Using the same test scenario as above with these code changes in route/access_rules.go:

// AccessDeniedTCP checks rules on the target for TCP proxy routes.
func (t *Target) AccessDeniedTCP(c net.Conn) bool {
    log.Printf("[DEBUG] in AccessDeniedTCP 0")
    var addr *net.TCPAddr
    var ok bool
    // validate remote address assertion
    if addr, ok = c.RemoteAddr().(*net.TCPAddr); !ok {
        log.Printf("[ERROR] failed to assert remote connection address for %s", t.Service)
        return false
    }
    log.Printf("[DEBUG] in AccessDeniedTCP 1 addr=%+v", addr)
    // check remote connection address
    if t.denyByIP(addr.IP) {
        return true
    }
    log.Printf("[DEBUG] in AccessDeniedTCP 2")
    // default allow
    return false
}

Watching the fabio logs when telnetting to port 8026 from a remote host on the LAN ...

2018/12/06 14:12:49 [DEBUG] in AccessDeniedTCP 0
2018/12/06 14:12:53 [DEBUG] in AccessDeniedTCP 1 addr=100.127.255.1:22900
2018/12/06 14:12:53 [DEBUG] in AccessDeniedTCP 2

The 0 logging line is printed right after telnet collects. The 1 logging line is printed as soon as I press <enter> on the remote machine. It appears that golang is blocking and not returning from the RemoteAddr() function until input is received from the client.

aaronhurt commented 5 years ago

This looks potentially related: https://github.com/golang/go/issues/12943

aaronhurt commented 5 years ago

This as well: https://github.com/armon/go-proxyproto/issues/1

aaronhurt commented 5 years ago

The issue seems to be in the proxy protocol implementation ...

// RemoteAddr returns the address of the client if the proxy
// protocol is being used, otherwise just returns the address of
// the socket peer. If there is an error parsing the header, the
// address of the client is not returned, and the socket is closed.
// Once implication of this is that the call could block if the
// client is slow. Using a Deadline is recommended if this is called
// before Read()
func (p *Conn) RemoteAddr() net.Addr {
    p.once.Do(func() {
        if err := p.checkPrefix(); err != nil && err != io.EOF {
            log.Printf("[ERR] Failed to read proxy prefix: %v", err)
        }
    })
    if p.srcAddr != nil {
        return p.srcAddr
    }
    return p.conn.RemoteAddr()
}

We would either have to disable proxy protocol or get the remote IP without checking for proxy protocol for server initiated requests.

aaronhurt commented 5 years ago

Commenting out the proxy protocol wrapper allows the test to proceed.

With ln = &proxyproto.Listener{Listener: ln} commented out in proxy/listen.go the tests work with the ip access checks.

Logging output from a telnet check against the mock server from a remote host on the LAN:

2018/12/06 15:45:13 [DEBUG] in AccessDeniedTCP 0
2018/12/06 15:45:13 [DEBUG] in AccessDeniedTCP 1 addr=100.127.255.1:22982
2018/12/06 15:45:13 [DEBUG] in AccessDeniedTCP 2

How do we want to proceed?

aaronhurt commented 5 years ago

My initial thoughts:

  1. Make proxy protocol selectable per listener (default to false)
    proxy.addr = 1.2.3.4:443;proxyproto=tue, \
             1.2.3.5:3306;proto=tcp
  2. Skip access rule checking before requesting the remote address if there are not rules defined
  3. Update documentation to indicate that enabling proxyproto on a tcp listener with access or deny rules defined will break with server initiated services

I think making people explicitly enable proxy protocol is also better from a security perspective.

magiconair commented 5 years ago

So this is broken for TCP proxies which do not use the PROXY protocol and where the client is waiting for an initial handshake from the server which never comes b/c fabio does not forward the connection (i.e. MySQL)? And this started to become an issue when we added access control since we're looking at the RemoteAddr which is a blocking call in the ProxyListener? HTTP is not affected since this was fixed in the stdlib, AFAICT.

Making the PROXY protocol configurable is the better solution, IMO. (I should have done this when I added that.) People know when they are using it, one would assume. The question is what to do with the default. We can have on and off but both settings have drawbacks. Switching to off by default may break existing setups. Leaving it on by default means to keep it broken for users.

Who is affected if we switch the default to off? Users on AWS with access control, for example. If we disable PROXY and they upgrade then the protocol breaks since the server is getting the PROXY header. Or it may silently work since the server can handle the PROXY protocol.

The 1.5.8 behavior was that TCP just worked and I'd prefer if we can get back to that state.

Maybe we also add an auto mode as the new default, where we establish the connection, buffer the first 100 bytes from the client and check for the PROXY header in the TCP proxy directly to handle access control (if there is any).

However, keeping it simple also has advantages. If we decided to switch to off by default we should mitigate the impact:

pschultz commented 5 years ago

I'll see if I can put together a PoC for automatic detection.

That being said, if there is a toggle for PROXY Protocol this enables another use-case: checking that a connection is from a particular proxy (as opposed to a particular client downstream). Not sure how common that is, or if it's ever been asked for, though.

shantanugadgil commented 5 years ago

It would be good if the backend could decide if it wants to be enable PROXY or not.

I dunno if thats doable (single listener on Fabio sends PROXY to one backend but not PROXY to another backend)

pschultz commented 5 years ago

armon/go-proxyproto#4 adds a ProxyHeaderTimeout option for the listener. If there is no proxy header within that timeout, it falls back to plain TCP proxying.

After some superficial testing, updating goproxy to at least 49fdb5c and exposing that option seems to be a good compromise. We can have it default to something sensible (say, 5ms), and we can interpret zero or a negative value as "disable PROXY protocol support".

In addition we shouldn't call RemoteAddr() unless there are access rules defined.

This will make fabio work for MySQL in the default configuration, costing a slight increase in connect latency, for exactly those users that use access rules and no PROXY protocol.

There is also a chance of subtle breakage by choosing too small a timeout.

pschultz commented 5 years ago

If you want to play around, I just pushed to https://github.com/classmarkets/fabio/tree/fix-tcp (classmarkets/fabio@7246b924d892dc146ec0c3c8790045632225e71a)

aaronhurt commented 5 years ago

I made a few commits to a new branch referenced above that address the core points mentioned thus far. The current proxy-protocol toggle defaults to false but that may need to change per @magiconair comment's above.

Regardless of the default setting I believe additional documentation and release notes will be warranted.

magiconair commented 5 years ago

@pschultz I've invited you to be a collaborator so that we can work on the same branches.

magiconair commented 5 years ago

Please note that this change disabled PROXY protocol by default and you have to enable it if you need it. If you feel this is a grave mistake, the please let us know. (I'll mull over this a bit more before creating a new release).

pschultz commented 5 years ago

@pschultz I've invited you to be a collaborator so that we can work on the same branches.

@magiconair, thank you very much. Would you also add my SSH key, please?

ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBFNuN8ltMS1Eg034DHD0fekzKSkdlzlWeSMbJy/LJxOYEUQMgNdtQ7GzDZUsArTObenJRKMnpyP6I2mcwesw0xM= pschultz
magiconair commented 5 years ago

@pschultz I've enabled your account to be a collaborator with write permissions. That should be sufficient. If you want to use a deploy key instead then it has to be unique for this repo since you cannot use the same key for multiple repos. Trying to add this key returns "key already in use".

pashinin commented 5 years ago

When will this change be released? Can you make 1.5.11?

magiconair commented 5 years ago

I'll try to release 1.5.11 tonight CET time.