Exa-Networks / exabgp

The BGP swiss army knife of networking
Other
2.09k stars 447 forks source link

'NoneType' object has no attribute 'router_id' log message during connection failure. #828

Closed ederlf closed 5 years ago

ederlf commented 6 years ago
Summary

In my setup (A simulated data center), when trying to connect multiple exabgp instances, some connections fail and the following messages appear in the log near the notification of connection failure.

Jun 14 17:57:06 horse exabgp[12008]: reactor       async | 5a1508fa-6ffc-11e8-8195-024c6ac3d37f | problem with function
Jun 14 17:57:06 horse exabgp[12008]: reactor       async | 5a1508fa-6ffc-11e8-8195-024c6ac3d37f | 'NoneType' object has no attribute 'router_id'
OS
Linux Ubuntu 16.04
uname -a
Linux horse 4.4.0-127-generic #153-Ubuntu SMP Sat May 19 10:58:46 UTC 2018 x86_64 x86_64 x86_64 GNU
Version
ExaBGP : 4.0.6-daa3d6ba
Python : 2.7.12 (default, Dec  4 2017, 14:50:18)  [GCC 5.4.0 20160609]
Uname  : Linux horse 4.4.0-127-generic #153-Ubuntu SMP Sat May 19 10:58:46 UTC 2018 x86_64
Root   : /usr/local
Installation
pip
Environment
The output is empty

I am executing using the following env variables:

env exabgp.daemon.daemonize=true exabgp.tcp.bind=<list_of_ports> exabgp.log.level=DEBUG exabgp.log.all=true exabgp.reactor.speed=1.0 exabgp.log.destination=syslog exabgp <config_file>
Configuration

The configuration is not the most simple, I'll try to give an overview. There are 20 Exabgp instances. They are interconnected like a Fat Tree network, exactly like the picture below. (Ignore the IP addressing though). The number of sessions is equal to the number of links: 36.

Foo

All follow the same configuration pattern, here is the configuration of a node in the edge.

process client{
    run /usr/bin/python /home/vagrant/horse/python/router_client.py 3_0_1;
    encoder json;
}

template {
    neighbor router {
        family {
            ipv4 unicast;
        }
        api speaking {
            processes [client];
            neighbor-changes;
            receive {
                parsed;
                update;
            }
        }
        local-as 65007;
        manual-eor true;
    }
}

neighbor 10.1.12.2 {
    inherit router;
    peer-as 64604;
    router-id 192.168.13.254;
    local-address 10.1.12.1;
    group-updates false;
    adj-rib-in false;
}

neighbor 10.1.13.2 {
    inherit router;
    peer-as 64604;
    router-id 192.168.13.254;
    local-address 10.1.13.1;
    group-updates false;
    adj-rib-in false;
}

Here is the configuration of the router at the aggregation level, connected to the above speaker.

process client{
    run /usr/bin/python /home/vagrant/horse/python/router_client.py 3_2_1;
    encoder json;
}

template {
    neighbor router {
        family {
            ipv4 unicast;
        }
        api speaking {
            processes [client];
            neighbor-changes;
            receive {
                parsed;
                update;
            }
        }
        local-as 64604;
        manual-eor true;
    }
}

neighbor 10.1.12.1 {
    inherit router;
    peer-as 65007;
    router-id 10.1.14.2;
    local-address 10.1.12.2;
    group-updates false;
    adj-rib-in false;
}

neighbor 10.1.14.1 {
    inherit router;
    peer-as 65008;
    router-id 10.1.14.2;
    local-address 10.1.14.2;
    group-updates false;
    adj-rib-in false;
}

neighbor 20.1.12.2 {
    inherit router;
    peer-as 65534;
    router-id 10.1.14.2;
    local-address 20.1.12.1;
    group-updates false;
    adj-rib-in false;
}

neighbor 20.1.13.2 {
    inherit router;
    peer-as 65534;
    router-id 10.1.14.2;
    local-address 20.1.13.1;
    group-updates false;
    adj-rib-in false;
}
Program output

Here follows the output. You can see that 10.1.12.2 is first trying to connect with 10.1.12.1 (router-id:192.168.13.254). Then the error comes in the 4th line. Next, the other router tries to establish a session that results in failure.

Jun 14 17:56:36 horse exabgp[11997]: outgoing-4    attempting connection to 10.1.12.2:179
Jun 14 17:56:36 horse exabgp[12001]: reactor       async | 47de79fa-6ffc-11e8-8195-024c6ac3d37f | check new connection
Jun 14 17:56:36 horse exabgp[12001]: network       connection from 10.1.12.2
Jun 14 17:56:36 horse exabgp[12001]: reactor       async | 47de79fa-6ffc-11e8-8195-024c6ac3d37f | problem with function
Jun 14 17:56:36 horse exabgp[12001]: reactor       async | 47de79fa-6ffc-11e8-8195-024c6ac3d37f | 'NoneType' object has no attribute 'router_id'
Jun 14 17:56:36 horse exabgp[12001]: incoming-6    connection to 10.1.12.2 closed
Jun 14 17:56:36 horse exabgp[12001]: incoming-6    incoming-6 10.1.12.1-10.1.12.2, closing connection
Jun 14 17:56:36 horse exabgp[11997]: outgoing-4    sending TCP payload (  49) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0031 0104 FDEF 00B4 C0A8 0DFE 1402 0601 0400 0100 0102 0206 0002 0641 0400 00FD EF
Jun 14 17:56:36 horse exabgp[11997]: outgoing-4    >> OPEN version=4 asn=65007 hold_time=180 router_id=192.168.13.254 capabilities=[Multiprotocol(ipv4 unicast), Extended Message(65535), ASN4(65007)]
Jun 14 17:56:36 horse exabgp[11997]: outgoing-4    outgoing-4 10.1.12.1-10.1.12.2, closing connection
Jun 14 17:56:36 horse exabgp[11997]: outgoing-4    outgoing-4 10.1.12.1-10.1.12.2 10.1.12.2 lost TCP session with peer
Jun 14 17:56:36 horse exabgp[11997]: outgoing-4    peer reset, message [closing connection] error[the TCP connection was closed by the remote end]
Jun 14 17:56:36 horse exabgp[11997]: outgoing-4    outgoing-4 10.1.12.1-10.1.12.2, closing connection

The problem also happens with other speakers. This is just an illustration from the two posted configuration files.

Steps to reproduce

The problems does not happen every time, and since it is not a simple setup it may not be that easy to test. If the current information is not enough to get a clue, I can provide the environment it is being executed along with the necessary python scripts.

Edit: Updated to correct configuration.

thomas-mangin commented 6 years ago

ah ! the joy of dynamic languages ... Are you using the API to announce some routes, possibly using next-hop self ?

ederlf commented 6 years ago

I am announcing routes, but not using self.

The next hop is the same as the interface a session is running. This is how the announcements look like:

neighbor  20.1.12.2 announce route 140.0.0.1/24  next-hop 20.1.12.1  as-path [  64604 ]
thomas-mangin commented 6 years ago

Could you please run exabgp with the -d option.

Before a line such as

Jun 14 17:57:06 horse exabgp[12008]: reactor       async | 5a1508fa-6ffc-11e8-8195-024c6ac3d37f | problem with function
Jun 14 17:57:06 horse exabgp[12008]: reactor       async | 5a1508fa-6ffc-11e8-8195-024c6ac3d37f | 'NoneType' object has no attribute 'router_id'

There should be another with the same UUID ( in that case 5a1508fa-6ffc-11e8-8195-024c6ac3d37f ) looking like async | UUID | command I would like you to look at the command which caused the issue.

Also it is likely that not far previously in the log is the exact command when it was parsed.

This information would greatly help me.

ederlf commented 6 years ago

The log that appears before the error is: check new connection.

Jun 26 07:06:27 horse exabgp[9532]: outgoing-11   sending TCP payload (  49) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0031 0104 FC59 00B4 0A01 0702 1402 0601 0400 0100 0102 0206 0002 0641 0400 00FC 59
Jun 26 07:06:27 horse exabgp[9536]: reactor       async | 71da1a9a-790f-11e8-baa7-024c6ac3d37f | check new connection
Jun 26 07:06:27 horse exabgp[9536]: network       connection from 10.1.1.1
Jun 26 07:06:27 horse exabgp[9536]: reactor       async | 71da1a9a-790f-11e8-baa7-024c6ac3d37f | problem with function
Jun 26 07:06:27 horse exabgp[9536]: reactor       async | 71da1a9a-790f-11e8-baa7-024c6ac3d37f | 'NoneType' object has no attribute 'router_id'
Jun 26 07:06:27 horse exabgp[9532]: outgoing-11   >> OPEN version=4 asn=64601 hold_time=180 router_id=10.1.7.2 capabilities=[Multiprotocol(ipv4 unicast), Extended Message(65535), ASN4(64601)]
Jun 26 07:06:27 horse exabgp[9536]: incoming-8    connection to 10.1.1.1 closed
Jun 26 07:06:27 horse exabgp[9536]: incoming-8    incoming-8 10.1.1.2-10.1.1.1, closing connection
Jun 26 07:06:27 horse kernel: [295588.396182] device 3_2_1-inet-ext left promiscuous mode
Jun 26 07:06:27 horse exabgp[9526]: ka-incoming-1 send-timer 47 second(s) left
Jun 26 07:06:27 horse exabgp[9526]: ka-outgoing-1 receive-timer 165 second(s) left
Jun 26 07:06:27 horse exabgp[9526]: ka-outgoing-1 send-timer 47 second(s) left
Jun 26 07:06:27 horse exabgp[9532]: outgoing-11   outgoing-11 10.1.1.2-10.1.1.1, closing connection
Jun 26 07:06:27 horse exabgp[9532]: outgoing-11   peer reset, message [closing connection] error[issue reading on the socket: [Errno ECONNRESET] [Errno 104] Connection reset by peer]

I traced it down to find where router_id was being accessed to find the None object. exabgp/reactor/loop.py#L243

self.async.schedule(str(uuid.uuid1()),'check new connection',self.listener.new_connections())

exabgp/reactor/listener.py##L184

denied = peer.handle_connection(connection)

exabgp/reactor/peer.py#L208

remote_id = self.proto.negotiated.received_open.router_id.pack()

self.proto.negotiated.received_open is the culprit.

thomas-mangin commented 6 years ago

/me is really confused ... It should be impossible to have an FSM of OPENCONFIRM and a received_open set as None !!!

Looking at the code for quite some time and scratching my head, the only thing which would have mad sense would have been for the router-id to not be set in the neighbour but your configuration has it ... I added some extra checks for that case - to be safe in case the parser has an issue. When I tested the code told me that you had a duplicate peer, and I had to remove a peer definition.

Any way to get access to the test bed to experience the bug ?

ederlf commented 6 years ago

@thomas-mangin I understand why you are puzzled. By simply looking the code I could not also find the reason it is none. But I can guarantee it is after printing it

if self.fsm == FSM.OPENCONFIRM:
                        # We cheat: we are not really reading the OPEN, we use the data we have instead
                        # it does not matter as the open message will be the same anyway
                        self.logger.debug("Received open type is  %s" % type(self.proto.negotiated.received_open))
                        local_id = self.neighbor.router_id.pack()
                        remote_id = self.proto.negotiated.received_open.router_id.pack()  

I get during a connection failure (7th line):

Jun 26 20:15:32 horse exabgp[16951]: outgoing-5    sending TCP payload (  59) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 003B 0200 0000 2040 0101 0040 0212 0204 0000 FC59 0000 FFFE 0000 FC5A 0000 FDED 4003 0414 0107 0118 C0A8 0D
Jun 26 20:15:32 horse exabgp[16972]: reactor       async | add3dfd6-797d-11e8-bace-024c6ac3d37f | check new connection
Jun 26 20:15:32 horse exabgp[16951]: outgoing-5    >> 1 UPDATE(s)
Jun 26 20:15:32 horse exabgp[16972]: network       connection from 20.1.14.1
Jun 26 20:15:32 horse exabgp[16951]: incoming-1    received TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 003F 02
Jun 26 20:15:32 horse exabgp[16951]: incoming-1    received TCP payload (  44) 0000 0024 4001 0100 4002 1602 0500 00FD E900 00FC 5900 00FF FE00 00FC 5A00 00FD ED40 0304 0A01 0201 18C0 A80F
Jun 26 20:15:32 horse exabgp[16972]:               Received open type is  <type 'NoneType'>
Jun 26 20:15:32 horse exabgp[16951]: incoming-1    << message of type UPDATE
Jun 26 20:15:32 horse exabgp[16972]: reactor       async | add3dfd6-797d-11e8-bace-024c6ac3d37f | problem with function
Jun 26 20:15:32 horse exabgp[16951]: parser        parsing UPDATE (  44) 0000 0024 4001 0100 4002 1602 0500 00FD E900 00FC 5900 00FF FE00 00FC 5A00 00FD ED40 0304 0A01 0201 18C0 A80F
Jun 26 20:15:32 horse exabgp[16972]: reactor       async | add3dfd6-797d-11e8-bace-024c6ac3d37f | 'NoneType' object has no attribute 'router_id'
Jun 26 20:15:32 horse exabgp[16943]: incoming-3    received TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 003B 02
Jun 26 20:15:32 horse exabgp[16943]: incoming-3    received TCP payload (  40) 0000 0020 4001 0100 4002 1202 0400 00FC 5900 00FF FE00 00FC 5A00 00FD ED40 0304 1401 0701 18C0 A80D

I will try to provide an exact copy of my scenario in a VM.

thomas-mangin commented 6 years ago

Could you please also send me a full copy of your exabgp log (happy to get it per mail at first@last.com)

thomas-mangin commented 6 years ago

@ederlf Could you please update to master - re-run and give me again a copy of the full logs please ? I have added a little of debugging info in the log.

thomas-mangin commented 6 years ago

@ederlf ping ?

ederlf commented 6 years ago

@thomas-mangin I've sent the logs last week directly to your e-mail. If you did not get it, I've copied it in paste bin.

This time I have saved individual logs per peer. It might be easier to see the problem without all the other peers' messages. https://pastebin.com/NnGy5pnr https://pastebin.com/3UKnYEBC

thomas-mangin commented 5 years ago

This is untested and just my intuition of what was wrong ... If you still mind could you please check ?

ederlf commented 5 years ago

I tested and the error seems to be gone :-)

thomas-mangin commented 5 years ago

🕺 Thanks

leonnnn commented 5 years ago

To add another perspective, I’ve found this issue is still present with ExaBGP 4.0.10 as well as git master.

Like @ederlf, I’ve hit this while simulating a network of routers with sessions setup between multiple instances of ExaBGP. Perhaps this means it’s triggered in circumstances where the two peers are started at nearly the same time in an automated fashion.

Studying the code, I too am unable to see how received_open can end up being set to None, but it definitely is when this bug is hit.

Following a hunch from reading the code, I’ve found this only occurs when both peers are active and one of the two TCP connections between them needs to be discarded.

Therefore, a viable workaround for me was to generate the configs in way that ensures only one peer of each session is set to active and the other is passive, and I haven’t hit this bug since then. Perhaps this helps in debugging further. Other than that, I can’t provide further insight right now unfortunately.

thomas-mangin commented 5 years ago

@leonnnn thank you for the help pointing me toward the conflict resolution code. Could you please open an new issue ?

thomas-mangin commented 5 years ago

@leonnnn I performed a stress test of this code path and could not reproduce ... You can have multiple exabgp installed on a same machine. If you had for example the OS packaged version and later install with pip you could still have the old version running .. Could you make sure this is not the case before opening the issue please ! Also if you run exabgp with -p -d the program will drop you in the python debugger and a backtrace would be most useful.