Closed ederlf closed 5 years ago
Ah! The joy of dynamic languages... Are you using the API to announce some routes, possibly using next-hop self?
I am announcing routes, but not using self.
The next hop is the address of the interface the session is running on. This is what the announcements look like:
neighbor 20.1.12.2 announce route 140.0.0.1/24 next-hop 20.1.12.1 as-path [ 64604 ]
Could you please run exabgp with the -d option?
Before a line such as
Jun 14 17:57:06 horse exabgp[12008]: reactor async | 5a1508fa-6ffc-11e8-8195-024c6ac3d37f | problem with function
Jun 14 17:57:06 horse exabgp[12008]: reactor async | 5a1508fa-6ffc-11e8-8195-024c6ac3d37f | 'NoneType' object has no attribute 'router_id'
There should be another line with the same UUID (in this case 5a1508fa-6ffc-11e8-8195-024c6ac3d37f) looking like
async | UUID | command
I would like you to look at the command which caused the issue.
Also, the exact command is likely to appear a little earlier in the log, at the point where it was parsed.
This information would greatly help me.
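Matching an error to its command by UUID can be automated. The following is a minimal sketch, assuming the syslog-style format shown in this thread (reactor async | UUID | text) and assuming that the first async line logged for a UUID carries the command. The function name commands_before_errors is hypothetical and not part of ExaBGP.

```python
import re

# Matches lines such as:
#   ... exabgp[12008]: reactor async | <uuid> | <text>
ASYNC_LINE = re.compile(
    r"async\s*\|\s*(?P<uuid>[0-9a-f-]{36})\s*\|\s*(?P<text>.*)$"
)


def commands_before_errors(lines):
    """Map each UUID that logged 'problem with function' to its command.

    Assumption: the first async line seen for a UUID is the command.
    """
    first_text = {}  # uuid -> text of the first async line seen
    failed = {}      # uuid -> command that was running when it failed
    for line in lines:
        match = ASYNC_LINE.search(line)
        if not match:
            continue  # not an async line (TCP payload, timers, ...)
        uuid = match.group("uuid")
        text = match.group("text").strip()
        if text == "problem with function":
            failed[uuid] = first_text.get(uuid, "<command not found>")
        else:
            first_text.setdefault(uuid, text)
    return failed
```

Feeding it the lines of a log file, for example via open(path) as the iterable, returns a dictionary keyed by the UUIDs that failed.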
The log that appears before the error is: check new connection.
Jun 26 07:06:27 horse exabgp[9532]: outgoing-11 sending TCP payload ( 49) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0031 0104 FC59 00B4 0A01 0702 1402 0601 0400 0100 0102 0206 0002 0641 0400 00FC 59
Jun 26 07:06:27 horse exabgp[9536]: reactor async | 71da1a9a-790f-11e8-baa7-024c6ac3d37f | check new connection
Jun 26 07:06:27 horse exabgp[9536]: network connection from 10.1.1.1
Jun 26 07:06:27 horse exabgp[9536]: reactor async | 71da1a9a-790f-11e8-baa7-024c6ac3d37f | problem with function
Jun 26 07:06:27 horse exabgp[9536]: reactor async | 71da1a9a-790f-11e8-baa7-024c6ac3d37f | 'NoneType' object has no attribute 'router_id'
Jun 26 07:06:27 horse exabgp[9532]: outgoing-11 >> OPEN version=4 asn=64601 hold_time=180 router_id=10.1.7.2 capabilities=[Multiprotocol(ipv4 unicast), Extended Message(65535), ASN4(64601)]
Jun 26 07:06:27 horse exabgp[9536]: incoming-8 connection to 10.1.1.1 closed
Jun 26 07:06:27 horse exabgp[9536]: incoming-8 incoming-8 10.1.1.2-10.1.1.1, closing connection
Jun 26 07:06:27 horse kernel: [295588.396182] device 3_2_1-inet-ext left promiscuous mode
Jun 26 07:06:27 horse exabgp[9526]: ka-incoming-1 send-timer 47 second(s) left
Jun 26 07:06:27 horse exabgp[9526]: ka-outgoing-1 receive-timer 165 second(s) left
Jun 26 07:06:27 horse exabgp[9526]: ka-outgoing-1 send-timer 47 second(s) left
Jun 26 07:06:27 horse exabgp[9532]: outgoing-11 outgoing-11 10.1.1.2-10.1.1.1, closing connection
Jun 26 07:06:27 horse exabgp[9532]: outgoing-11 peer reset, message [closing connection] error[issue reading on the socket: [Errno ECONNRESET] [Errno 104] Connection reset by peer]
I traced the calls back to where router_id is accessed on the None object.
exabgp/reactor/loop.py#L243
self.async.schedule(str(uuid.uuid1()),'check new connection',self.listener.new_connections())
exabgp/reactor/listener.py#L184
denied = peer.handle_connection(connection)
exabgp/reactor/peer.py#L208
remote_id = self.proto.negotiated.received_open.router_id.pack()
self.proto.negotiated.received_open is the culprit.
/me is really confused ...
It should be impossible to have an FSM state of OPENCONFIRM and a received_open set to None!!!
After looking at the code for quite some time and scratching my head, the only thing which would have made sense would have been for the router-id to not be set in the neighbour, but your configuration has it... I added some extra checks for that case, to be safe in case the parser has an issue. When I tested, the code told me that you had a duplicate peer, and I had to remove a peer definition.
Any way to get access to the test bed to experience the bug ?
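A defensive check around the failing attribute access could look roughly like the following. This is a minimal sketch and not ExaBGP's actual code: OpenMessage, Negotiated and remote_router_id are hypothetical stand-ins, and only the attribute names received_open and router_id come from the error in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OpenMessage:
    """Hypothetical stand-in for a parsed OPEN message."""
    router_id: str


@dataclass
class Negotiated:
    """Hypothetical stand-in for the negotiation state of a session."""
    received_open: Optional[OpenMessage] = None


def remote_router_id(negotiated: Negotiated) -> Optional[str]:
    """Return the peer's router ID, or None if no OPEN was recorded."""
    received = negotiated.received_open
    if received is None:
        # Avoids "'NoneType' object has no attribute 'router_id'";
        # the caller decides how to treat the missing OPEN.
        return None
    return received.router_id
```

Such a guard only hides the symptom; it does not explain why the state is inconsistent in the first place.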
@thomas-mangin I understand why you are puzzled. Just by looking at the code, I could not find the reason it is None either. But I can guarantee it is, after printing it:
if self.fsm == FSM.OPENCONFIRM:
    # We cheat: we are not really reading the OPEN, we use the data we have instead
    # it does not matter as the open message will be the same anyway
    self.logger.debug("Received open type is %s" % type(self.proto.negotiated.received_open))
    local_id = self.neighbor.router_id.pack()
    remote_id = self.proto.negotiated.received_open.router_id.pack()
During a connection failure I get the following (see the 7th line):
Jun 26 20:15:32 horse exabgp[16951]: outgoing-5 sending TCP payload ( 59) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 003B 0200 0000 2040 0101 0040 0212 0204 0000 FC59 0000 FFFE 0000 FC5A 0000 FDED 4003 0414 0107 0118 C0A8 0D
Jun 26 20:15:32 horse exabgp[16972]: reactor async | add3dfd6-797d-11e8-bace-024c6ac3d37f | check new connection
Jun 26 20:15:32 horse exabgp[16951]: outgoing-5 >> 1 UPDATE(s)
Jun 26 20:15:32 horse exabgp[16972]: network connection from 20.1.14.1
Jun 26 20:15:32 horse exabgp[16951]: incoming-1 received TCP payload ( 19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 003F 02
Jun 26 20:15:32 horse exabgp[16951]: incoming-1 received TCP payload ( 44) 0000 0024 4001 0100 4002 1602 0500 00FD E900 00FC 5900 00FF FE00 00FC 5A00 00FD ED40 0304 0A01 0201 18C0 A80F
Jun 26 20:15:32 horse exabgp[16972]: Received open type is <type 'NoneType'>
Jun 26 20:15:32 horse exabgp[16951]: incoming-1 << message of type UPDATE
Jun 26 20:15:32 horse exabgp[16972]: reactor async | add3dfd6-797d-11e8-bace-024c6ac3d37f | problem with function
Jun 26 20:15:32 horse exabgp[16951]: parser parsing UPDATE ( 44) 0000 0024 4001 0100 4002 1602 0500 00FD E900 00FC 5900 00FF FE00 00FC 5A00 00FD ED40 0304 0A01 0201 18C0 A80F
Jun 26 20:15:32 horse exabgp[16972]: reactor async | add3dfd6-797d-11e8-bace-024c6ac3d37f | 'NoneType' object has no attribute 'router_id'
Jun 26 20:15:32 horse exabgp[16943]: incoming-3 received TCP payload ( 19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 003B 02
Jun 26 20:15:32 horse exabgp[16943]: incoming-3 received TCP payload ( 40) 0000 0020 4001 0100 4002 1202 0400 00FC 5900 00FF FE00 00FC 5A00 00FD ED40 0304 1401 0701 18C0 A80D
I will try to provide an exact copy of my scenario in a VM.
Could you please also send me a full copy of your exabgp log? (Happy to get it by mail at first@last.com)
@ederlf Could you please update to master, re-run, and give me a copy of the full logs again? I have added a little debugging info to the log.
@ederlf ping ?
@thomas-mangin I sent the logs directly to your e-mail last week. In case you did not get them, I've also copied them to Pastebin.
This time I have saved individual logs per peer. It might be easier to see the problem without all the other peers' messages. https://pastebin.com/NnGy5pnr https://pastebin.com/3UKnYEBC
This is untested and just my intuition of what was wrong... If you don't mind, could you please check?
I tested and the error seems to be gone :-)
🕺 Thanks
To add another perspective, I’ve found this issue is still present with ExaBGP 4.0.10 as well as git master.
Like @ederlf, I’ve hit this while simulating a network of routers with sessions set up between multiple instances of ExaBGP. Perhaps this means it’s triggered in circumstances where the two peers are started at nearly the same time in an automated fashion.
Studying the code, I too am unable to see how received_open can end up being set to None, but it definitely is when this bug is hit.
Following a hunch from reading the code, I’ve found this only occurs when both peers are active and one of the two TCP connections between them needs to be discarded.
Therefore, a viable workaround for me was to generate the configs in a way that ensures only one peer of each session is set to active and the other is passive, and I haven’t hit this bug since then. Perhaps this helps in debugging further. Other than that, I can’t provide further insight right now, unfortunately.
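The workaround needs a deterministic way to pick one active side per session when generating the configs. The following is a minimal sketch of one possible rule, in which the numerically lower address becomes the active side. The rule and the function name assign_roles are assumptions for illustration, and mapping the result onto actual ExaBGP configuration directives is not shown.

```python
import ipaddress


def assign_roles(sessions):
    """For each (addr_a, addr_b) session, pick exactly one active side.

    Assumed rule (arbitrary but deterministic): the numerically lower
    address is active and the higher one is passive.
    Returns {(local, remote): "active" | "passive"} for both directions.
    """
    roles = {}
    for addr_a, addr_b in sessions:
        # Compare as IP addresses, not as strings ("9.x" < "10.x").
        low, high = sorted((addr_a, addr_b), key=ipaddress.ip_address)
        roles[(low, high)] = "active"
        roles[(high, low)] = "passive"
    return roles
```

Because both ends apply the same rule to the same pair of addresses, the two generated configs always agree on who initiates the TCP connection, so no connection collision should arise.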
@leonnnn thank you for the help pointing me toward the conflict resolution code. Could you please open a new issue?
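For readers unfamiliar with the conflict resolution in question: when two TCP connections exist between the same pair of speakers, RFC 4271 section 6.8 decides which one to drop by comparing BGP identifiers as unsigned 32-bit integers. The following is a minimal sketch of that rule only, not ExaBGP's implementation; the function name connection_to_close is hypothetical.

```python
import ipaddress


def connection_to_close(local_id: str, remote_id: str) -> str:
    """Apply the RFC 4271 section 6.8 rule to two colliding connections.

    Returns "existing" if the connection already in OpenConfirm must be
    closed, or "new" if the connection with the newly received OPEN
    must be closed.
    """
    local = int(ipaddress.IPv4Address(local_id))
    remote = int(ipaddress.IPv4Address(remote_id))
    if local < remote:
        # Local identifier is lower: drop the existing connection and
        # accept the one initiated by the remote system.
        return "existing"
    # Otherwise keep the existing connection and drop the new one.
    return "new"
```

The comparison needs the remote identifier, which is exactly the value that is unavailable when received_open is None.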
@leonnnn I performed a stress test of this code path and could not reproduce the problem... You can have multiple exabgp versions installed on the same machine. If, for example, you had the OS packaged version and later installed with pip, you could still have the old version running. Could you make sure this is not the case before opening the issue, please?
Also, if you run exabgp with -p -d, the program will drop you into the Python debugger, and a backtrace would be most useful.
Summary
In my setup (a simulated data center), when trying to connect multiple exabgp instances, some connections fail and the following messages appear in the log near the notification of connection failure.
OS
Version
Installation
Environment
I am executing using the following env variables:
Configuration
The configuration is not the simplest, so I'll try to give an overview. There are 20 ExaBGP instances. They are interconnected like a Fat Tree network, exactly like the picture below (ignore the IP addressing, though). The number of sessions is equal to the number of links: 36.
All follow the same configuration pattern. Here is the configuration of a node at the edge.
Here is the configuration of the router at the aggregation level, connected to the above speaker.
Program output
Here follows the output. You can see that 10.1.12.2 is first trying to connect with 10.1.12.1 (router-id:192.168.13.254). Then the error comes in the 4th line. Next, the other router tries to establish a session that results in failure.
The problem also happens with other speakers. This is just an illustration from the two posted configuration files.
Steps to reproduce
The problem does not happen every time, and since it is not a simple setup it may not be that easy to test. If the current information is not enough to get a clue, I can provide the environment it is being executed in, along with the necessary Python scripts.
Edit: Updated to correct configuration.