bgp / stayrtr

RPKI-To-Router server implementation in Go
BSD 3-Clause "New" or "Revised" License
85 stars 13 forks source link

Error Response message not sent #90

Closed cjeker closed 1 year ago

cjeker commented 1 year ago

There is a race in StayRTR that results in missing Error Response PDUs because the RTR connection is closed before the message is sent.

I noticed this while working on RTR ASPA support. Since my OpenBGPD opens the connection with version 2 an Error Response with Code 4 should be sent by StayRTR but most of the time the connection is closed before this happens.

Unexpected Exchange:

rtr stayRTR: RTR RetryTimer triggered
rtr stayRTR: state change idle -> idle, reason: connection open
rtr stayRTR: state change idle -> closed, reason: connection closed

Expected exchange which happens from time to time:

rtr stayRTR: RTR RetryTimer triggered
rtr stayRTR: state change idle -> idle, reason: connection open
rtr stayRTR: received error: Unsupported Protocol Version: Bad protocol version
rtr stayRTR: state change idle -> closed, reason: connection closed with reset
benjojo commented 1 year ago

What is the fastest way to reproduce this? what PDU did you send?

cjeker commented 1 year ago

Open connection send either Reset Query or Serial Query with a version == 2.

StayRTR logs:

INFO[6099] Accepted tcp connection from 127.0.0.1:43145 (1/0) 
DEBU[6099] 127.0.0.1:43145 (v0) / Serial: 0: has bad version (received: v2, current: v0) error 
INFO[6099] Disconnecting client 127.0.0.1:43145 (v0) / Serial: 0 
DEBU[6099] 127.0.0.1:43145 (v0) / Serial: 0: Received PDU Reset Query v2 
DEBU[6099] 127.0.0.1:43145 (v0) / Serial: 0 > Request Cache 
benjojo commented 1 year ago

@cjeker Can you test on HEAD/Master to see if the commit 8a3a71e045c38e0a4e8cf70794900190ca9f9ce3 fixes this issue?

cjeker commented 1 year ago

I changed my OpenBGPD to run with RTR version 10. The exchange works and correctly goes down all the way to 2.

INFO[0004] Accepted tcp connection from 127.0.0.1:14260 (1/0) 
DEBU[0004] 127.0.0.1:14260 (v0) / Serial: 0: has bad version (received: v4, current: v0) error 
INFO[0004] Disconnecting client 127.0.0.1:14260 (v0) / Serial: 0 
DEBU[0004] 127.0.0.1:14260 (v0) / Serial: 0: Received PDU Reset Query v4 
DEBU[0004] 127.0.0.1:14260 (v0) / Serial: 0 > Request Cache 
DEBU[0004] 127.0.0.1:14260 (v0) / Serial: 0 < No data   
INFO[0004] Accepted tcp connection from 127.0.0.1:13218 (1/0) 
DEBU[0004] 127.0.0.1:13218 (v0) / Serial: 0: has bad version (received: v3, current: v0) error 
INFO[0004] Disconnecting client 127.0.0.1:13218 (v0) / Serial: 0 
DEBU[0004] 127.0.0.1:13218 (v0) / Serial: 0: Received PDU Reset Query v3 
DEBU[0004] 127.0.0.1:13218 (v0) / Serial: 0 > Request Cache 
DEBU[0004] 127.0.0.1:13218 (v0) / Serial: 0 < No data   
INFO[0004] Accepted tcp connection from 127.0.0.1:28853 (1/0) 
DEBU[0004] 127.0.0.1:28853 (v2) / Serial: 0: Received PDU Reset Query v2 
DEBU[0004] 127.0.0.1:28853 (v2) / Serial: 0 > Request Cache 
DEBU[0004] 127.0.0.1:28853 (v2) / Serial: 0 < No data   

Thanks for the quick fix. RTR Version negotiation is one big cluster f*.