Closed jwbensley closed 2 years ago
Quick question, why not make this feature available via -S
command line argument? Did you see a clash in functionality of sorts?
Quick question, why not make this feature available via
-S
command line argument? Did you see a clash in functionality of sorts?
I didn't want to change existing behavior in case it breaks some existing automation for some users.
Great work!
I have one question:
Shouldn't this behavior be the new default as opposed to make this a non-default optional?
The output of bgpq4 AS-MCKAY
changes before/after this PR only if the as-set contains references in the SOURCE::
format. It would not change the behavior otherwise. Of course bgpq4 RIPE::AS-MCKAY
behavior would be different.
I think changing the default behavior into something more secure and more inline what we would actually expect (honoring the SOURCE:: format) would a good thing.
I'm actually more afraid what the non-zero exit codes and stderr output does for complex automation in a lot of cases (the example with ./bgpq4 -o RADB::AS-GOOGLE | grep -v "prefix"
).
I think there are two different goals here:
SOURCE::
In my opinion those are two different goals and while related, I would rather not make one big change and put them behind the same optional flag.
I'd make goal 1 default, frankly. However I'd certainly not marry goal 2 with goal 1, because it would hamper the adoption of goal 1.
If we do this, people will either:
uuh, the new "improvement" in bgpq4 triggers the pagers at 02 AM, no time, rolling back ...
)I mean here are a few networks (using peeringdb's IRR as-set/route-set), some bigger, some smaller, even some content that trigger errors. And I did not search long at all:
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -o RADB::AS-HURRICANE 2>&1 >/dev/null | grep "ERROR" | wc -l
232
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -o AS-CORE-BACKBONE 2>&1 >/dev/null | grep "ERROR" | wc -l
1641
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -o RIPE::AS41327:AS-CUSTOMERS 2>&1 >/dev/null | grep "ERROR" | wc -l
601
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -o AS-MEER 2>&1 >/dev/null | grep "ERROR" | wc -l
813
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -o AS-AKAMAI 2>&1 >/dev/null | grep "ERROR" | wc -l
4
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -o AS-CLOUDFLARE 2>&1 >/dev/null | grep "ERROR" | wc -l
7
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -o AS-MICROSOFT 2>&1 >/dev/null | grep "ERROR" | wc -l
2
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -o AS-PROLE 2>&1 >/dev/null | grep "ERROR" | wc -l
11
lukas@dev:~$
With IPv6 it gets a lot worse (which is pretty much expected since not every IPv4 network is running an IPv6 network):
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -6 -o AS-AKAMAI 2>&1 >/dev/null | grep "ERROR" | wc -l
5
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -6 -o AS-CLOUDFLARE 2>&1 >/dev/null | grep "ERROR" | wc -l
389
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -6 -o AS-MICROSOFT 2>&1 >/dev/null | grep "ERROR" | wc -l
11
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -6 -o AS-PROLE 2>&1 >/dev/null | grep "ERROR" | wc -l
777
lukas@dev:~$
So the amount of errors in real life on an typical Internet exchange, even for as-set's of content networks (I'm not even talking the obvious worse case: large Tier 2 networks) would just be mind boggling to the point where there is no way other then to dump it to /dev/null.
At least for now.
Do we want to actually output errors to stderr to improve RPSL accuracy? Probably, but that's a real long term goal, while honoring SOURCE::
should be a short time goal.
So the discussion should probably be must goal 1 be optional or can we just honor SOURCE::
always. What is the use-case for not honoring it (other than not getting bombarded with thousands of errors due to goal 2)?
small technical nitpicks:
@lukastribus Thanks for the in-depth review!
Regarding two goals
So yeah I had a similar thought to you, there are kind of two goals in this PR which could be separated. The reason I didn't split them is because when specifying a specific data source for an AS-SET or route set, and that data source doesn't exist, then we need to throw and error and return non-zero exist status. What would be the point in being able to specify a data source for an object if we don't throw an error when the data source doesn't exist?
Also, I can foresee quite a bit of discussion. Should we all just jump on a Teams call / Zoom meeting /
Regarding making -o the default behavior
I already mentioned that I didn't want to break existing implementations due to the change in exit status codes. But there is another reason I forgot to mention. If you re-read this comment, bgpq4 normally uses a recursive !a
query or !i,1
query to irrd. In order to allow every single member in an AS-SET to specify it's own source, bgpq4 has to send an !s
query + a non-recursive !i
query for every member in the AS-SET or route set. So in the most extreme case (if everyone used "::" notation in the IRR DBs) this would doubling the number of operations we send to the irrd server, which is "inconsiderate" perhaps? The idea of this PR is that we have a live issue that is affecting daily operations right now, this PR helps us resolve that issue now, and anyone with the same issue could use -o
, but in the long term we should also raise a PR for irrd to add a new query type which will honour the source "::" of each member in a recursive query (!a
or !i,1
).
(We are discussing writing/raising this PR for irrd internally, and then one for bgpq4 to use that new query type, but it would be in a few weeks/or couple of months, and we need a fix sooner, hence this bgpq4 PR).
What would be the point in being able to specify a data source for an object if we don't throw an error when the data source doesn't exist?
I think the point is to get data from the correct SOURCE::
, avoiding compromised filters and therefor avoid things like the AS-GOOGLE
incident which is basically a Denial of Service.
Unresolved dependencies on the other hand is a different beast and is not new at all. bgpq3/4 always ignored unresolved dependencies #1, because that is how filters are created (use correct data to permit, and reject everything else). Honoring SOURCE::
doesn't change that, SOURCE can already be restricted today and unresolved dependencies already exist today (a helluvalot actually). I think it is important to have this discussion, I just suggest to have this discussion separately and unrelated to the SOURCE::
change. I don't think there is a security issue or a new attack that is solved by handling this differently.
lukas@dev:~$ bgpq4/bgpq4 -S RIPE AS-HURRICANE
no ip prefix-list NN
! generated prefix-list NN is empty
ip prefix-list NN deny 0.0.0.0/0
lukas@dev:~$ echo $?
0
lukas@dev:~$
What is the operational consequence here? What is the operator supposed to do specifically with the 778 errors from generating v6 filters for Akamai Prolexic, 389 errors for Cloudflare, or over 9000 errors for HE?
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -6 -o AS-PROLE 2>&1 >/dev/null | grep "ERROR" | wc -l
778
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -6 -o AS-CLOUDFLARE 2>&1 >/dev/null | grep "ERROR" | wc -l
389
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -6 -o RADB::AS-HURRICANE 2>&1 >/dev/null | grep "ERROR" | wc -l
9312
lukas@dev:~$
I think this error output belongs to verbose mode.
Unresolved dependencies are very normal in RPSL, we would basically return non-zero exit codes and stderr all the time.
Should the operator ignore the filter output on stdout if bgpq4 returns non-zero? Probably, but then filter creation would not even work for FAANG companies...
Now we have different exit codes, based on whether we specify -o
, which imho should be orthogonal to the SOURCE::
handling.
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -S RIPE AS-HURRICANE
ERROR:Key not found expanding !a4AS-HURRICANE
no ip prefix-list NN
! generated prefix-list NN is empty
ip prefix-list NN deny 0.0.0.0/0
lukas@dev:~$ echo $?
0
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -S RIPE -o AS-HURRICANE
ERROR:Key not found expanding !iAS-HURRICANE
no ip prefix-list NN
! generated prefix-list NN is empty
ip prefix-list NN deny 0.0.0.0/0
lukas@dev:~$ echo $?
1
lukas@dev:~$
I just don't see how this would actually work in practice, without ignoring stderr and return code.
I'd like to talk about how operators actually use return codes and stderr output in real life scenarios on an Internet Exchanges while peering with different networks. What do we do with tens of thousands of errors on every bgpq4 run? What do we do with basically omni present non-zero return codes?
this would doubling the number of operations we send to the irrd server, which is "inconsiderate" perhaps
Thanks for clarifying. I'd predict a very gradual load increase though, considering that new codes paths in bgpq4 and especially SOURCE:
: references in as-set are not something that ship worldwide at the same time and with 100% coverage, but maybe this should be discussed with IRR operators.
I had a phone call with James, we agreed to make ::
trick available via the default UI path
I am traveling this weekend so I will refactor the PR then.
Just to confirm:
::
behavior default for the object entered on the CLI.::
because next step on my to-do list is to approach the RIPE DB WG chairs and discuss what it would take to add support for the ::
notation to the members field on an AS-SET in the DB. Shall I leave this in or take it out? I put it in originally to save me having to add it later.Ok thanks.
Just one request: can you fix the double newline after each error in stderr? Those empty lines in stderr are not pretty.
lukas@dev:~$ jwbensley-bgpq4/bgpq4 -6 -o AS-MICROSOFT >/dev/null
ERROR:Key not found expanding !6as8069
ERROR:Key not found expanding !6as8073
ERROR:Key not found expanding !6as20046
ERROR:Key not found expanding !6as20049
ERROR:Key not found expanding !6as20366
ERROR:Key not found expanding !6as23468
ERROR:Key not found expanding !6as30427
ERROR:Key not found expanding !6as62540
ERROR:Key not found expanding !6as132406
ERROR:Key not found expanding !6as197612
ERROR:Key not found expanding !6as198097
lukas@dev:~$
Thank you James!
@lukastribus Just one request: can you fix the double newline after each error in stderr?
yeah sorry I didn't mention that, for me it was implicit that I would fix it ;)
To save re-reading the whole thread, this is the current state of the PR:
It would be trivial at a later date to add support for the "::" notation for ASNs too, so that an AS-SET can have members like "RIPE:AS1122334455". First we need to get the syntax of the IRR DB "members" field updated to support the "::" notation (that's next on my to-do list).
It seems there is a bug for hierarchical as-set names such as AS15562:AS-SNIJDERS
:
feather$ ./bgpq4 -d RIPE::AS15562:AS-SNIJDERS
DEBUG: expander.c:1065 bgpq_expand Acquired sendbuf of 2097152 bytes
DEBUG: expander.c:1086 bgpq_expand Sending '!!' to server to request for the connection to remain open
DEBUG: expander.c:1096 bgpq_expand b->identify: Sending '!n bgpq4 1.6' to server.
DEBUG: expander.c:1109 bgpq_expand Got answer C
DEBUG: expander.c:1126 bgpq_expand Testing support for A queries
DEBUG: expander.c:1136 bgpq_expand Server supports A query
DEBUG: expander.c:446 bgpq_get_irrd_sources Requesting source list !s-lc
DEBUG: expander.c:457 bgpq_get_irrd_sources Got answer A138
NTTCOM,INTERNAL,LACNIC,RADB,RIPE,RIPE-NONAUTH,ALTDB,BELL,LEVEL3,APNIC,JPIRR,ARIN,BBOI,TC,AFRINIC,IDNIC,ARIN-WHOIS,RPKI,REGISTROBR,CANARIE
C
DEBUG: expander.c:548 bgpq_pipeline expander: sending !sRIPE
DEBUG: expander.c:548 bgpq_pipeline expander: sending !iAS-SNIJDERS
DEBUG: expander.c:548 bgpq_pipeline expander: sending !sNTTCOM,INTERNAL,LACNIC,RADB,RIPE,RIPE-NONAUTH,ALTDB,BELL,LEVEL3,APNIC,JPIRR,ARIN,BBOI,TC,AFRINIC,IDNIC,ARIN-WHOIS,RPKI,REGISTROBR,CANARIE
DEBUG: expander.c:829 bgpq_read No data expanding !sRIPE
DEBUG: expander.c:548 bgpq_pipeline expander: sending !sNTTCOM,INTERNAL,LACNIC,RADB,RIPE,RIPE-NONAUTH,ALTDB,BELL,LEVEL3,APNIC,JPIRR,ARIN,BBOI,TC,AFRINIC,IDNIC,ARIN-WHOIS,RPKI,REGISTROBR,CANARIE
DEBUG: expander.c:548 bgpq_pipeline expander: sending !iAS-SNIJDERS
DEBUG: expander.c:829 bgpq_read No data expanding !sNTTCOM,INTERNAL,LACNIC,RADB,RIPE,RIPE-NONAUTH,ALTDB,BELL,LEVEL3,APNIC,JPIRR,ARIN,BBOI,TC,AFRINIC,IDNIC,ARIN-WHOIS,RPKI,REGISTROBR,CANARIE
DEBUG: expander.c:829 bgpq_read No data expanding !sNTTCOM,INTERNAL,LACNIC,RADB,RIPE,RIPE-NONAUTH,ALTDB,BELL,LEVEL3,APNIC,JPIRR,ARIN,BBOI,TC,AFRINIC,IDNIC,ARIN-WHOIS,RPKI,REGISTROBR,CANARIE
no ip prefix-list NN
! generated prefix-list NN is empty
ip prefix-list NN deny 0.0.0.0/0
@job hmm, this was working for me, I will test again.
Fixed now...
These are all working for me (please check for your self):
./bgpq4 AS12654
./bgpq4 AS12654:RS-RIS
./bgpq4 RIPE::AS12654:RS-RIS
./bgpq4 AS15562
./bgpq4 AS-SNIJDERS
./bgpq4 AS15562:AS-SNIJDERS
./bgpq4 RIPE::AS15562:AS-SNIJDERS
./bgpq4 RIPE::AS12654
is not supported yes as I mentioned further up in the thread, but it could be added in a subsequent PR easily enough.
I think we are almost there, there seems to be a difference, compare:
feather$ ./bgpq4 RADB::AS-AMAZON | sha256
ERROR:Key not found expanding !gas10124
ERROR:Key not found expanding !gas17493
ERROR:Key not found expanding !gas39111
099329ac8b0c6880addaba336c7411e866c0a4fc7e0c9036bb336696624aa180
with
feather$ ./bgpq4 AS-AMAZON | sha256
099329ac8b0c6880addaba336c7411e866c0a4fc7e0c9036bb336696624aa180
is this expected or desirable?
This was discussed further up this thread, but this thread is getting a bit long so I'll recap here for you:
The SHA256 hash is confirming that the stdout output is the same (the generated prefix list). I generated a bunch of different prefixes with the original bgpq4 and this PR version and diff
'ed them and found no differences too.
The difference in your example output is that currently bgpq4 doesn't produce any error messages to stderr when there are problems, so I added some error message which should have already been there (in my opinion):
OK, but why are there different error messages when using this PR version of bgpq4 against the same AS-SET: /bgpq4 RADB::AS-AMAZON
vs ./bgpq4 AS-AMAZON
?
When I run the current/published version $ ./bgpq4 -4 AS-AMAZON
, bgpq4 has a hard coded exit status of 0
and no error messages are displayed. This leads the user to falsely believe that they generated a full prefix list for all ASNs in the AS-SET tree for AS-AMAZON. This is not the case though. There are currently errors occurring, this could be for a number of different reasons (it could be that an AS-SET contains a member that doesn't exist, or the member name contains a typo, or it could be a problem with IRRd missing some data).
In both the existing version of bgpq4 and this PR version of bgpq4, when NOT using the ::
notation a single recursive !a
query is run or a single recursive !i,1
query, and IRRd provides a single response with the entire tree fully expanded. There may have been errors during that expansion process, but IRRd provides no indication of this, it returned "something", even if it was partial, so that is considered a success.
When using the ::
notation in the PR version, we can't use a recursive query, which means a !i
query is run for every member in the tree. This allows IRRd to provide "feedback", that it couldn't expand the single member being queried, which provides bgpq4 the opportunity to then log something to stderr, so that the user knows there was a problem and can look into fixing the issue.
These added error messages statements could be...
What do you prefer?
Main Feature
This PR adds a new CLI option (
-o
) to specify that only specific data sources should be used for AS-SET and Route-Set expansion. Those data sources are specified using the "::" notation e.g., "RADB::AS-GOOGLE". This is to resolve #5, specifically this implements "Part 1" of the idea mentioned in this comment (a separate PR could follow at a later date for "Part 2").A user may already specify one or more data sources using the
-S
option. In the case that-S
is used, the given data sources are saved as the default data source. If-S
isn't used, the irrd server is queried for it's list of sources and these are saved as the default data source.When using the new
-o
option e.g.,bgpq4 -o RIPE::AS-MCKAY
with no-S
option, RIPE is set to the data source in this example, then a query is sent to irrd to perform a non-recursive expansion of AS-MCKAY, and then for each member of the AS-SET, one of two things happens;This continues down the tree, honouring any sources found along the way using the "::" syntax, and using the default data sources when no specific source is given.
Other stuff
There are quite a few things that could be improve but opening a massive PR which changes lots of things isn't helpful. These are the other things this PR fixes "along the way" in order to get the main feature working nicely:
There are missing
close()
calls for file descriptors used, I have added someclose(fd)
calls.In the following places the program stops what it is doing and returns or exits without any error message:
https://github.com/bgp/bgpq4/blob/main/expander.c#L200 https://github.com/bgp/bgpq4/blob/main/expander.c#L203 https://github.com/bgp/bgpq4/blob/main/expander.c#L734 https://github.com/bgp/bgpq4/blob/main/expander.c#L844 https://github.com/bgp/bgpq4/blob/main/expander.c#L850
I have added error messages to these locations so that they aren't silent errors.
err()
when there is an error or issue, in others it callsexit()
, and in other places it callssx_report(SX_FATAL, ...)
and thenexit(1)
which is confusing becausesx_report(SX_FATAL, ...)
callsexit(-1)
so the extra exit with a different return code is confusing to debug:https://github.com/bgp/bgpq4/blob/main/expander.c#L402 https://github.com/bgp/bgpq4/blob/main/expander.c#L745 https://github.com/bgp/bgpq4/blob/main/expander.c#L811 https://github.com/bgp/bgpq4/blob/main/expander.c#L814
I have removed
exit()
calls after a fatal error.Inside expander.c the following functions return 1 on success and 0 on failure/error:
These two functions return 0 on success and failure:
The are the two worker functions in expander.c, the former for pipeline mode and later for non-pipeline mode. I have changed these last two functions to return 1 on success and 0 on failure/error when there is an error inside these function, so that there is consistent success/error status codes across all functions in expander.c.
At no point where any of the above functions are called, is the return status actually checked. Also the status of the callback functions used by
bgpq_read()
andbgpq_expand_irrd()
isn't checked. Also, the "main" function in expander.c calledint bgpq_expand()
is hard coded to return 1. bgpq4 checks the return status of this function inmain()
, but because the function returns 1 (and for this function 1 means "success"), bgpq4 always exits with exit-status 0 even though there are in fact errors occurring (unless the error is right at the start like parsing the CLI args/opts).I have added checks to the calls used by the new
-o
option in expander.c, so that the return status is passed through each function to the final exit status of bgpq4 (from callback function -> tobgpq_read()
/bgpq_expand_irrd()
-> tobgpq_expand()
-> tomain()
.The combination of points 2, 3 and 4 above is that now when I run
bgpq4 -o RADB::AS-GOOGLE
, bgpq4 generates as much of the prefix list as it can, but due to some missing data in irrd an error is printed and I have a non-zero exit status. So I can bring up a peering session with AS-GOOGLE AND I know I need to tell them they have a problem with their IRR data.Without this, which is how bgpq4 currently operates, users are not told that a specific source wasn't valid for example. This is an example: https://github.com/bgp/bgpq4/blob/main/expander.c#L683
AS-MCKAY is in RIPE, it contains an AS-SET in ARIN though, so prefixes are missing from this prefix set, but the exist status is 0. I know this is kind of by design, but I think it's a bit miss-leading for the user, there should at least be an error to say "one of the members of AS-MCKAY was not found in the the RIPE data".
This is from my PR:
This hasn't changed the default behaviour of bgpq4, but when using the new
-o
option an error message is at least shown to the user and the exit status is non-zero so if they are using bgpq4 in an automated way and logging the output, they can check their logs to at least get a clue that something needs a deeper look.Testing
I've tried to test this as thoroughly as I can on the CLI with all different combinations. I'm happy it's working. We are now loading this binary onto our prefix generation server, which generates prefix lists towards all our peers and customers every night, so we will also do some production testing too.
Example: Generate the prefix-list for a specific peer and specify the source for the peers' AS-SET or Route-Set (prefix-list excluded with grep for brevity). Here we can see there are errors, but the prefix list is generated for everything that is possible (just as before), the difference being that AS-GOOGLE was specifically picked from RADB, but downstream members used the irrd servers default data sources (which is all data sources):
Example: Thrown an error when the AS-SET isn't available from the specified data source:
Example: Specify the data sources to use for all member objects, the AS-SET specified on the CLI which will only use RIPE: