Closed floatingstatic closed 6 months ago
On the surface this looks fine. I would, however, want to see two stayrtr instances (old and this PR) running head to head for a few days with rtrmon comparing the two, just to make sure :)
Very nice analysis and a clear change!
That is probably the best way to be sure it is functionally identical. This way you can check that the VRPs converge after a set delay. The relevant metric is vrp_diff with a threshold that is high enough for stayrtr-current and stayrtr-thispr to update.
For clarification, is this an experiment you need me to run or are you doing this?
I would prefer it if you ran it (I likely won't have time until 2024). Do you have a working setup with prometheus?
If we agree that convergence should be within 256s, I would check the history for violations of that property
max_over_time(vrp_diff{visibility_seconds="256"}[1h]) > 0
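To make the intent of that query concrete, here is a minimal Go sketch of the same check applied to an in-memory series: take the maximum of a gauge over a trailing window and flag it if it is above zero. The `Sample` type, the function names, and the window boundary handling are assumptions made for illustration; this is a simplified model, not Prometheus's or rtrmon's implementation.

```go
package main

import "fmt"

// Sample is one scraped value of a gauge such as vrp_diff.
// The type and its field names are illustrative only.
type Sample struct {
	T int64   // Unix timestamp in seconds
	V float64 // gauge value at that time
}

// maxOverTime returns the largest value among samples with
// now-window < T <= now, and false if no sample falls in that range.
func maxOverTime(samples []Sample, now, window int64) (float64, bool) {
	found := false
	var best float64
	for _, s := range samples {
		if s.T <= now-window || s.T > now {
			continue // outside the trailing window
		}
		if !found || s.V > best {
			best, found = s.V, true
		}
	}
	return best, found
}

// violated reports whether the gauge was above zero at any sample in the window,
// which is the condition the "> 0" comparison in the query selects.
func violated(samples []Sample, now, window int64) bool {
	m, ok := maxOverTime(samples, now, window)
	return ok && m > 0
}

func main() {
	series := []Sample{{T: 0, V: 0}, {T: 1800, V: 3}, {T: 3600, V: 0}, {T: 7200, V: 0}}
	fmt.Println("violation in hour ending at 3600:", violated(series, 3600, 3600))
	fmt.Println("violation in hour ending at 7200:", violated(series, 7200, 3600))
}
```

A series that stays at 0 for the chosen visibility threshold never trips this check, which is the "flat line" property being tested here.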
OK, hopefully I did this right. I ran something like this on a Linux host with binaries built from my branch:
./stayrtr -bind 127.0.0.1:8282 -cache https://console.rpki-client.org/vrps.json -metrics.addr 127.0.0.1:9847
./rtrmon -addr 0.0.0.0:9866 -primary.host tcp://127.0.0.1:8282 -secondary.host https://console.rpki-client.org/vrps.json
Output looks like the below in prom. I do not see anything with label visibility_seconds="256", not sure if unexpected:
For what it's worth, running a brief test using binaries built from upstream looks the same. Anyway, I will leave this running for a couple of days and report back. Thanks!
Those labels should be there eventually. Let's see!
This not only blocks #105 but also additional work to make the delta handling for ASPA correct. It would be nice if this could get some priority to unblock this work.
@ties I have had 2 instances of stayrtr running along with 2 instances of rtrmon in my local dev environment for a couple of days. There are 2 jobs, rtrmon and rtrmon-upstream, which cover binaries from my forked branch and current upstream respectively. They were not started at exactly the same time so there are some minor deltas. The output looks as follows:
I'm not sure exactly what you are looking for here, happy to share any additional views you may want to see here.
Thanks for doing this!
I hoped to see a flat line, as in "the worst case divergence converges within 256s". Can you try the following case?
max_over_time(vrp_diff{visibility_seconds="851"}[1h]) > 0
and check if that is flat at 0 (if not, 851, 1024, 1706, 3411 are also possible). Alternatively, rtrmon can compare two stayrtrs, but reading the same JSON should also work if the JSON is from one source of truth.
I just realised the default refresh interval is 600s; in that case, 851 or 1024 are the first values that are likely to converge.
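The reasoning above (the first threshold that exceeds the refresh interval is the first one likely to converge) can be sketched in Go as follows. The threshold list is just the set of visibility_seconds values mentioned in this thread; whether rtrmon exposes exactly these values is an assumption, and `firstThresholdAbove` is a hypothetical helper, not part of rtrmon.

```go
package main

import "fmt"

// thresholds holds the visibility_seconds values mentioned in this thread.
// Treat the list as an assumption; it is not read from rtrmon.
var thresholds = []int{256, 851, 1024, 1706, 3411}

// firstThresholdAbove returns the smallest listed threshold that is strictly
// larger than the refresh interval, and false if none is large enough.
func firstThresholdAbove(refreshSeconds int) (int, bool) {
	for _, t := range thresholds {
		if t > refreshSeconds {
			return t, true
		}
	}
	return 0, false
}

func main() {
	for _, refresh := range []int{60, 600} {
		if t, ok := firstThresholdAbove(refresh); ok {
			fmt.Printf("refresh=%ds: first candidate visibility_seconds=%d\n", refresh, t)
		}
	}
}
```

With the default 600s refresh this picks 851, and with a 60s refresh it picks 256, which is consistent with the thresholds discussed in this thread.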
I am also collecting data myself now, I should be able to give an update tomorrow.
I'm starting to wonder if this isn't the best way to check this or if I am doing something wrong. Yes, I am using default timers for everything but both upstream and my fork are showing a non-zero delta so either the testing methodology is incorrect or I'm doing something wrong here:
For what it's worth, the upstream deltas look "worse" to my eye.
So it sounds like you suggest perhaps running two stayrtr's (fork and master branch) and configure rtrmon to diff those two instead? I can give that a try and report back.
that seems helpful indeed - before merging
> I'm starting to wonder if this isn't the best way to check this or if I am doing something wrong. Yes, I am using default timers for everything but both upstream and my fork are showing a non-zero delta so either the testing methodology is incorrect or I'm doing something wrong here:
Or the system does not converge as fast as I thought it would. I mostly run these tools in a monitoring role with (much) lower timers than you would want to run in real life.
> So it sounds like you suggest perhaps running two stayrtr's (fork and master branch) and configure rtrmon to diff those two instead? I can give that a try and report back.
That would be the comparative test that I think will work (I'm writing this before I looked at my data). For data collection I used this setup:
# build the main branch + this PR and tag them, i.e.
# docker build . -t stayrtr-structoptimize --target stayrtr
services:
  stayrtr-master:
    image: stayrtr-master
    command:
      - -refresh
      - "60"
      - -cache
      - https://rpki-validator.ripe.net/json
  stayrtr-structoptimize:
    image: stayrtr-structoptimize
    command:
      - -refresh
      - "60"
      - -cache
      - https://rpki-validator.ripe.net/json
      # - https://console.rpki-client.org/rpki.json
  rtrmon:
    image: rpki/rtrmon
    restart: unless-stopped
    ports:
      - 9866:9866
    command:
      - -primary.host
      - tcp://stayrtr-master:8282
      - -secondary.host
      - tcp://stayrtr-structoptimize:8282
My reason for picking rpki-validator.ripe.net/json is that it has a very frequent update interval.
Now, looking at the data, we see that the two instances were converged at the 1024s threshold at all times:
If we look at all the points in time, there were tiny differences that recovered (more aggressive timers might help there).
I forgot to capture the metrics directly from stayrtr so I do not know the comparative memory usage. But I trust your statistics on that.
It looks good to me. LGTM!
I'm evaluating stayrtr in a memory constrained environment. The memory footprint of stayrtr could be optimized a bit. It seems the bulk of the memory is allocated during updates but there are also some things that could be modified to reduce memory usage during steady state. To do this I'm proposing the following changes:
Move from Go 1.17 to 1.21.
Use netip.Prefix instead of net.IPNet in a few structs to reduce field sizes (requires at least Go 1.18, which is why we change the base Go version in go.mod). The bulk of the lines changed in this PR relate to this change but generally do not change much of the logic. We simply swap in the equivalent functions from netip in place of those from net.
Reordering fields in the VRP struct to reduce the size of the struct from 64 bytes to 40 bytes.
Current:
New:
PDUIPv4Prefix and PDUIPv6Prefix structs from 72 bytes to 40 bytes.
Current:
New:
ComputeDiff() except in tests, as these are not needed other than for debug logging. We also modify ConvertSDListToMap() to only return a map with values containing previous flags instead of the entire struct. As best I can tell we only reference the value of this map in one place and only to get previous flag values.

Doing a simple side-by-side test with stayrtr started at roughly the same time on two hosts, we can compare the current master branch (yellow) memory RSS vs my forked branch (green):

For what it's worth, it seems most of the memory allocation occurs when updating VRPs. It appears the bulk of this is related to copying data, which I haven't found a clean workaround for but may be something that could be addressed in a future PR. A view of this from pprof showing lots of this from VRP.Copy()
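As a rough illustration of the net.IPNet to netip.Prefix swap listed in the changes above, the following Go sketch compares the in-struct size of the two types and checks that both packages agree on the same CIDR. The helper names `prefixSizes` and `sameNetwork` are hypothetical and are not taken from the stayrtr code. Exact sizes depend on architecture and Go version, so the sketch prints them rather than assuming them.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
	"unsafe"
)

// prefixSizes returns the in-struct sizes of net.IPNet and netip.Prefix.
// net.IPNet additionally points at heap-allocated byte slices, which
// unsafe.Sizeof does not count.
func prefixSizes() (ipNet, prefix uintptr) {
	return unsafe.Sizeof(net.IPNet{}), unsafe.Sizeof(netip.Prefix{})
}

// sameNetwork parses a CIDR string with both packages and reports whether
// they agree on the network address and the prefix length.
func sameNetwork(cidr string) (bool, error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return false, err
	}
	pfx, err := netip.ParsePrefix(cidr)
	if err != nil {
		return false, err
	}
	ones, _ := ipnet.Mask.Size()
	addr, ok := netip.AddrFromSlice(ipnet.IP)
	if !ok {
		return false, fmt.Errorf("unexpected IP length %d", len(ipnet.IP))
	}
	// net.ParseCIDR masks host bits; Masked() does the same on the netip side.
	return addr.Unmap() == pfx.Masked().Addr() && ones == pfx.Bits(), nil
}

func main() {
	a, b := prefixSizes()
	fmt.Printf("net.IPNet: %d bytes, netip.Prefix: %d bytes\n", a, b)
	for _, cidr := range []string{"192.0.2.0/24", "2001:db8::/32"} {
		ok, err := sameNetwork(cidr)
		fmt.Println(cidr, ok, err)
	}
}
```

Because netip.Prefix is a comparable value type with no slice fields, copying it does not duplicate any heap-backed byte slices, which is relevant to the copy-heavy update path described above.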
Beyond ensuring all test cases pass I have also run integration tests with BIRD 2.14 and confirm that RTR functionality continues to function as before with these changes.