bgp / stayrtr

RPKI-To-Router server implementation in Go
BSD 3-Clause "New" or "Revised" License

Optimize VRP struct sizes. Use netip.Prefix instead of net.IPNet. #103

Closed: floatingstatic closed this 6 months ago

floatingstatic commented 7 months ago

I'm evaluating stayrtr in a memory-constrained environment, and its memory footprint could be reduced somewhat. The bulk of the memory appears to be allocated during updates, but some steady-state usage can be trimmed as well. To do this I'm proposing the following changes (a sketch of the struct change follows the layouts below):

Current:

$ structlayout . VRP
VRP.Prefix.IP net.IP: 0-24 (size 24, align 8)
VRP.Prefix.Mask net.IPMask: 24-48 (size 24, align 8)
VRP.MaxLen uint8: 48-49 (size 1, align 1)
padding: 49-52 (size 3, align 0)
VRP.ASN uint32: 52-56 (size 4, align 4)
VRP.Flags uint8: 56-57 (size 1, align 1)
padding: 57-64 (size 7, align 0)

New:

$ structlayout . VRP
VRP.Prefix.ip.addr.hi uint64: 0-8 (size 8, align 8)
VRP.Prefix.ip.addr.lo uint64: 8-16 (size 8, align 8)
VRP.Prefix.ip.z *internal/intern.Value: 16-24 (size 8, align 8)
VRP.Prefix.bitsPlusOne uint8: 24-25 (size 1, align 1)
padding: 25-32 (size 7, align 0)
VRP.ASN uint32: 32-36 (size 4, align 4)
VRP.MaxLen uint8: 36-37 (size 1, align 1)
VRP.Flags uint8: 37-38 (size 1, align 1)
padding: 38-40 (size 2, align 0)
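
Concretely, the VRP change these layouts imply looks roughly like the following sketch (field order taken from the structlayout output above; the actual code in the PR may differ):

package rtr // hypothetical package name, for this sketch only

import (
	"net"
	"net/netip"
)

// Before: net.IPNet embeds two 24-byte slice headers (IP and Mask),
// each pointing at a separate heap allocation, and the field order
// costs 10 bytes of padding: 64 bytes per VRP plus the slice data.
type VRPBefore struct {
	Prefix net.IPNet
	MaxLen uint8
	ASN    uint32
	Flags  uint8
}

// After: netip.Prefix is an immutable 32-byte value with no extra
// heap storage, and grouping the small fields behind it leaves only
// 2 bytes of tail padding: 40 bytes per VRP, all stored inline.
type VRPAfter struct {
	Prefix netip.Prefix
	ASN    uint32
	MaxLen uint8
	Flags  uint8
}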

Current:

$ structlayout . PDUIPv4Prefix
PDUIPv4Prefix.Version uint8: 0-1 (size 1, align 1)
padding: 1-8 (size 7, align 0)
PDUIPv4Prefix.Prefix.IP net.IP: 8-32 (size 24, align 8)
PDUIPv4Prefix.Prefix.Mask net.IPMask: 32-56 (size 24, align 8)
PDUIPv4Prefix.MaxLen uint8: 56-57 (size 1, align 1)
padding: 57-60 (size 3, align 0)
PDUIPv4Prefix.ASN uint32: 60-64 (size 4, align 4)
PDUIPv4Prefix.Flags uint8: 64-65 (size 1, align 1)
padding: 65-72 (size 7, align 0)

$ structlayout . PDUIPv6Prefix
PDUIPv6Prefix.Version uint8: 0-1 (size 1, align 1)
padding: 1-8 (size 7, align 0)
PDUIPv6Prefix.Prefix.IP net.IP: 8-32 (size 24, align 8)
PDUIPv6Prefix.Prefix.Mask net.IPMask: 32-56 (size 24, align 8)
PDUIPv6Prefix.MaxLen uint8: 56-57 (size 1, align 1)
padding: 57-60 (size 3, align 0)
PDUIPv6Prefix.ASN uint32: 60-64 (size 4, align 4)
PDUIPv6Prefix.Flags uint8: 64-65 (size 1, align 1)
padding: 65-72 (size 7, align 0)

New:

$ structlayout . PDUIPv4Prefix
PDUIPv4Prefix.Prefix.ip.addr.hi uint64: 0-8 (size 8, align 8)
PDUIPv4Prefix.Prefix.ip.addr.lo uint64: 8-16 (size 8, align 8)
PDUIPv4Prefix.Prefix.ip.z *internal/intern.Value: 16-24 (size 8, align 8)
PDUIPv4Prefix.Prefix.bitsPlusOne uint8: 24-25 (size 1, align 1)
padding: 25-32 (size 7, align 0)
PDUIPv4Prefix.ASN uint32: 32-36 (size 4, align 4)
PDUIPv4Prefix.Version uint8: 36-37 (size 1, align 1)
PDUIPv4Prefix.MaxLen uint8: 37-38 (size 1, align 1)
PDUIPv4Prefix.Flags uint8: 38-39 (size 1, align 1)
padding: 39-40 (size 1, align 0)

$ structlayout . PDUIPv6Prefix
PDUIPv6Prefix.Prefix.ip.addr.hi uint64: 0-8 (size 8, align 8)
PDUIPv6Prefix.Prefix.ip.addr.lo uint64: 8-16 (size 8, align 8)
PDUIPv6Prefix.Prefix.ip.z *internal/intern.Value: 16-24 (size 8, align 8)
PDUIPv6Prefix.Prefix.bitsPlusOne uint8: 24-25 (size 1, align 1)
padding: 25-32 (size 7, align 0)
PDUIPv6Prefix.ASN uint32: 32-36 (size 4, align 4)
PDUIPv6Prefix.Version uint8: 36-37 (size 1, align 1)
PDUIPv6Prefix.MaxLen uint8: 37-38 (size 1, align 1)
PDUIPv6Prefix.Flags uint8: 38-39 (size 1, align 1)
padding: 39-40 (size 1, align 0)
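
The same trick applies to the PDU structs: Version moves out of the first slot, where it forced 7 bytes of padding ahead of the 48-byte net.IPNet, and joins the other uint8 fields, taking both PDU types from 72 down to 40 bytes. A sketch under the same caveat as above, with a quick way to verify the size:

package main

import (
	"fmt"
	"net/netip"
	"unsafe"
)

// Field order matching the new layout above; the actual PR code may
// differ. Prefix (32 bytes) first, then the uint32, then the three
// uint8 fields packed together.
type PDUIPv4Prefix struct {
	Prefix  netip.Prefix
	ASN     uint32
	Version uint8
	MaxLen  uint8
	Flags   uint8
}

func main() {
	// Prints 40 on 64-bit platforms; the old layout was 72.
	fmt.Println(unsafe.Sizeof(PDUIPv4Prefix{}))
}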

In a simple side-by-side test, with stayrtr started at roughly the same time on two hosts, we can compare the memory RSS of the current master branch (yellow) against my forked branch (green):

[screenshot: memory RSS over time, master branch (yellow) vs. forked branch (green)]

For what it's worth, most of the memory allocation occurs when updating VRPs. The bulk of this is related to copying data, which I haven't found a clean workaround for, but that may be something to address in a future PR. A pprof view shows much of this coming from VRP.Copy():

[screenshot: pprof allocation graph dominated by VRP.Copy()]
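
This is plausible: with net.IPNet, a deep copy has to allocate and fill fresh IP and Mask slices for every single VRP. netip.Prefix is an immutable value type, so those per-prefix allocations disappear and a copy collapses into a struct assignment. Building on the VRPAfter sketch above (the real VRP.Copy() signature may differ):

// The only allocation left is the new VRP itself, which is
// consistent with the profile still being dominated by copying.
func (v *VRPAfter) Copy() *VRPAfter {
	c := *v // copies Prefix, ASN, MaxLen, and Flags in one assignment
	return &c
}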

Beyond ensuring all test cases pass, I have also run integration tests with BIRD 2.14 and confirmed that RTR functionality behaves as before with these changes.

benjojo commented 7 months ago

On the surface this looks fine. I would, however, want to see two stayrtrs (old and this PR) run head to head for a few days with rtrmon comparing the two, just to make sure :)

ties commented 7 months ago

Very nice analysis and a clear change!

> On the surface this looks fine. I would, however, want to see two stayrtrs (old and this PR) run head to head for a few days with rtrmon comparing the two, just to make sure :)

That is probably the best way to be sure it is functionally identical. This way you can check that the VRPs converge after a set delay. The relevant metric is vrp_diff (the number of VRPs seen by one source but not the other, bucketed by how long they have been visible), with a threshold high enough for both stayrtr-current and stayrtr-thispr to have updated.

floatingstatic commented 7 months ago

For clarification, is this an experiment you need me to run, or are you doing this?

ties commented 7 months ago

> For clarification, is this an experiment you need me to run, or are you doing this?

I would prefer it if you ran it (I likely won't have time until 2024). Do you have a working setup with prometheus?

If we agree that convergence should be within 256s, I would check the history for violations of that property:

max_over_time(vrp_diff{visibility_seconds="256"}[1h]) > 0

floatingstatic commented 7 months ago

OK, hopefully I did this right. I ran something like this on a Linux host with binaries built from my branch:

./stayrtr -bind 127.0.0.1:8282 -cache https://console.rpki-client.org/vrps.json -metrics.addr 127.0.0.1:9847
./rtrmon -addr 0.0.0.0:9866 -primary.host tcp://127.0.0.1:8282 -secondary.host https://console.rpki-client.org/vrps.json

The output in Prometheus looks like the below. I do not see anything with the label visibility_seconds="256"; not sure if that is unexpected:

[screenshot: rtrmon vrp_diff output in Prometheus]

For what it's worth, a brief test using binaries built from upstream looks the same. Anyway, I will leave this running for a couple of days and report back. Thanks!

ties commented 7 months ago

> OK, hopefully I did this right. I ran something like this on a Linux host with binaries built from my branch:
>
> ./stayrtr -bind 127.0.0.1:8282 -cache https://console.rpki-client.org/vrps.json -metrics.addr 127.0.0.1:9847
> ./rtrmon -addr 0.0.0.0:9866 -primary.host tcp://127.0.0.1:8282 -secondary.host https://console.rpki-client.org/vrps.json
>
> The output in Prometheus looks like the below. I do not see anything with the label visibility_seconds="256"; not sure if that is unexpected:

Those labels should be there eventually. Let's see!

cjeker commented 7 months ago

This not only blocks #105 but also additional work to make the delta handling for ASPA correct. It would be nice if this could get some priority to unblock this work.

floatingstatic commented 7 months ago

@ties I have had two instances of stayrtr running along with two instances of rtrmon in my local dev environment for a couple of days. There are two jobs, rtrmon and rtrmon-upstream, which cover binaries from my forked branch and current upstream, respectively. They were not started at exactly the same time, so there are some minor deltas. The output looks as follows:

[screenshot: vrp_diff for the rtrmon and rtrmon-upstream jobs]

I'm not sure exactly what you are looking for here; happy to share any additional views.

ties commented 7 months ago

> @ties I have had two instances of stayrtr running along with two instances of rtrmon in my local dev environment for a couple of days. There are two jobs, rtrmon and rtrmon-upstream, which cover binaries from my forked branch and current upstream, respectively. They were not started at exactly the same time, so there are some minor deltas. The output looks as follows:

Thanks for doing this!

> I'm not sure exactly what you are looking for here; happy to share any additional views.

I hoped to see a flat line, as in "the worst-case divergence converges within 256s". Can you try the following?

Alternatively, rtrmon can compare two stayrtrs, but reading the same JSON should also work if the JSON is from one source of truth.

ties commented 7 months ago

I just realised the default refresh interval is 600s; since a difference can persist for a full refresh cycle, the 851s and 1024s visibility buckets are the first that are likely to converge.

I am also collecting data myself now, I should be able to give an update tomorrow.

floatingstatic commented 7 months ago

I'm starting to wonder whether this is the best way to check this. I am using default timers for everything, but both upstream and my fork show a non-zero delta, so either the testing methodology is off or my setup is:

[screenshot: vrp_diff, forked branch]

[screenshot: vrp_diff, upstream]

For what it's worth, the upstream deltas look "worse" to my eye.

So it sounds like you suggest running two stayrtrs (fork and master branch) and configuring rtrmon to diff those two instead? I can give that a try and report back.

job commented 7 months ago

> So it sounds like you suggest running two stayrtrs (fork and master branch) and configuring rtrmon to diff those two instead? I can give that a try and report back.

That seems helpful indeed, before merging.

ties commented 6 months ago

> I'm starting to wonder whether this is the best way to check this. I am using default timers for everything, but both upstream and my fork show a non-zero delta, so either the testing methodology is off or my setup is:

Or the system does not converge as fast as I thought it would. I mostly run these tools in a monitoring role with (much) lower timers than you would want to run in real life.

> So it sounds like you suggest running two stayrtrs (fork and master branch) and configuring rtrmon to diff those two instead? I can give that a try and report back.

That would be the comparative test that I think will work (I'm writing this before looking at my data). For data collection I used this setup:

# build the main branch + this PR and tag them, i.e.
# docker build . -t stayrtr-structoptimize --target stayrtr
services:
  stayrtr-master:
    image: stayrtr-master
    command:
      - -refresh
      - "60"
      - -cache
      - https://rpki-validator.ripe.net/json
  stayrtr-structoptimize:
    image: stayrtr-structoptimize
    command:
      - -refresh
      - "60"
      - -cache
      - https://rpki-validator.ripe.net/json
      # - https://console.rpki-client.org/rpki.json
  rtrmon:
    image: rpki/rtrmon
    restart: unless-stopped
    ports:
      - 9866:9866
    command:
      - -primary.host
      - tcp://stayrtr-master:8282
      - -secondary.host
      - tcp://stayrtr-structoptimize:8282

My reason for picking rpki-validator.ripe.net/json is that it has a very frequent update interval.

Now, looking at the data, we see that the 1024s bucket stayed converged between the two instances at all times:

[screenshot: vrp_diff at the 1024s visibility bucket, flat at zero]

If we look at all the points in time, there were tiny differences that recovered (more aggressive timers might help there).

[screenshot: vrp_diff across all visibility buckets, showing small transient differences]

I forgot to capture the metrics directly from stayrtr so I do not know the comparative memory usage. But I trust your statistics on that.

It looks good to me. LGTM!