NICMx / Jool

SIIT and NAT64 for Linux
GNU General Public License v2.0

pool4 is underperforming #214

Closed by ydahhrk 8 years ago

ydahhrk commented 8 years ago

Started here. If my theory is correct, the bug is not tied to --mark; it's tied to the number of rows pool4 contains.

Even if --mark is at fault, this bug should be addressed. Even though RFC7422 and --mark are (for most purposes) roughly the same feature, and the former should naturally scale more elegantly as more clients need to be serviced, the latter is more versatile since it matches clients arbitrarily by means of iptables rules. So I don't see any reason to drop --mark once RFC7422 is implemented.

I believe this is the symptom that needs to be addressed:

When there are many pool4 entries, translation rate drops noticeably.
ydahhrk commented 8 years ago

New status:

Before getting to profiling, I wanted to make sure I had a clear baseline from which to start optimizing, so I tried extracting more data. I wound up with a different opinion on what's going on.

Here's the data. (See the README for details.)

From the fact that the "ip6tables full" curve stays stubbornly below "pool4 full", it looks like pool4 is actually less critical to performance than ip6tables. (At least when populated with 1500 rows.)

That is, I know the pool4 entry lookup can be optimized, but I don't think this will speed up translation as much as Pier hopes.

(@pierky: Do you get different results than me if you prolong the tests through several iperf calls?)

pierky commented 8 years ago

I performed some tests using the current test branch (https://github.com/NICMx/Jool/commit/ba4e7dbb4589c72a3d18a75ddca3b57734b95984); I used the same hardware and scenario as in my previous tests (100 Mbps NIC, see #175), but I introduced the mark-randomizer module. The tests were performed using 20 cycles of iperf.

Tests 2 and 3 brought the CPU to 100% because of hardware interrupts; test 1 only to 52%. @ydahhrk do you have similar results?

ydahhrk commented 8 years ago

Oh wow, we were planning to report at the same time :)

Before I analyse your data, I'd like to report my tests on the new code.

pool4 looks a lot faster now, at least compared to ip6tables. I even got rid of those annoying waves somehow.

Notice that the code was forked from the Jool 3.5 development branch, which might not be production-ready yet.

ydahhrk commented 8 years ago

Tests 2 and 3 brought the CPU to 100% because of hardware interrupts; test 1 only to 52%. @ydahhrk do you have similar results?

Oh, well I'd have to run the tests again, but I can believe it.

I take it that you think of that as a problem?

Is pool4 getting too full? Jool selects ports based on algorithm 3 of RFC 6056. This algorithm degrades "mostly" gracefully because reserved ports tend to scatter themselves randomly across the pool4 domain, which means that when there is a collision, finding a nearby unused port is relatively fast.

Until most ports are reserved, that is. When pool4 is completely reserved, for example, the processor will waste a lot of time looping through the whole pool4 domain looking for an unused port.
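For reference, here is a minimal sketch of an algorithm-3-style selection loop (illustrative only, not Jool's actual code; select_port, is_port_free and hash_packet_fields are made-up names standing in for the real BIB lookup and the RFC 6056 F() hash):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real BIB lookup and the RFC 6056 F() hash. */
static bool is_port_free(uint16_t port)
{
    (void)port;
    return rand() % 4 == 0; /* pretend ~75% of the range is already reserved */
}

static uint32_t hash_packet_fields(void)
{
    return rand(); /* F(local IP, remote IP, remote port, secret key) */
}

/*
 * Pick a source port from the pool4 range [min, max]: start at a hashed
 * offset and walk linearly until a free port shows up.
 *
 * While the range is mostly free this returns after a couple of iterations;
 * as reservations approach the range size, the loop tends toward a full
 * O(range) scan, which is the degradation described above.
 */
int select_port(uint16_t min, uint16_t max, uint16_t *result)
{
    uint32_t range = (uint32_t)max - min + 1;
    uint32_t offset = hash_packet_fields();
    uint32_t i;

    for (i = 0; i < range; i++) {
        uint16_t port = min + (offset + i) % range;
        if (is_port_free(port)) {
            *result = port;
            return 0;
        }
    }

    return -1; /* Every port in this pool4 entry is reserved. */
}

The hashed offset is what keeps reserved ports scattered across the range, and the linear walk after it is what becomes expensive once the range is nearly exhausted.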

This is an approximate representation of how port selection should degrade as the number of reserved bindings reaches the limit imposed by pool4:

[Graph: pool4performance]

Maybe this can be optimized, too.

pierky commented 8 years ago

I'll run new tests too: my last ones used a short range of ports per pool4 entry (~30), so collisions may have impaired performance. I'll try with more reserved ports per entry.

ydahhrk commented 8 years ago

We might be looking at a different problem than the port selection peak, actually.

I failed to mention it in my previous post, but the y-axis in the graph has an upper limit. (Which is the reason why I didn't draw it as an arrow.)

That limit is the number of transport addresses the relevant pool4 mark has.

So if your client's mark only has 30 ports, then that graph degrades as shown, except the peak is at 30. In other words, the worst-case scenario is 30 iterations. I don't think 30 iterations per packet are enough to keep your processor that busy. (Edit: I deleted a bunch of text here because it didn't really go anywhere.)

Edit: Actually, scratch that idea. If you ran 20 iperf calls, then each client is only using 20 out of the 30 pool4 addresses. Assuming the mark randomizer didn't cause clients to be mapped to marks that didn't belong to them, then you are not exhausting pool4 entries.

On the other hand, is it really degrading badly? I see two sort of comparable numbers (92.7 and 35.4/35.9) but that's not enough to draw a curve.

ydahhrk commented 8 years ago

(See edits above)

Questions:

pierky commented 8 years ago

Actually I don't have a real target client count; I just wanted to see how much better the new code performs and, as you also said, it's very good to see how much faster it is now. I hope to have more time to run new tests in a full GbE environment and to build a config that looks closer to a real-life scenario (something like 1000/2000 ports per entry). At that point I'll capture more metrics to see how that curve degrades.

Edit: the "do you have similar results" was not related to CPU usage; I just wondered if you got the same performance improvement with the new code. Sorry, my fault.

ydahhrk commented 8 years ago

Oh, ok. Thank you. :)

Then I guess that's it for the moment. I'll go back to tweaking the other issues.

ydahhrk commented 8 years ago

So I noticed the other day that the new bottleneck (ip6tables accessing thousands of entries sequentially) can also be addressed by using a particular variation of the MARK target.

The basic idea is, we might not be able to prevent ip6tables from walking through the whole database, but we can condense several ip6tables entries into one.
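To illustrate the idea (a sketch of the concept only, not the mark-src-range module's actual code; the function and parameter names are made up): a single rule covering, say, a /56 can derive a distinct mark for each /64 inside it by reading the address bits between the two prefix lengths, so one rule stands in for 256 plain MARK rules.

#include <stdint.h>
#include <netinet/in.h>

/*
 * Sketch of a MARKSRCRANGE-style rule: it matches prefix_len bits of the
 * source address and hands out consecutive marks, starting at mark_offset,
 * to each sub_len-sized block inside that prefix. E.g. with mark_offset = 1,
 * prefix_len = 56 and sub_len = 64: 2001:db8::/64 gets mark 1,
 * 2001:db8:0:1::/64 gets mark 2, ..., 2001:db8:0:ff::/64 gets mark 256.
 */
uint32_t srcrange_mark(const struct in6_addr *src, uint32_t mark_offset,
        unsigned int prefix_len, unsigned int sub_len)
{
    uint32_t block = 0;
    unsigned int bit;

    /* Read the address bits between prefix_len and sub_len as an integer. */
    for (bit = prefix_len; bit < sub_len; bit++)
        block = (block << 1) | ((src->s6_addr[bit / 8] >> (7 - bit % 8)) & 1);

    return mark_offset + block;
}

For example, with mark_offset = 4097, prefix_len = 56 and sub_len = 64, a source such as 2001:db8:1234:56ff::2 sits in block 0xff = 255 of its /56 and would get mark 4097 + 255 = 4352.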

The following graph shows the Mbits/sec that I gained by swapping 2048 MARK rules / 2048 pool4 entries (yellow pattern) for the equivalent 1 MARKSRCRANGE rule / 2048 pool4 entries (orange pattern). The blue pattern is 0 rules / 0 pool4 entries:

[Graph: mark-vs-marksrcrange]

Here are the details of the experiment.

pierky commented 8 years ago

So, very good results here using the brand new MARKSRCRANGE (https://github.com/NICMx/mark-src-range/commit/0e3fde545504a7fb8b1d0ab815e43355c88ff11f).

I performed some measurements in the following scenario:

|--------|            |-------|            |----------|
| sender | --[GbE]--> | NAT64 | --[GbE]--> | receiver |
|--------|            |-------|            |----------|

The NAT64 is always the same hardware; this time it's connected via GbE to the two other hosts.

Using the old configuration (6400 jool/ip6tables rules, without MARKSRCRANGE and with mark-randomizer) I got results close to the previous ones: TCP 42.6 Mbps.

In the new configuration, I used the same branch of Jool as in my previous test (ba4e7dbb4589c72a3d18a75ddca3b57734b95984), this time with 6656 pool4 entries (jool --pool4 prints 19968 samples) and 6656 source IPv6 prefixes marked using 26 ip6tables MARKSRCRANGE rules (26 × /56, each split into /64s, i.e. 26 × 256 = 6656 marks), with 1000 ports per source /64.

The sender's IPv6 address falls within the last ip6tables MARKSRCRANGE rule.

Using iperf from sender to receiver, I got:

These are the same values I obtained using iperf between sender and NAT64 directly (peak at 929 Mbps TCP and 941 Mbps UDP).

ydahhrk commented 8 years ago

(^_^)/ \(^_^)

These are the same values I obtained using iperf between sender and NAT64 directly (peak at 929 Mbps TCP and 941 Mbps UDP).

Does this mean that translation adds no overhead whatsoever?

I find this a little too impressive (as in "worrying")

pierky commented 8 years ago

I find it odd too; that's why I wanted to report it here.

I had only a little time to run these tests and I spent most of it setting up the new scenario with the two GbE-enabled hosts (sender and receiver). The hosts were connected directly to each other, just like in the "diagram" above; no switches or other equipment were used. I can't say I saw with my own eyes (for example with tcpdump) packets entering from the v6 interface and leaving from the v4 one, but the iperf commands used on the sender (iperf -V -c 64:ff9b::10.0.0.2) and on the receiver (iperf -s), together with the network topology and addressing scheme, make me confident that they were translated by Jool somehow. The source ports seen by the receiver were consistent with those expected from the translation rule for the given source prefix. I also had a look at top's output on the NAT64 during a transfer and I saw the two CPUs involved with the NICs' interrupts rising to 90% and 30% (if I recall correctly).

Unfortunately I'll have to wait until next week, when I'll spend some time on it again; now that the lab is already up I'll have more time for the real measurements. First of all I'll double-check every step again, and I'll capture and write down all the aforementioned metrics.

In the meantime, any suggestion or hint to dispel this doubt will be very much appreciated. :-)

ydahhrk commented 8 years ago

Well, I'm guessing it most likely won't add much valuable insight, but maybe the BIB/session tables can also be queried to validate that Jool is actually doing what we expect it to.

pierky commented 8 years ago

I forgot to mention it, but I also checked --session's output and the iperf sessions were there.

Yes, I feel like I'm missing something very big; maybe I'm not seeing the forest for the trees here. I'll reproduce the tests and try to dig deeper into this unexpected (suspiciously well-performing) behaviour.

ydahhrk commented 8 years ago

Thanks :)

pierky commented 8 years ago

So, everything seems fine with the results I got last week.

Packets from the sender toward the receiver enter the v6 interface and leave through the v4 interface; Jool translates them and uses the IPv4 address:port expected from the MARKSRCRANGE mark / pool4 entry mapping.

The NAT64 is using CPU3 for the v6 interface's interrupts and CPU0 for the v4 interface's.

IPv6-to-IPv6 tests from the sender toward the NAT64 give 928 Mbps, with at most 20% CPU3 hardware interrupts (as reported by top).

IPv6-to-IPv4 tests from the sender toward the receiver give the same value, 928 Mbps, with at most 90% CPU3 and 53% CPU1 hardware interrupts.

So CPU3 (= v6 interface interrupts) rises from 20% to 90%, but traffic keeps flowing without problems.

I tried putting the ip6tables rule that matches the sender's source address both at the end and in the middle of the ip6tables ruleset, with no difference in the results.

ip6tables ruleset (excerpt):

MARKSRCRANGE  all      2001:db8::/56       ::/0                marks 1-256 (0x1-0x100) /56/64
MARKSRCRANGE  all      2001:db8:1::/56     ::/0                marks 257-512 (0x101-0x200) /56/64
...
MARKSRCRANGE  all      2001:db8:15::/56    ::/0                marks 3841-4096 (0xf01-0x1000) /56/64
MARKSRCRANGE  all      2001:db8:1234:5600::/56  ::/0                marks 4097-4352 (0x1001-0x1100) /56/64
MARKSRCRANGE  all      2001:db8:16::/56    ::/0                marks 4353-4608 (0x1101-0x1200) /56/64
MARKSRCRANGE  all      2001:db8:17::/56    ::/0                marks 4609-4864 (0x1201-0x1300) /56/64
...
MARKSRCRANGE  all      2001:db8:30::/56    ::/0                marks 7937-8192 (0x1f01-0x2000) /56/64

The pool4 entry for the sender's mark:

4352    TCP     10.0.0.75       36000-36999

Session entries (--session output):

TCP,2001:db8:1234:56ff::2,58094,64:ff9b::a00:2,5001,10.0.0.75,36661,10.0.0.2,5001,00:03:03.60,V4_FIN_V6_FIN_RCV
TCP,2001:db8:1234:56ff::2,58096,64:ff9b::a00:2,5001,10.0.0.75,36662,10.0.0.2,5001,00:03:13.84,V4_FIN_V6_FIN_RCV
TCP,2001:db8:1234:56ff::2,58098,64:ff9b::a00:2,5001,10.0.0.75,36663,10.0.0.2,5001,00:03:23.116,V4_FIN_V6_FIN_RCV
TCP,2001:db8:1234:56ff::2,58100,64:ff9b::a00:2,5001,10.0.0.75,36664,10.0.0.2,5001,00:03:33.148,V4_FIN_V6_FIN_RCV
TCP,2001:db8:1234:56ff::2,58102,64:ff9b::a00:2,5001,10.0.0.75,36665,10.0.0.2,5001,00:03:43.180,V4_FIN_V6_FIN_RCV
TCP,2001:db8:1234:56ff::2,58104,64:ff9b::a00:2,5001,10.0.0.75,36666,10.0.0.2,5001,00:03:53.212,V4_FIN_V6_FIN_RCV
TCP,2001:db8:1234:56ff::2,58106,64:ff9b::a00:2,5001,10.0.0.75,36667,10.0.0.2,5001,02:00:00.0,ESTABLISHED

iperf output on the receiver:

------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36661
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  1.08 GBytes   928 Mbits/sec
[  5] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36662
[  5]  0.0-10.0 sec  1.08 GBytes   928 Mbits/sec
[  4] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36663
[  4]  0.0-10.0 sec  1.08 GBytes   928 Mbits/sec
[  5] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36664
[  5]  0.0-10.0 sec  1.08 GBytes   927 Mbits/sec
[  4] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36665
[  4]  0.0-10.0 sec  1.08 GBytes   928 Mbits/sec
[  5] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36666
[  5]  0.0-10.0 sec  1.08 GBytes   928 Mbits/sec
[  4] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36667
[  4]  0.0-10.0 sec  1.08 GBytes   928 Mbits/sec
[  5] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36668
[  5]  0.0-10.0 sec  1.08 GBytes   928 Mbits/sec
[  4] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36669
[  4]  0.0-10.0 sec  1.08 GBytes   927 Mbits/sec
[  5] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36670
[  5]  0.0-10.0 sec  1.08 GBytes   928 Mbits/sec

NAT64 configuration:

Linux nat64-test 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
ydahhrk commented 8 years ago

Thank you for your efforts!

I'm guessing a single client is simply unable to saturate the translator now that, in this configuration, Jool has stopped being the bottleneck. If more clients and bandwidth are added to the mini-DoS attack, the CPUs will probably start hobbling.

Also, iperf only measures bandwidth. Other parameters (latency, throughput, jitter...) might also provide further insight.

(But I'd say that is outside of the scope of this issue.)

ydahhrk commented 8 years ago

3.5 released; closing.