Wow, thank you for your hard work!
> I could not figure out why this code is in the critical section. Are crypto_shash functions thread unsafe?
Probably not. I don't remember because this was 3 years ago, but my guess is that I immediately defaulted to locking all concurrent usage of the variable shash simply because it is global and not constant. I tend to do that.
However, according to the Linux history, this choice was completely misdirected: "shash is reentrant." So it does look like the spinlock can be safely discarded.
I was, in fact, considering turning shash into per-cpu variables to prevent the spinning, but the priority of this has always been below a bunch of other stuff. But the spinlock removal actually sounds like it could be done in a snap. (Perhaps in a pull request, even. ;) )
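For what it's worth, here is a minimal sketch of what the lock removal could look like, assuming the MD5 shash transform really is reentrant as claimed; the names (shash, rfc6056_md5) are illustrative, not Jool's actual code:

```c
#include <crypto/hash.h>

/* Shared MD5 transform; read-only after initialization. */
static struct crypto_shash *shash;

static int rfc6056_md5(const u8 *input, unsigned int len, u8 out[16])
{
	/* Each caller gets its own descriptor on the stack, so the shared
	 * tfm needs no spinlock around the digest. */
	SHASH_DESC_ON_STACK(desc, shash);

	desc->tfm = shash;
	return crypto_shash_digest(desc, input, len, out);
}
```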
> I was wondering if it is necessary to generate a random Identification for every packet. I captured Tayga's translations, and it seems that Tayga sets the Identification to zero. Cisco IOS 15.4(1)T, same story.
ID generation is mandatory now. They killed the zero ID when they purged atomic fragments. I can't recall the rationale off the top of my head, but presumably it's buried somewhere in RFC 8021.
RFCs 8021 and 7915 are somewhat new. Tayga and Cisco are probably following the old rules.
> Can the IPv4 Identification generation be optimized further? For example, generate a random Identification when the BIB entry is created and then just increment it with every packet?
Well, I'm going to be very surprised if we didn't think of just having a global monotonic counter. There should be a reason for this. I think there is some security risk if the ID is predictable, but this tends to slide off my brain over time. Let me check my e-mails.
Even if the ID is meant to be random, what's not mandatory is the usage of the get_random_bytes() function. Maybe we could offer a slightly less random but faster option. If it's really slowing Jool down that much, I guess the kernel uses some other method that we should probably rip off.
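As a rough illustration of such an option (just a sketch, not something Jool does today), the kernel's non-cryptographic PRNG is much cheaper than get_random_bytes(); whether its predictability is acceptable for the IPv4 ID is exactly the open question:

```c
#include <linux/ip.h>
#include <linux/random.h>	/* declares prandom_u32() on the 4.x kernels discussed here */

/* Hypothetical helper: fill the Identification with a cheap pseudo-random
 * value instead of cryptographically strong random bytes. */
static void set_fast_random_id(struct iphdr *hdr4)
{
	u16 id = (u16)prandom_u32();

	hdr4->id = (__force __be16)id;	/* byte order is irrelevant for a random value */
}
```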
> For example, generate a random Identification when the BIB entry is created and then just increment it with every packet?

> I guess the kernel uses some other method that we should probably rip off.
Yeah, they seem to use __ip_select_ident(). It creates a random base number only the first time, and then increases monotonically. On subsequent calls, it takes less than 1/20th of the time get_random_bytes() does.
Guess we should use that as well.
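For reference, a minimal sketch of what that could look like (illustrative names; the __ip_select_ident() signature below is the one from roughly 4.2+ kernels and has changed across versions):

```c
#include <net/ip.h>

/* Let the kernel's own generator pick the Identification of a freshly
 * translated IPv4 header, instead of calling get_random_bytes(). */
static void set_translated_id(struct net *net, struct iphdr *hdr4)
{
	/* Internally this picks a counter that starts at a random value and
	 * then increments, which is far cheaper than fresh random bytes for
	 * every packet. */
	__ip_select_ident(net, hdr4, 1);
}
```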
Uploaded optimizations to the issue282 branch. For the moment, the code only compiles in kernels 4.2+.
I found a quirk while tweaking: next_ephemeral (from RFC 6056, algorithm 3) is not used. It seems the relevant code was lost during some old refactor. This is probably also slowing some operations down.
I'll try fixing it tomorrow.
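For context, this is roughly what Algorithm 3 of RFC 6056 ("Simple Hash-Based Port Selection") prescribes and where next_ephemeral fits in; the code below is a plain restatement of the RFC's pseudocode, not Jool's implementation, and port_is_available() is a hypothetical helper:

```c
#include <linux/types.h>

/* Hypothetical helper, assumed to exist elsewhere: is this port free? */
bool port_is_available(unsigned int port);

/* next_ephemeral: initialized once (possibly randomly), then only incremented. */
static unsigned int next_ephemeral;

/* offset is F(local addr, remote addr, remote port, secret key), i.e. the
 * MD5-based value computed by rfc6056_f(). */
static int pick_port(unsigned int offset,
		     unsigned int min_port, unsigned int max_port)
{
	unsigned int num_ephemeral = max_port - min_port + 1;
	unsigned int count = num_ephemeral;

	do {
		unsigned int port = min_port
				+ (next_ephemeral + offset) % num_ephemeral;
		next_ephemeral++;

		if (port_is_available(port))
			return port;
	} while (--count > 0);

	return -1;	/* no suitable port found */
}
```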
Good job! I am going to update the kernel on the test router and do the tests.
Kernel: 4.4.178-1.el7.elrepo.x86_64
Pure IPv6: 2.3 Mpps
Jool-master: 0.72 Mpps
Jool-issue282: 1.14 Mpps
I think this is a great result.
Nice!
Question: When you tested Jool in a namespace, were the clients located on the same machine?
In all tests pc-gen generates traffic to pc-col through rtr-netx (nat64). Src: 2001:db8:111::2 dst: 2001:db8:4::192.168.112.2
Oh.
So isn't the 100% CPU utilization in the A.8.d graph rather worrying?
Yes, it is. The PPS result in a virtual network namespace is low: 444 Kpps in the current setup. This means that the 10Gbit line is not saturated and the processor is overloaded. I suppose it's the namespace overhead.
Hmm. But it still sucks that A.8.c stays at 80% while iptables NAT (A.8.e) seems to stay at 40%. (What's with the holes?)
Are you planning to redo A.8.c and A.8.d with the new optimizations?
Otherwise, how are you generating the traffic?
The traffic is generated with PF_RING. It reads a PCAP file and plays it to the network card.
The holes in IPv4 are due to bad flow distribution between cores. I can do it again with more traffic flows and utilize all cores.
[Attached graphs: Jool-282, Jool-282-namespace, Pure IPv6]
Ok. Indeed, the results look reasonable and the performance seems solid to me.
I'll work on finishing the next_ephemeral code and try to release Jool 4.0.1 next week.
If you find more bottlenecks, reports will be welcome.
Ok, thanks.
BTW: This was a pretty substantial contribution. If you want credits in Jool's README, just state what you'd like included.
It would be an honor!
Jan Pokorny - FIT VUTBR
Thank you.
Super interesting stuff @JohnyGemityg! :clap::+1:
I am assuming that in both situations, the Jool instance is of the Netfilter type? Have you tried the iptables type too? It would be interesting to see if there is any difference in performance between the two.
If you have the time and opportunity, it would also be very interesting to see how SIIT mode compares to NAT64.
Hi @toreanderson,
yes, all tests were Netfilter type.
Tried SIIT and the result is 2 Mpps, which is the maximum I can get on the current setup. So: great result, no problems.
I also tried Netfilter vs. iptables:
netfilter: 1139.4 kpps
iptables: 1154.9 kpps
I would say no difference.
v4.0.1 released; closing bug.
Hi,
First of all, I want to thank you for the amazing work you did for NAT64.
In my thesis I work with NAT64 solutions for Linux, including Jool. I did some performance evaluation and would like to share the results with you and discuss some optimization possibilities. I did some throughput tests on a 10 Gbps topology.
NAT64 router:
Intel(R) Xeon(R) CPU D-1587 @ 1.70GHz
4x 4GB DDR4 2133MHz
Ethernet Connection X552 10 GbE SFP+
Linux 3.10.0-693.17.1.el7.netx.x86_64
I tested Tayga, Jool, and Jool in a network namespace, plus pure IPv6, pure IPv4, and iptables masquerade routing for comparison. Besides that, I captured the CPU load on the NAT64 router during the tests. Here is the result.
Jool is performing well and can route around 1 Mpps on my topology. That is great, but if I compare it to regular iptables masquerade (3 Mpps), there is some space for optimizations.
I tried to do some research and find what slows it down. I did some perf captures during the tests and figured out that a lot of time is spent waiting on locks.
From the first capture, I figured out that the waiting happens in the rfc6056_f function.
I suppose the function randomly assigns port numbers based on MD5 checksums. The MD5 checksums are calculated using the crypto_shash functions in a critical section. I could not figure out why this code is in the critical section. Are crypto_shash functions thread unsafe? I experimented and tried to remove the critical section (removed the lock and unlock calls). It seemed to work, but there was no performance increase. I did another perf capture and found that the time is now spent in the get_random_bytes function, which is used to generate the IPv4 Identification field for every packet.
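To make the description above concrete, the hot path looked roughly like this (an illustrative reconstruction, not Jool's literal code): the MD5 digest guarded by a spinlock, plus fresh random bytes for the Identification of every translated packet.

```c
#include <crypto/hash.h>
#include <linux/ip.h>
#include <linux/random.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(hash_lock);	/* the contended lock */
static struct crypto_shash *shash;	/* MD5 transform, allocated at init */

static void translate_one_packet(struct iphdr *hdr4,
				 const u8 *f_input, unsigned int f_len)
{
	u8 md5[16];
	__be16 id;
	SHASH_DESC_ON_STACK(desc, shash);

	desc->tfm = shash;

	/* RFC 6056's F(): MD5 over addresses/ports + secret, under the lock. */
	spin_lock_bh(&hash_lock);
	crypto_shash_digest(desc, f_input, f_len, md5);
	spin_unlock_bh(&hash_lock);

	/* A fresh random IPv4 Identification for every single packet. */
	get_random_bytes(&id, sizeof(id));
	hdr4->id = id;
}
```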
I tried to remove the random Identification generation and set it to a static value. The performance increase was about 30%.
An interesting fact is that for the 30% performance increase, both "optimizations" must be present. Removing just the get_random_bytes calls has no impact.
I was wondering if it is necessary to generate a random Identification for every packet. I captured Tayga's translations, and it seems that Tayga sets the Identification to zero. Cisco IOS 15.4(1)T, same story.
I read RFCs 2765, 6145, and 7915; each defines the IP/ICMP Translation Algorithm and obsoletes the previous one. It seems the newest one, RFC 7915, says that the Identification is now mandatory:
"Identification: Set according to a Fragment Identification generator at the translator."
On the other hand, generating a random Identification causes reuse of the IP ID field to occur probabilistically (RFC 4963).
My questions are: Is the critical section in rfc6056_f mandatory? Is generating the IPv4 Identification necessary? Can the IPv4 Identification generation be optimized further? For example, generate a random Identification when the BIB entry is created and then just increment it with every packet? Is there anything I can do to get better performance results?
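To make the last idea concrete, here is a rough sketch of what I mean (purely illustrative; the struct and field names are made up, not Jool's):

```c
#include <linux/atomic.h>
#include <linux/random.h>
#include <linux/types.h>

struct bib_entry_example {
	/* ... the existing BIB fields ... */
	atomic_t id_counter;	/* hypothetical extra field */
};

/* Seed the counter with random bytes once, when the BIB entry is created. */
static void bib_entry_init_id(struct bib_entry_example *bib)
{
	u16 seed;

	get_random_bytes(&seed, sizeof(seed));
	atomic_set(&bib->id_counter, seed);
}

/* Then every translated packet just increments it; no RNG in the hot path. */
static u16 bib_entry_next_id(struct bib_entry_example *bib)
{
	return (u16)atomic_inc_return(&bib->id_counter);
}
```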
Thank you.