Rantanen / node-dtls

JavaScript DTLS implementation for Node.js
ISC License

example/send_throughput with compare to plain udp #3

Closed guymguym closed 9 years ago

guymguym commented 9 years ago

I took the example/throughput.js and changed it to send in one direction only, in batches; the receiver returns small acks just to trigger the next batch.

It accepts arguments such as --udp and --size 1000.
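For reference, the "Using Arguments" output below looks like a minimist-style parse; here is a minimal self-contained sketch of that kind of parsing (parseArgs is a hypothetical stand-in for whatever library the example actually uses; the option names and defaults come from the output below):

```javascript
// Hypothetical stand-in for the example's argument parsing (the real script
// may use a library such as minimist; option names come from the output below).
function parseArgs(argv, defaults) {
  const args = Object.assign({ _: [] }, defaults);
  for (let i = 0; i < argv.length; i++) {
    const a = argv[i];
    if (a.slice(0, 2) === '--') {
      const next = argv[i + 1];
      if (next !== undefined && next.slice(0, 2) !== '--') {
        args[a.slice(2)] = isNaN(next) ? next : Number(next);
        i++; // consumed the value
      } else {
        args[a.slice(2)] = true; // bare flag, e.g. --udp
      }
    } else {
      args._.push(a); // positional argument
    }
  }
  return args;
}

const defaults = { udp: false, integrity: false, size: 8000,
                   batch: 20, acktime: 20, port: 23395, time: 5000 };
console.log(parseArgs(['--udp', '--size', '1000'], defaults));
// → { _: [], udp: true, integrity: false, size: 1000, batch: 20, ... }
```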

Here is the throughput comparison. First, with plain UDP:

$ node example/send_throughput.js --udp 
Using Arguments { _: [],
  udp: true,
  integrity: false,
  size: 8000,
  batch: 20,
  acktime: 20,
  port: 23395,
  time: 5000 } 

Client connected.
Server connected.
..............................................

Sent Packets     : 92620
Received Packets : 92586
Acks             : 4632 (2 timedout)
Size             : 8000 B
Time             : 5000 ms
Throughput       : 144665.625 KB/s

And now with DTLS:

$ node example/send_throughput.js 
Using Arguments { _: [],
  integrity: false,
  size: 8000,
  batch: 20,
  acktime: 20,
  udp: false,
  port: 23395,
  time: 5000 } 

Server connected.
Client connected.
...

Sent Packets     : 7880
Received Packets : 7590
Acks             : 395 (19 timedout)
Size             : 8000 B
Time             : 5000 ms
Throughput       : 11859.375 KB/s
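Both throughput figures follow directly from the counters above: received packets × packet size / elapsed seconds, in KB/s. A quick check:

```javascript
// Reproduce the reported throughput from the raw counters:
// KB/s = received packets * packet size in bytes / elapsed seconds / 1024
function throughputKBps(receivedPackets, sizeBytes, timeMs) {
  return receivedPackets * sizeBytes / (timeMs / 1000) / 1024;
}

console.log(throughputKBps(92586, 8000, 5000)); // plain UDP → 144665.625
console.log(throughputKBps(7590, 8000, 5000));  // DTLS      → 11859.375
```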
Rantanen commented 9 years ago

Did some testing with this:

Currently there is a 36% overhead caused by the DTLS layering (DTLS envelope, eventing, etc.): an unencrypted packet over plain UDP takes 34.3 ns, while a DTLS packet without encryption takes 46.6 ns. An encrypted packet takes 232.9 ns. So even if we got the unencrypted overhead down from 46.6 ns to 34.3 ns (saving 12.3 ns), encrypted packets would still take 220.6 ns, which is only about a 5% increase in throughput.
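For reference, the arithmetic behind those percentages (numbers taken from the measurements above):

```javascript
// Per-packet times from the measurements above, in nanoseconds.
const udpPlainNs = 34.3;        // raw UDP packet
const dtlsPlainNs = 46.6;       // DTLS packet without encryption
const dtlsEncryptedNs = 232.9;  // DTLS packet with encryption

// Layering overhead: (46.6 - 34.3) / 34.3 ≈ 36%
const layeringOverhead = (dtlsPlainNs - udpPlainNs) / udpPlainNs;

// If the layering cost were eliminated entirely, an encrypted packet
// would still take 232.9 - 12.3 = 220.6 ns ...
const bestCaseNs = dtlsEncryptedNs - (dtlsPlainNs - udpPlainNs);

// ... which is only a ~5% throughput gain: 232.9 / 220.6 - 1
const throughputGain = dtlsEncryptedNs / bestCaseNs - 1;

console.log((layeringOverhead * 100).toFixed(0) + '%'); // → 36%
console.log((throughputGain * 100).toFixed(1) + '%');   // → 5.6%
```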

When running parallel processes, the throughput scaled linearly up to 4 processes, gaining roughly 30M/s per process. From 4 up to 8 processes the gain dropped to 15M/s per additional process, and above 8 the throughput flattened out. Given this was executed on an octa-core CPU, it seems clear this is a CPU-heavy operation, with each process utilizing two threads. The same behavior was visible on raw UDP sockets.

[chart: throughput vs. number of parallel processes. Blue: DTLS, Green: raw UDP]

Based on this, I'm somewhat satisfied with my own code. The bottleneck seems to be the encrypt/decrypt methods, which are a bit out of my hands at the moment. Switching to a pure JavaScript implementation might gain performance, as we'd save the overhead of native calls, but it might equally lose performance, as OpenSSL is heavily optimized for crypto. The parts that would be easy for me to optimize would yield about a 5% throughput increase, which I don't deem worth it at this point.

guymguym commented 9 years ago

If you are using a Mac, the Instruments Time Profiler gives amazing info.

For example, in this run with DTLS:

[screenshot: Instruments Time Profiler, DTLS run]

You can see that there are 4 libuv threads consuming a total of 16% (4 × 4%), mostly in uv__getaddrinfo_work ... so something DNS-related is working very hard there.

For the main thread you can see:

[screenshot: main-thread profile, DTLS run]

For comparison, with UDP roughly 30% is spent in sendmsg/recvmsg, and the rest on memory allocation (madvise):

[screenshot: main-thread profile, plain UDP run]

guymguym commented 9 years ago

This is actually quite consistent: the UDP benchmark pushes 10x more packets than DTLS, and you can see that same 10x ratio in the profiler time spent in sendmsg/recvmsg ... so the more time spent on other work, the fewer packets you can push.