Rantanen / node-dtls

JavaScript DTLS implementation for Node.js
ISC License

Performance benchmark? #1

Closed guymguym closed 9 years ago

guymguym commented 9 years ago

Hey, This is a cool project :+1:

Any idea how the performance compares to native OpenSSL?

I ran an OpenSSL benchmark using the "DTLS Character Generator Server and Client" code from http://sctp.fh-muenster.de/dtls-samples.html, tested on a MacBook Air (i5), and got around 1600 MB/s.

Here are the instructions:

$ curl http://sctp.fh-muenster.de/dtls/dtls_udp_chargen.c > dtls_udp_chargen.c
$ gcc -o dtls_udp_chargen dtls_udp_chargen.c -lcrypto -lssl -w 
$ mkdir certs
$ openssl genrsa > certs/server-key.pem
$ openssl genrsa > certs/client-key.pem
$ openssl req -new -x509 -days 365 -key certs/server-key.pem -out certs/server-cert.pem
$ openssl req -new -x509 -days 365 -key certs/client-key.pem -out certs/client-cert.pem
$ ./dtls_udp_chargen -L 127.0.0.1 &
$ ./dtls_udp_chargen -l $((8*1024)) 127.0.0.1

Statistics:
========================================

Sent messages:                    205200
Received messages:                  2953

Messages lost:                         0

Renegotiations initiated:              7

$ echo $((205200*8*1024/1024/1024))
1603
Rantanen commented 9 years ago

That example doesn't seem to do anything for me: dtls_udp_chargen exits immediately no matter what I do. I guess I need to come up with my own performance test once I get client authentication working.

I'll try to get some benchmarking data out once the project is more complete. However, below is a brief overview of how node-dtls handles application data:

  1. dgram.Socket message event -> DtlsServer#_onMessage
  2. DtlsServer#_onMessage does an object lookup for the DtlsSocket based on rinfo.address and rinfo.port and passes the packet to DtlsSocket#handle.
  3. DtlsSocket#handle passes the packet to DtlsRecordLayer#getPackets for decrypting.
  4. DtlsRecordLayer#getPackets reads the packet header (a couple of Buffer#readXXX calls) and calls DtlsRecordLayer#decrypt:
    1. Buffer#slice the packet a couple of times to extract the IV and cipher content.
    2. Create Node crypto.Decipher with the correct parameters.
    3. cipher.update( ciphertext ), cipher.final()
    4. More Buffer#slice to extract the MAC
    5. crypto.createHmac, one more Buffer#slice (that last one could be skipped if needed, but slice is a fast method anyway as far as I know).
    6. Verify MAC with Buffer#equals.
  5. After decrypting, DtlsRecordLayer#getPackets uses a callback to return the decrypted packet to DtlsSocket#handle:
    1. DtlsSocket#handle uses the packet type to do a function lookup for the processing function
    2. DtlsSocket#process_applicationData emits the message
  6. If there is more data available in the UDP packet, DtlsRecordLayer#getPackets continues from step 4.

So there's a couple of function calls and two object lookups. The biggest performance waste is probably turning the UDP packet into a DtlsPlaintext, as I'm using my own generic library implementation instead of raw Buffer#readXXX (Examples). Although even this shouldn't take too long, given there's only one nested structure (ProtocolVersion) within the DtlsPlaintext.
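
For reference, reading that header with raw Buffer#readXXX would look roughly like this (field layout per RFC 6347: type, version, epoch, 48-bit sequence number, length; readRecordHeader is an illustrative name, not node-dtls's API):

```javascript
// DTLS record header: type (1) + version (2) + epoch (2) +
// sequence number (6) + length (2) = 13 bytes.
function readRecordHeader(packet) {
  return {
    type: packet.readUInt8(0),
    version: { major: packet.readUInt8(1), minor: packet.readUInt8(2) },
    epoch: packet.readUInt16BE(3),
    sequenceNumber: packet.slice(5, 11),  // 48-bit value, kept as a Buffer
    length: packet.readUInt16BE(11)
  };
}
```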

Most of the heavy lifting (Deciphering, Hmac calculation) is done inside OpenSSL using Node's crypto module.

It's not going to reach the 1600 MB/s numbers on your machine, though, given that Node is inherently single threaded and your example was multithreaded, I believe. However, now that I look at the call path listed above, I don't think it should be that much slower than a fully native (single-threaded) module - assuming the biggest performance bottleneck is calculating the plaintext/MAC.

Rantanen commented 9 years ago

Made a small throughput test (examples/throughput.js):

server.onMessage: echo the packet back
client.onMessage: count++; echo the packet back

1300 packets/s with 1000 bytes of data per packet, so roughly 1.3 MB/s both ways.

It's a bit disappointing, but we'll see how much I can gain by optimizing it a bit. One easy win I noted was a 25% performance increase (not included in the numbers above) when I replaced dtls.connect( .., 'localhost', .. ) with dtls.connect( .., '127.0.0.1', .. ). Shows how tight that execution is when a simple DNS query that shouldn't even hit the network takes a fifth of the execution time.

With these initial numbers in, I'm closing this issue. I'll try to get some metrics to the readme once I'm done optimizing.

guymguym commented 9 years ago

hey,

the difference between localhost and 127.0.0.1 is also noticeable in other benchmarks.

try increasing the packet size significantly - 16K or so. UDP, and certainly DTLS, has per-packet CPU overhead for going through the stack (node -> v8 -> libuv -> kernel); increasing the packet size amortizes that overhead. also notice that all these buffer slices create multiple buffer objects per packet that require garbage collection.
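
on the slice point: in Node, Buffer#slice returns a view over the parent's memory rather than a copy, so the per-slice cost is the small object allocation, not a data copy - e.g. (using the current Buffer.from spelling; the 0.12-era equivalent was new Buffer):

```javascript
var buf = Buffer.from('abcdef');
var view = buf.slice(1, 4);  // no copy: shares the parent buffer's memory

view[0] = 0x7a;              // write 'z' through the view...
console.log(buf.toString()); // ...and the parent sees it: 'azcdef'
```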

I believe that a high-throughput solution would have to use native code that can dereference the packet memory directly, reducing the transitions between node and v8 and creating much less work for the garbage collector per packet.

btw it fails with node 0.12 and 0.11 like this:

$ node example/throughput.js 
~/node-dtls/packets/PacketSpec.js:213
            builder[ 'writeUInt' + length ]( value.length );
                                                  ^
TypeError: Cannot read property 'length' of null
    at null.<anonymous> (~/node-dtls/packets/PacketSpec.js:213:51)
    at Function.PacketSpec.writeItem (~/node-dtls/packets/PacketSpec.js:143:28)
    at null.<anonymous> (~/node-dtls/packets/PacketSpec.js:219:28)
    at Function.PacketSpec.writeItem (~/node-dtls/packets/PacketSpec.js:143:28)
    at PacketSpec.write (~/node-dtls/packets/PacketSpec.js:76:20)
    at Packet.getBuffer (~/node-dtls/packets/Packet.js:21:22)
    at HandshakeBuilder.createHandshakes (~/node-dtls/HandshakeBuilder.js:31:26)
    at HandshakeBuilder.createHandshakes (~/node-dtls/HandshakeBuilder.js:27:50)
    at ServerHandshakeHandler.send_serverHello (~/node-dtls/ServerHandshakeHandler.js:191:44)
    at ServerHandshakeHandler.process (~/node-dtls/ServerHandshakeHandler.js:94:20)

and with node 0.10 like this:

$ node example/throughput.js 

dgram.js:87
  throw new Error('Bad socket type specified. Valid types are: udp4, udp6');
        ^
Error: Bad socket type specified. Valid types are: udp4, udp6
    at newHandle (dgram.js:87:9)
    at new Socket (dgram.js:112:16)
    at Object.exports.createSocket (dgram.js:129:10)
    at Object.DtlsServer.createServer (~/node-dtls/DtlsServer.js:26:29)
    at Object.<anonymous> (~/node-dtls/example/throughput.js:18:19)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
Rantanen commented 9 years ago

Huh? Node 0.12 should work:

wace@jubjub:~/projects/node-dtls$ node --version
v0.12.2
wace@jubjub:~/projects/node-dtls$ node example/throughput.js
Server received client#Finished and is ready.
Client received server#Finished and is ready.
Packets:    21363
Size:       1000 B
Time:       15000 ms
Throughput: 1390.8203125 KB/s
wace@jubjub:~/projects/node-dtls$

Pre-0.12 won't work, as I'm currently using the publicEncrypt/privateDecrypt methods for the key exchange. These methods were added in 0.12, I believe. I might replace them with some library at some point, which should get 0.11 working. 0.10 requires a different dgram constructor at least, so that shouldn't be too hard once the crypto methods are replaced.

The packet size of 1000 is a bit conservative, but going past 1500 bytes (the limit of Ethernet v2, PPPoE, etc.) makes little sense - and definitely not past 9000 (Ethernet jumbo frames). While it would get the numbers up a bit, a configuration like that isn't feasible for communicating over the internet. At least until we get MTU discovery.
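
To put rough numbers on the per-record byte overhead (assuming an AES-128-CBC / HMAC-SHA1 record: 13-byte header, 16-byte IV, 20-byte MAC, and up to 16 bytes of padding - my assumptions, not measured from node-dtls), the fixed cost shrinks as the payload grows:

```javascript
// Fixed per-record cost vs. payload size.
function overheadRatio(payloadBytes) {
  var overhead = 13 + 16 + 20 + 16;  // header + IV + MAC + worst-case padding
  return overhead / (payloadBytes + overhead);
}

console.log(overheadRatio(1000).toFixed(3)); // 0.061
console.log(overheadRatio(1400).toFixed(3)); // 0.044
```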

Rantanen commented 9 years ago

Oh. The Node v0.12 problem might be due to the key file?

server.pem:

-----BEGIN PRIVATE KEY-----
MIIEvQIBADANBgkqhkiG9w0BAQEFAASCBKcwggSjAgEAAoIBAQC3dz3nvtn/RGWU
cRcucfUk7qSm2Zlb2v1mtrrLLu4A5GVuNeG8ZObBNnsEXnoZNfJV/rmIPEndktH9
...
EKvW676x4lCHOGlWtNXpLym0HPm6qKkwW+3xSY/yqbLfUC1/iQaEDYXdcpcbQP1z
YI48pPDuxoMiqJDcd6qDdQM=
-----END PRIVATE KEY-----
-----BEGIN CERTIFICATE-----
MIIDXTCCAkWgAwIBAgIJAOzyLnW7y+GuMA0GCSqGSIb3DQEBCwUAMEUxCzAJBgNV
BAYTAkFVMRMwEQYDVQQIDApTb21lLVN0YXRlMSEwHwYDVQQKDBhJbnRlcm5ldCBX
...
7Br2pLaTuHUwu4sD60BchbLROdltyFrimpcD6T9QlWgVtVW/hpE90m8LmgVNEtg8
wg==
-----END CERTIFICATE-----

Try the following command:

openssl req -x509 -nodes -newkey rsa:2048 -keyout server.pem -out server.pem
Rantanen commented 9 years ago

And to add some data: I made the same test with raw UDP sockets, without DTLS:

wace@jubjub:~/projects/node-dtls$ node example/throughput_raw.js
Packets:    30808
Size:       1000 B
Time:       1500 ms
Throughput: 20057.291666666668 KB/s

So the current DTLS implementation drops the performance to around 5-7% of the maximum.
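
For what it's worth, both reported figures reduce the same way (packets × bytes / seconds / 1024), and the ratio lands in that 5-7% range:

```javascript
// Reproduce the throughput figures from the two runs above.
var dtlsKBs = 21363 * 1000 / 15 / 1024;   // DTLS run, 15 s
var rawKBs  = 30808 * 1000 / 1.5 / 1024;  // raw UDP run, 1.5 s

console.log(dtlsKBs);                                   // 1390.8203125
console.log(rawKBs);                                    // 20057.291666666668
console.log((100 * dtlsKBs / rawKBs).toFixed(1) + '%'); // 6.9%
```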

guymguym commented 9 years ago

OK the openssl command was the missing thing. thanks.

regarding UDP packet size - using a larger size than the Ethernet MTU makes the kernel do IP fragmentation, right? that way you reduce the user-space fragmentation work and system calls and push more work to the kernel.

it might be a bit risky though, since UDP does not negotiate the MTU, and if any hop on the path uses a lower MTU than the sender then packets will be dropped - but on some networks that will do just fine.

guymguym commented 9 years ago

every few runs it fails like this:

$ node example/throughput.js 
crypto.js:136
  this._handle.init(hmac, toBuf(key));
               ^
TypeError: Not a buffer
    at TypeError (native)
    at new Hmac (crypto.js:136:16)
    at Object.Hmac (crypto.js:134:12)
    at hmac_hash (~/node-dtls/prf.js:20:22)
    at a (~/node-dtls/prf.js:34:16)
    at p (~/node-dtls/prf.js:41:57)
    at ~/node-dtls/prf.js:74:12
    at SecurityParameters.init (~/node-dtls/SecurityParameters.js:78:39)
    at SecurityParameterContainer.changeCipher (~/node-dtls/SecurityParameterContainer.js:36:18)
    at DtlsRecordLayer.getPackets (~/node-dtls/DtlsRecordLayer.js:58:29)
Rantanen commented 9 years ago

Huh? That shouldn't happen. :D

Oh well. That's most likely part of the retransmission/reordering work. It never happened for me, but it could be network-stack related anyway. I think there's still a possible issue where the KeyExchange, ChangeCipher, and Finished messages change order in such a way that Finished is read before KeyExchange.

These failure scenarios are still very much a work in progress, so I'll be improving this when I get around to it. The general handshake reordering is already in place, but this issue happens in the record layer.
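
The general reordering idea is simple enough to sketch - buffer each message under its sequence number and deliver only in order (illustrative names, not the actual node-dtls implementation):

```javascript
// Buffer out-of-order messages by sequence number and deliver in order.
function ReorderBuffer(handler) {
  this.nextSeq = 0;   // next sequence number we are allowed to deliver
  this.pending = {};  // messages that arrived ahead of their turn
  this.handler = handler;
}

ReorderBuffer.prototype.receive = function (seq, msg) {
  this.pending[seq] = msg;
  // Deliver as long as the next expected message is available.
  while (this.pending[this.nextSeq] !== undefined) {
    this.handler(this.pending[this.nextSeq]);
    delete this.pending[this.nextSeq];
    this.nextSeq++;
  }
};
```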

guymguym commented 9 years ago

it actually happens to me quite a lot when I run the test over and over. it happens more often than it doesn't :)