Do you remember if there are any `new Throwable` left, if `Exchange.DEBUG` is false?
As far as I know there are only two `new Throwable` left, and both can be avoided by setting `Exchange.DEBUG` to false.
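For reference, the pattern under discussion is a debug-only stack capture; a minimal sketch (the flag wiring is illustrative, not Californium's exact code):

```java
// Minimal sketch of a debug-only caller capture. Creating a Throwable
// forces fillInStackTrace(), which is expensive on a hot path, so it is
// guarded: with DEBUG == false the capture is skipped entirely.
// The system-property wiring below is illustrative only.
public class DebugCapture {

    static final boolean DEBUG = Boolean.getBoolean("exchange.debug");

    private Throwable caller; // kept only to inspect "who called" while debugging

    void markCaller() {
        if (DEBUG) {
            caller = new Throwable("caller"); // records the current stack trace
        }
    }
}
```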
Some of the memory usage seems to be caused by over-use of `AtomicReference`.
I introduced them before I recognised that "more" synchronization is required.
Do you plan to push a commit with fewer `AtomicReference`s? Or do you prefer that I try to do it on my side? Then I can launch my tests.
I'm currently playing with that. I guess I will push at the beginning of next week.
The GC/heap consumption seems to be improved by using a smaller stripe object instead of the `Exchange` itself. The performance penalty of the `StripedExecutor` is still there.
See new commits in #551
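To make "a smaller stripe object and not the Exchange itself" concrete, here is a hedged sketch against kabutz's StripedExecutorService API (whose `StripedRunnable` is a `Runnable` that additionally exposes `getStripe()`); `stripeKeyOf(...)` and `process(...)` are hypothetical helpers, and imports are omitted because the package path depends on the copied version:

```java
// Sketch: submit work for an exchange keyed by a small stripe object
// rather than by the Exchange itself, so the executor's internal stripe
// map never retains the (large) Exchange.
void submit(final StripedExecutorService executor, final Exchange exchange) {
    final Object stripe = stripeKeyOf(exchange); // hypothetical: a small, stable key, e.g. the peer address

    executor.execute(new StripedRunnable() {

        @Override
        public Object getStripe() {
            return stripe; // the stripe map retains only this key
        }

        @Override
        public void run() {
            process(exchange); // hypothetical processing of the exchange
        }
    });
}
```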
I looked at the GitHub version of StripedExecutor; it uses an `IdentityHashMap` instead of a `HashMap`. Maybe this could have an impact, and in our case I think we want to use `==` for `Exchange`, not `equals`. (This remark is maybe obsolete with your last changes.)
I also began to discuss the performance issue with the creator of StripedExecutorService: https://github.com/kabutz/javaspecialists/issues
Tomorrow I will test the changes you made in #551.
I guess the `HashMap` is used because of Scandium, which uses `InetSocketAddress` as stripe.
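A small self-contained demo of that difference (nothing Californium-specific): two equal-but-distinct `InetSocketAddress` instances select the same stripe with a `HashMap`, but different stripes with an `IdentityHashMap`, which is why equals-based lookup matters for Scandium while `==` is what you would want for `Exchange` stripes:

```java
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

public class StripeMapDemo {

    public static void main(String[] args) {
        // Two distinct instances describing the same peer: equal, but not ==.
        InetSocketAddress a = new InetSocketAddress("192.168.0.1", 5684);
        InetSocketAddress b = new InetSocketAddress("192.168.0.1", 5684);

        Map<InetSocketAddress, String> byEquals = new HashMap<>();
        byEquals.put(a, "stripe-1");
        // HashMap uses equals(): both instances find the same stripe.
        System.out.println(byEquals.containsKey(b)); // true

        Map<InetSocketAddress, String> byIdentity = new IdentityHashMap<>();
        byIdentity.put(a, "stripe-1");
        // IdentityHashMap uses ==: the second instance misses the stripe.
        System.out.println(byIdentity.containsKey(b)); // false
    }
}
```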
My change only improves the heap/GC consumption; the execution time still gets a penalty for the stripes :-).
My last results for Test 1:
Interpretations: it confirms that "a smaller stripe object and not the Exchange itself" behaves better in terms of heap memory/GC behavior, and slightly better in terms of CPU and requests/second too.
Removing `StripedExecutor` shows that this globally behaves better in terms of performance (especially GC CPU), but it is still below the reference test.
Any further progress?
Nope, I tried several little changes without real impact. I plan to reproduce a test environment close to what we have in production to see how this behaves.
Any news?
FMPOV, having a "small" performance penalty (it will be even smaller if coaps, larger messages and/or deduplication are used) is a smaller problem than having leaks.
I'm working on it, but first tests seem to show that the performance penalty is limited. And I agree that fixing leaks is more valuable.
@sbernard31 Are you still on it? Or could it be closed?
We can close it. I will reopen it if needed.
Hi there, why was this closed? Was it fixed in new code? It seems we hit the same issue: the CPUs are not busy, and lots of DTLS-Connection-Handler threads are waiting for tasks.
Cpu: 6.7, threadId: 310827, "DTLS-Connection-Handler-2" #120 daemon prio=5 os_prio=0 tid=0x00000000016f2000 nid=0x4be2b waiting on condition [0x00007f8b4e6e7000]
Cpu: 6.7, threadId: 310828, "DTLS-Connection-Handler-3" #121 daemon prio=5 os_prio=0 tid=0x00000000016f5000 nid=0x4be2c runnable [0x00007f8b4e5e6000]
Cpu: 6.7, threadId: 310832, "DTLS-Connection-Handler-7" #125 daemon prio=5 os_prio=0 tid=0x00000000016fc800 nid=0x4be30 waiting on condition [0x00007f8b4e1e2000]
Cpu: 6.7, threadId: 310834, "DTLS-Connection-Handler-9" #127 daemon prio=5 os_prio=0 tid=0x0000000001700800 nid=0x4be32 waiting on condition [0x00007f8b4dfe0000]
Cpu: 6.7, threadId: 310842, "DTLS-Connection-Handler-16" #135 daemon prio=5 os_prio=0 tid=0x000000000170e800 nid=0x4be3a waiting on condition [0x00007f8b4d7d8000]
Cpu: 6.7, threadId: 310844, "DTLS-Connection-Handler-18" #137 daemon prio=5 os_prio=0 tid=0x0000000001712000 nid=0x4be3c waiting on condition [0x00007f8b4d5d6000]
Cpu: 6.7, threadId: 310846, "DTLS-Connection-Handler-20" #139 daemon prio=5 os_prio=0 tid=0x0000000001715800 nid=0x4be3e waiting on condition [0x00007f8b4d3d4000]
Cpu: 6.7, threadId: 310849, "DTLS-Connection-Handler-23" #142 daemon prio=5 os_prio=0 tid=0x000000000171b000 nid=0x4be41 waiting on condition [0x00007f8b4d0d1000]
Cpu: 6.7, threadId: 310852, "DTLS-Connection-Handler-26" #145 daemon prio=5 os_prio=0 tid=0x0000000001720000 nid=0x4be44 waiting on condition [0x00007f8b4cdce000]
Cpu: 6.7, threadId: 310865, "DTLS-Connection-Handler-39" #158 daemon prio=5 os_prio=0 tid=0x0000000001738800 nid=0x4be51 waiting on condition [0x00007f8b47ffe000]
Cpu: 6.7, threadId: 310868, "DTLS-Connection-Handler-42" #161 daemon prio=5 os_prio=0 tid=0x000000000173e800 nid=0x4be54 waiting on condition [0x00007f8b47cfb000]
Maybe you can give more information about your setup?
The benchmark results above are from plain coap, so it's no wonder that the "DTLS-Connection-Handler" threads are waiting; they just have nothing to do.
@boaks is right, your issue is not linked to these performance tests.
You should open a new issue and give us more information to help us understand (e.g. a thread dump, to see the stack traces and on which condition the threads are blocked).
Hi boaks, thanks for the quick response. We benchmark a CoAP server which supports both coap and coaps on two different ports, on a server with 8 vCPUs, and simulate about 1000 CoAP clients with JMeter (actually, no matter how many; we have excluded the clients as the cause of this issue); the server responds with a hello immediately. The throughput for coap is around 40K but coaps is around 16K, and we found many UDP packets were dropped on the server, so it seems the CoAP server cannot handle the messages as fast as expected. We then found most of the DTLS-Connection-Handler threads are free. We checked the code, and we suspect the bottleneck is the StripedExecutorService, which introduces a global lock, as reported by sbernard31 before (https://github.com/kabutz/javaspecialists/issues/1), but we can't be sure.
How many "ConnectionThread"s do you use for the DTLSConnector?
It's the default, so it should be 6 * 8 CPUs = 48.
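If the thread count itself is a suspect, it can be tuned; a hedged sketch, assuming the Scandium version in use exposes `setConnectionThreadCount` on `DtlsConnectorConfig.Builder` (builder construction is elided, since its signature changed across versions):

```java
// Hedged sketch: lower the DTLS connection handler thread count from the
// default (6 * available processors, i.e. 48 on 8 vCPUs) to see whether
// fewer, busier threads change the throughput picture. Assumes
// DtlsConnectorConfig.Builder#setConnectionThreadCount exists in the
// Scandium version in use; "builder" construction is elided.
builder.setConnectionThreadCount(2 * Runtime.getRuntime().availableProcessors());
DTLSConnector connector = new DTLSConnector(builder.build());
```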
So let me point out: this issue was created to check if the introduction of the striped exchange execution (PR #590) causes an unacceptable resource/performance penalty. The heap consumption was optimized easily (based on Simon's first results), but for the CPU penalty of the striped execution I didn't find such an easy optimization (my guess is that the penalty is not caused by the striped lock, but I have no analysis on that). So our conclusion was that the penalty is acceptable (also because nobody was able to spend time implementing alternatives).
For a more general performance discussion I updated the demos and pushed the current benchmark setup to PR #607.
Using that, I start the cf-extplugtest-server/ExtendedTestServer with:
-d64 -Xmx4g -XX:+UseG1GC -onlyDtlsLoopback -noPlugtest
and the cf-extplugtest-client/BenchmarkClient with:
-d64 -Xmx4g -XX:+UseG1GC coaps://localhost:5784/benchmark?rlen=200 1000 5000
My results are:
1000 benchmark clients finished.
5000000 requests sent, 5000000 expected
5000000 requests in 292066ms, 17119 reqs/s
2172 retransmissions (0,04%)
My CPU monitor shows that the CPU is almost fully busy, so I don't see any principal "bottlenecks" caused by "waiting threads". To improve that particular scenario, it would be required to find something "busy waiting", but I hope this is not the case.
So maybe you can adapt your benchmark setup closer to PR #607 and we will see if something can be found to improve the performance. But I agree with @sbernard31 that this general issue is not the place for your particular problem. So if you want to provide your results, please open another issue.
@joechen2010 Did you try my experimental "lockless" implementation of StripedExecutor? Personally, I didn't find any positive impact in my tests; that's why I left it aside for now :/
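For context, one common way to stripe without a global lock is to hash each stripe onto a fixed set of single-threaded executors; a hypothetical sketch of the general pattern (explicitly not the experimental implementation mentioned above):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical lockless striping: tasks with the same stripe always land
// on the same single-threaded lane, so they are serialized without any
// global lock. Unrelated stripes may share a lane (head-of-line blocking),
// which is the trade-off against a map-based striped executor.
public class HashedStripedExecutor {

    private final ExecutorService[] lanes;

    public HashedStripedExecutor(int concurrency) {
        lanes = new ExecutorService[concurrency];
        for (int i = 0; i < concurrency; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    public void execute(Object stripe, Runnable task) {
        // hashCode() gives equals-based striping (e.g. InetSocketAddress);
        // use System.identityHashCode(stripe) for ==-based striping.
        int h = stripe.hashCode();
        h ^= (h >>> 16); // spread the bits before taking the modulo
        lanes[Math.floorMod(h, lanes.length)].execute(task);
    }

    public void shutdown() {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
        }
    }
}
```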
Thanks for your comments. @sbernard31 Yes, we tried that, but nothing seems to change. @boaks We will try that setup later.
Hi, we found the main performance penalty is decrypting the message in CCMBlockCipher. The cipher is initialized for every message, and it's a synchronized action; it should be moved to the DTLSSession when it is created.
Cipher cipher = Cipher.getInstance(CIPHER_NAME);
cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, KEY_TYPE));
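One way to avoid the per-message `Cipher.getInstance(...)` would be a per-thread cache; a hedged sketch of that idea (caching in the `DTLSSession`, as suggested above, would work similarly per session; `CIPHER_NAME`/`KEY_TYPE` stand for the constants from the snippet above, with placeholder values):

```java
import java.security.GeneralSecurityException;

import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

// Hedged sketch: create the Cipher once per thread and reuse it, instead
// of calling Cipher.getInstance(...) per message. Cipher instances are not
// thread safe, so a ThreadLocal avoids both the repeated provider lookup
// and cross-thread contention.
public class CipherCache {

    static final String CIPHER_NAME = "AES/ECB/NoPadding"; // placeholder value
    static final String KEY_TYPE = "AES";                   // placeholder value

    private static final ThreadLocal<Cipher> CIPHER = new ThreadLocal<Cipher>() {
        @Override
        protected Cipher initialValue() {
            try {
                return Cipher.getInstance(CIPHER_NAME); // done once per thread
            } catch (GeneralSecurityException e) {
                throw new IllegalStateException(e);
            }
        }
    };

    static Cipher cipherFor(byte[] key) throws GeneralSecurityException {
        Cipher cipher = CIPHER.get();
        // re-keying is cheap compared to getInstance(); done per message
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, KEY_TYPE));
        return cipher;
    }
}
```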
This issue aims to gather discussion/tests about the performance analysis of #551.
Test 1
Protocol: testing `cf-benchmark` server behavior (as it did not change in #551) and seeing how modifications impact it, using `cf-extplugtest-client` as client (I always use the same version of the client, commit 4e868560a22e9052630f94289ec240aa4c9b4c96; only the version of the server changes). For each run, I launched the client command 3 times in a row. (For the VisualVM chart, I terminate by performing a manual GC.)
Results:
Using cf-benchmark before #551. (commit 4e868560a22e9052630f94289ec240aa4c9b4c96)
```
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
354436 requests (35443 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
808318 requests (45388 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 24445ms, 40908 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 24445ms, 0 notifies/s
249 clients with 1983 to 1999 requests.
250 clients with 2000 to 2019 requests.
1 clients with 2020 requests.
===========================================================
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
360784 requests (36078 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
835320 requests (47453 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 23803ms, 42011 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 23803ms, 0 notifies/s
246 clients with 1981 to 1999 requests.
250 clients with 2000 to 2019 requests.
4 clients with 2020 to 2027 requests.
=============================================================
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
397611 requests (39761 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
858955 requests (46134 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 23790ms, 42034 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 23790ms, 0 notifies/s
6 clients with 1978 to 1979 requests.
270 clients with 1980 to 1999 requests.
194 clients with 2000 to 2019 requests.
23 clients with 2020 to 2037 requests.
7 clients with 2042 to 2059 requests
```
Using cf-benchmark from #551 (commit 54828e74ba42491f5d544480ece89123cde022a5, rebased on 4e868560a22e9052630f94289ec240aa4c9b4c96 to benefit from the maven log fix)
```
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
321568 requests (32156 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
705669 requests (38410 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 27755ms, 36029 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 27755ms, 0 notifies/s
265 clients with 1981 to 1999 requests.
218 clients with 2000 to 2019 requests.
17 clients with 2020 to 2036 requests.
===================================
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
272947 requests (27294 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
585271 requests (31232 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
933816 requests (34854 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 32561ms, 30711 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 32561ms, 0 notifies/s
1 clients with 1978 requests.
259 clients with 1983 to 1999 requests.
228 clients with 2000 to 2019 requests.
12 clients with 2020 to 2033 requests.
===============================================
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
330138 requests (33013 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
698271 requests (36813 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 29033ms, 34443 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 29033ms, 0 notifies/s
243 clients with 1982 to 1999 requests.
254 clients with 2000 to 2018 requests.
3 clients with 2020 to 2027 requests.
```
Using cf-benchmark from #551 removing all `new Throwable`
```
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
275643 requests (27564 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
717956 requests (44231 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 27212ms, 36748 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 27212ms, 0 notifies/s
243 clients with 1982 to 1999 requests.
256 clients with 2000 to 2019 requests.
1 clients with 2026 requests.
================================
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
335922 requests (33592 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
734060 requests (39813 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 27296ms, 36635 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 27296ms, 0 notifies/s
248 clients with 1982 to 1999 requests.
247 clients with 2000 to 2019 requests.
5 clients with 2021 to 2033 requests.
===============================
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
317951 requests (31795 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
713891 requests (39594 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 27423ms, 36465 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 27423ms, 0 notifies/s
275 clients with 1980 to 1999 requests.
199 clients with 2000 to 2019 requests.
25 clients with 2020 to 2036 requests.
1 clients with 2046 requests.
```
Interpretations: we note more CPU activity and especially more GC activity. It seems we also allocate more heap space. The number of requests/second seems lower too.
Removing the call to `new Throwable` seems to improve it a bit, but this is still not as smooth as before.