Do you remember if there are any `new Throwable` left, if `Exchange.DEBUG` is false?
As far as I know there are only two `new Throwable` left, and both can be avoided by setting `Exchange.DEBUG` to false.
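For reference, the pattern under discussion is a debug-only stack capture; a minimal sketch (the flag wiring is illustrative, not Californium's exact code):

```java
// Minimal sketch of a debug-only caller capture. Creating a Throwable
// forces fillInStackTrace(), which is expensive on a hot path, so it is
// guarded: with DEBUG == false the capture is skipped entirely.
// The system-property wiring below is illustrative only.
public class DebugCapture {

    static final boolean DEBUG = Boolean.getBoolean("exchange.debug");

    private Throwable caller; // kept only to inspect "who called" while debugging

    void markCaller() {
        if (DEBUG) {
            caller = new Throwable("caller"); // records the current stack trace
        }
    }
}
```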
Some of the memory usage seems to be caused by over-use of `AtomicReference`.
I introduced them before I recognised that "more" synchronization is required.
Do you plan to push a commit with fewer `AtomicReference`s? Or do you prefer that I try to do it on my side? Then I can launch my tests.
I'm currently playing with that. I guess I will push at the beginning of next week.
The GC/heap consumption seems to be improved by using a smaller stripe object instead of the `Exchange` itself. The performance penalty of the `StripedExecutor` is still there.
See new commits in #551
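To make "a smaller stripe object and not the Exchange itself" concrete, here is a hedged sketch against kabutz's StripedExecutorService API (whose `StripedRunnable` is a `Runnable` that additionally exposes `getStripe()`); `stripeKeyOf(...)` and `process(...)` are hypothetical helpers, and imports are omitted because the package path depends on the copied version:

```java
// Sketch: submit work for an exchange keyed by a small stripe object
// rather than by the Exchange itself, so the executor's internal stripe
// map never retains the (large) Exchange.
void submit(final StripedExecutorService executor, final Exchange exchange) {
    final Object stripe = stripeKeyOf(exchange); // hypothetical: a small, stable key, e.g. the peer address

    executor.execute(new StripedRunnable() {

        @Override
        public Object getStripe() {
            return stripe; // the stripe map retains only this key
        }

        @Override
        public void run() {
            process(exchange); // hypothetical processing of the exchange
        }
    });
}
```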
I looked at the GitHub version of StripedExecutor; it uses an `IdentityHashMap` instead of a `HashMap`. Maybe this could have an impact, and in our case I think we want to use `==` for `Exchange`, not `equals`. (This remark is maybe obsolete with your last changes.)
I also began to discuss the performance issue with the creator of StripedExecutorService: https://github.com/kabutz/javaspecialists/issues
Tomorrow I will test the changes you made in #551.
I guess the `HashMap` is used because of Scandium, which uses `InetSocketAddress` as stripe.
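A small self-contained demo of that difference (nothing Californium-specific): two equal-but-distinct `InetSocketAddress` instances select the same stripe with a `HashMap`, but different stripes with an `IdentityHashMap`, which is why equals-based lookup matters for Scandium while `==` is what you would want for `Exchange` stripes:

```java
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

public class StripeMapDemo {

    public static void main(String[] args) {
        // Two distinct instances describing the same peer: equal, but not ==.
        InetSocketAddress a = new InetSocketAddress("192.168.0.1", 5684);
        InetSocketAddress b = new InetSocketAddress("192.168.0.1", 5684);

        Map<InetSocketAddress, String> byEquals = new HashMap<>();
        byEquals.put(a, "stripe-1");
        // HashMap uses equals(): both instances find the same stripe.
        System.out.println(byEquals.containsKey(b)); // true

        Map<InetSocketAddress, String> byIdentity = new IdentityHashMap<>();
        byIdentity.put(a, "stripe-1");
        // IdentityHashMap uses ==: the second instance misses the stripe.
        System.out.println(byIdentity.containsKey(b)); // false
    }
}
```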
My change only improves the heap/GC consumption; the execution time still gets a penalty for the stripes :-).
My last results for Test 1:
Interpretations: it confirms that "a smaller stripe object and not the Exchange itself" behaves better in terms of heap memory/GC behavior, and slightly better in terms of CPU and requests/second too.
Removing `StripedExecutor` shows that this globally behaves better in terms of performance (especially GC CPU), but it is still below the reference test.
Any further progress?
Nope, I tried several little changes without real impact. I plan to reproduce a test environment close to what we have in production to see how this behaves.
Any news?
FMPOV, having a "small" performance penalty (it will be even smaller if coaps, larger messages and/or deduplication are used) is a smaller problem than having leaks.
I'm working on it, but first tests seem to show that the performance penalty is limited. And I agree that fixing leaks is more valuable.
@sbernard31 Are you still on it? Or could it be closed?
We can close it. I will reopen it if needed.
Hi there, why was this closed? Was it fixed in new code? It seems we hit the same issue: the CPUs are not busy, and lots of DTLS-Connection-Handler threads are waiting for tasks.
Cpu: 6.7, threadId: 310827, "DTLS-Connection-Handler-2" #120 daemon prio=5 os_prio=0 tid=0x00000000016f2000 nid=0x4be2b waiting on condition [0x00007f8b4e6e7000]
Cpu: 6.7, threadId: 310828, "DTLS-Connection-Handler-3" #121 daemon prio=5 os_prio=0 tid=0x00000000016f5000 nid=0x4be2c runnable [0x00007f8b4e5e6000]
Cpu: 6.7, threadId: 310832, "DTLS-Connection-Handler-7" #125 daemon prio=5 os_prio=0 tid=0x00000000016fc800 nid=0x4be30 waiting on condition [0x00007f8b4e1e2000]
Cpu: 6.7, threadId: 310834, "DTLS-Connection-Handler-9" #127 daemon prio=5 os_prio=0 tid=0x0000000001700800 nid=0x4be32 waiting on condition [0x00007f8b4dfe0000]
Cpu: 6.7, threadId: 310842, "DTLS-Connection-Handler-16" #135 daemon prio=5 os_prio=0 tid=0x000000000170e800 nid=0x4be3a waiting on condition [0x00007f8b4d7d8000]
Cpu: 6.7, threadId: 310844, "DTLS-Connection-Handler-18" #137 daemon prio=5 os_prio=0 tid=0x0000000001712000 nid=0x4be3c waiting on condition [0x00007f8b4d5d6000]
Cpu: 6.7, threadId: 310846, "DTLS-Connection-Handler-20" #139 daemon prio=5 os_prio=0 tid=0x0000000001715800 nid=0x4be3e waiting on condition [0x00007f8b4d3d4000]
Cpu: 6.7, threadId: 310849, "DTLS-Connection-Handler-23" #142 daemon prio=5 os_prio=0 tid=0x000000000171b000 nid=0x4be41 waiting on condition [0x00007f8b4d0d1000]
Cpu: 6.7, threadId: 310852, "DTLS-Connection-Handler-26" #145 daemon prio=5 os_prio=0 tid=0x0000000001720000 nid=0x4be44 waiting on condition [0x00007f8b4cdce000]
Cpu: 6.7, threadId: 310865, "DTLS-Connection-Handler-39" #158 daemon prio=5 os_prio=0 tid=0x0000000001738800 nid=0x4be51 waiting on condition [0x00007f8b47ffe000]
Cpu: 6.7, threadId: 310868, "DTLS-Connection-Handler-42" #161 daemon prio=5 os_prio=0 tid=0x000000000173e800 nid=0x4be54 waiting on condition [0x00007f8b47cfb000]
Maybe you can give more information about your setup?
The benchmark results above are from plain coap, so it's no wonder that the "DTLS-Connection-Handler" threads are waiting; they just have nothing to do.
@boaks is right, your issue is not linked to these performance tests.
You should open a new issue and give us more information to help us understand (e.g. a thread dump, to see the stack traces and on which condition the threads are blocked).
Hi boaks, thanks for the quick response. We benchmark a CoAP server which supports both coap and coaps on two different ports, on a server with 8 vCPUs, and simulate about 1000 CoAP clients with JMeter (actually, no matter how many; we have excluded the clients as the cause of this issue); the server responds with a hello immediately. The throughput for coap is around 40K but coaps is around 16K, and we found many UDP packets were dropped on the server, so it seems the CoAP server cannot handle the messages as fast as expected. We then found most of the DTLS-Connection-Handler threads are free. We checked the code, and we suspect the bottleneck is the StripedExecutorService, which introduces a global lock, as reported by sbernard31 before (https://github.com/kabutz/javaspecialists/issues/1), but we can't be sure.
How many "ConnectionThread"s do you use for the DTLSConnector?
It's the default, so it should be 6 * 8 CPUs = 48.
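If the thread count itself is a suspect, it can be tuned; a hedged sketch, assuming the Scandium version in use exposes `setConnectionThreadCount` on `DtlsConnectorConfig.Builder` (builder construction is elided, since its signature changed across versions):

```java
// Hedged sketch: lower the DTLS connection handler thread count from the
// default (6 * available processors, i.e. 48 on 8 vCPUs) to see whether
// fewer, busier threads change the throughput picture. Assumes
// DtlsConnectorConfig.Builder#setConnectionThreadCount exists in the
// Scandium version in use; "builder" construction is elided.
builder.setConnectionThreadCount(2 * Runtime.getRuntime().availableProcessors());
DTLSConnector connector = new DTLSConnector(builder.build());
```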
So let me point out: this issue was created to check if the introduction of the striped exchange execution (PR #590) causes an unacceptable resource/performance penalty. The heap consumption was optimized easily (based on Simon's first results), but for the CPU penalty of the striped execution I didn't find such an easy optimization (my guess is that the penalty is not caused by the striped lock, but I have no analysis on that). So our conclusion was that the penalty is acceptable (also because nobody was able to spend time implementing alternatives).
For a more general performance discussion I updated the demos and pushed the current benchmark setup to PR #607.
Using that, I start the cf-extplugtest-server/ExtendedTestServer with:
-d64 -Xmx4g -XX:+UseG1GC -onlyDtlsLoopback -noPlugtest
and the cf-extplugtest-client/BenchmarkClient with:
-d64 -Xmx4g -XX:+UseG1GC coaps://localhost:5784/benchmark?rlen=200 1000 5000
My results are:
1000 benchmark clients finished.
5000000 requests sent, 5000000 expected
5000000 requests in 292066ms, 17119 reqs/s
2172 retransmissions (0,04%)
My CPU monitor shows that the CPU is almost fully busy, so I don't see any principal "bottlenecks" caused by "waiting threads". To improve that particular scenario, it would be required to find something "busy waiting", but I hope this is not the case.
So maybe you can adapt your benchmark setup closer to PR #607 and we will see if something can be found to improve the performance. But I agree with @sbernard31 that this general issue is not the place for your particular problem. So if you want to provide your results, please open another issue.
@joechen2010 Did you try my experimental "lockless" implementation of StripedExecutor? Personally, I didn't find any positive impact in my tests; that's why I left it aside for now :/
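For context, one common way to stripe without a global lock is to hash each stripe onto a fixed set of single-threaded executors; a hypothetical sketch of the general pattern (explicitly not the experimental implementation mentioned above):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical lockless striping: tasks with the same stripe always land
// on the same single-threaded lane, so they are serialized without any
// global lock. Unrelated stripes may share a lane (head-of-line blocking),
// which is the trade-off against a map-based striped executor.
public class HashedStripedExecutor {

    private final ExecutorService[] lanes;

    public HashedStripedExecutor(int concurrency) {
        lanes = new ExecutorService[concurrency];
        for (int i = 0; i < concurrency; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    public void execute(Object stripe, Runnable task) {
        // hashCode() gives equals-based striping (e.g. InetSocketAddress);
        // use System.identityHashCode(stripe) for ==-based striping.
        int h = stripe.hashCode();
        h ^= (h >>> 16); // spread the bits before taking the modulo
        lanes[Math.floorMod(h, lanes.length)].execute(task);
    }

    public void shutdown() {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
        }
    }
}
```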
Thanks for your comments. @sbernard31 Yes, we tried that, but nothing seems to change. @boaks We will try that setup later.
Hi, we found the main performance penalty is decrypting the message in CCMBlockCipher. The cipher is initialized for every message, and it's a synchronized action; it should be moved to the DTLSSession when it is created.
Cipher cipher = Cipher.getInstance(CIPHER_NAME);
cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, KEY_TYPE));
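One way to avoid the per-message `Cipher.getInstance(...)` would be a per-thread cache; a hedged sketch of that idea (caching in the `DTLSSession`, as suggested above, would work similarly per session; `CIPHER_NAME`/`KEY_TYPE` stand for the constants from the snippet above, with placeholder values):

```java
import java.security.GeneralSecurityException;

import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

// Hedged sketch: create the Cipher once per thread and reuse it, instead
// of calling Cipher.getInstance(...) per message. Cipher instances are not
// thread safe, so a ThreadLocal avoids both the repeated provider lookup
// and cross-thread contention.
public class CipherCache {

    static final String CIPHER_NAME = "AES/ECB/NoPadding"; // placeholder value
    static final String KEY_TYPE = "AES";                   // placeholder value

    private static final ThreadLocal<Cipher> CIPHER = new ThreadLocal<Cipher>() {
        @Override
        protected Cipher initialValue() {
            try {
                return Cipher.getInstance(CIPHER_NAME); // done once per thread
            } catch (GeneralSecurityException e) {
                throw new IllegalStateException(e);
            }
        }
    };

    static Cipher cipherFor(byte[] key) throws GeneralSecurityException {
        Cipher cipher = CIPHER.get();
        // re-keying is cheap compared to getInstance(); done per message
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, KEY_TYPE));
        return cipher;
    }
}
```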
This issue aims to gather discussion/tests about the performance analysis of #551.
Test 1
Protocol: testing `cf-benchmark` server behavior (as it did not change in #551) and seeing how modifications impact it, using `cf-extplugtest-client` as client (I always use the same version of the client, commit 4e868560a22e9052630f94289ec240aa4c9b4c96; only the version of the server changes). For each run, I launched the client command 3 times in a row. (For the VisualVM chart, I terminate by performing a manual GC.)
Results:
Using cf-benchmark before #551. (commit 4e868560a22e9052630f94289ec240aa4c9b4c96)
```
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
354436 requests (35443 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
808318 requests (45388 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 24445ms, 40908 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 24445ms, 0 notifies/s
249 clients with 1983 to 1999 requests.
250 clients with 2000 to 2019 requests.
1 clients with 2020 requests.
===========================================================
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
360784 requests (36078 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
835320 requests (47453 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 23803ms, 42011 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 23803ms, 0 notifies/s
246 clients with 1981 to 1999 requests.
250 clients with 2000 to 2019 requests.
4 clients with 2020 to 2027 requests.
=============================================================
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
397611 requests (39761 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
858955 requests (46134 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 23790ms, 42034 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 23790ms, 0 notifies/s
6 clients with 1978 to 1979 requests.
270 clients with 1980 to 1999 requests.
194 clients with 2000 to 2019 requests.
23 clients with 2020 to 2037 requests.
7 clients with 2042 to 2059 requests
```
Using cf-benchmark from #551 (commit 54828e74ba42491f5d544480ece89123cde022a5, rebased on 4e868560a22e9052630f94289ec240aa4c9b4c96 to benefit from the maven log fix)
```
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
321568 requests (32156 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
705669 requests (38410 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 27755ms, 36029 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 27755ms, 0 notifies/s
265 clients with 1981 to 1999 requests.
218 clients with 2000 to 2019 requests.
17 clients with 2020 to 2036 requests.
===================================
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
272947 requests (27294 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
585271 requests (31232 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
933816 requests (34854 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 32561ms, 30711 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 32561ms, 0 notifies/s
1 clients with 1978 requests.
259 clients with 1983 to 1999 requests.
228 clients with 2000 to 2019 requests.
12 clients with 2020 to 2033 requests.
===============================================
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
330138 requests (33013 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
698271 requests (36813 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 29033ms, 34443 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 29033ms, 0 notifies/s
243 clients with 1982 to 1999 requests.
254 clients with 2000 to 2018 requests.
3 clients with 2020 to 2027 requests.
```
Using cf-benchmark from #551 removing all `new Throwable`
```
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
275643 requests (27564 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
717956 requests (44231 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 27212ms, 36748 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 27212ms, 0 notifies/s
243 clients with 1982 to 1999 requests.
256 clients with 2000 to 2019 requests.
1 clients with 2026 requests.
================================
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
335922 requests (33592 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
734060 requests (39813 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 27296ms, 36635 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 27296ms, 0 notifies/s
248 clients with 1982 to 1999 requests.
247 clients with 2000 to 2019 requests.
5 clients with 2021 to 2033 requests.
===============================
Create 500 benchmark clients, expect to send 1000000 request overall to coap://localhost:5683/benchmark
Benchmark clients created. Benchmark started.
317951 requests (31795 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
713891 requests (39594 reqs/s, 0 retransmissions (0,00%), 0 transmission errors (0,00%), 500 clients)
Benchmark clients finished. 1000000 requests sent, 1000000 expected
1000000 requests in 27423ms, 36465 reqs/s
0 notifies sent, 0 expected, 0 observe requests
0 notifies in 27423ms, 0 notifies/s
275 clients with 1980 to 1999 requests.
199 clients with 2000 to 2019 requests.
25 clients with 2020 to 2036 requests.
1 clients with 2046 requests.
```
Interpretations: we note more CPU activity and especially more GC activity. It seems we also allocate more heap space. The number of requests/second seems lower too.
Removing the call to `new Throwable` seems to improve it a bit, but this is still not as smooth as before.