Closed. Roiocam closed this issue 10 months ago.
No need for skepticism: after a thorough investigation, I found that akka-persistence-jdbc/slick does not even use the driver's executeBatch, and the oldest node unexpectedly had longer IO times (they are all running in Kubernetes, as Pods of the same Deployment).
After investigation, I found that once a persistent actor's events are tagged, the latency of persisting to the database increases severely; on some nodes it can be 10x. The reason the DAO degrades is that when an event has a tag, it falls back to a per-event (foreach) insert instead of a batch insert.
According to the source code, batch-inserting events that carry tags will always cause this performance degradation.
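To illustrate the difference, here is a simplified sketch (a reduced, assumed schema and my own illustration, not the plugin's actual code): untagged events can go out as one batched ++= insert, while tagged events need the generated ordering back to link the tag rows, so each row becomes its own insert-returning round trip.

```scala
import slick.jdbc.MySQLProfile.api._
import scala.concurrent.{ExecutionContext, Future}

// Reduced journal row: (ordering, persistenceId, payload).
class JournalRows(_tableTag: Tag)
    extends Table[(Long, String, Array[Byte])](_tableTag, "event_journal") {
  def ordering      = column[Long]("ordering", O.AutoInc, O.PrimaryKey)
  def persistenceId = column[String]("persistence_id")
  def payload       = column[Array[Byte]]("event_payload")
  def * = (ordering, persistenceId, payload)
}

object TaggedVsUntagged {
  val journal = TableQuery[JournalRows]

  // Untagged path: all rows are sent as a single batched insert.
  def insertUntagged(rows: Seq[(Long, String, Array[Byte])], db: Database): Future[Option[Int]] =
    db.run(journal ++= rows)

  // Tagged path: each row is inserted with "returning" so the generated
  // ordering can be used to link the tag rows, i.e. one round trip per event.
  def insertTagged(rows: Seq[(Long, String, Array[Byte])], db: Database)(
      implicit ec: ExecutionContext): Future[Seq[Long]] =
    db.run(DBIO.sequence(rows.map(r => (journal returning journal.map(_.ordering)) += r)))
}
```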
In my case it happened randomly, and I verified the problem in both cluster and stand-alone setups.
Generally speaking, once the traffic given to akka-persistence-jdbc exceeds its instantaneous processing capacity, writes start to be batched, and akka-persistence-jdbc starts to degrade. In my application, due to sharding (my guess), some traffic may be delayed before reaching other nodes, so a node receives both HTTP load-balanced traffic and traffic forwarded by cluster sharding at the same time. This creates traffic peaks, which in turn trigger the degradation of akka-persistence-jdbc.
I think this is related to #592.
@Roiocam, thanks for reporting this.
I'm curious to see the results of the same load test after adding the PK/FK as described in #592.
Yep, I'm working on it.
I forked this repository, changed the source code based on #592, packaged it, and uploaded it to our Maven repository (I am not familiar with Scala; sbt publish blocked me for most of the time, and I finally solved it with maven deploy-file).
I am using the same load test for this, but with more cases:
Case 1: hardware: 4c8g (2 nodes, with profiler); persist actor num: 1000; throughput: 750
Case 2: hardware: 4c8g (2 nodes, with profiler); persist actor num: 1000; concurrency: 750
Case 3: hardware: 4c8g (4 nodes, 2 of them with profiler); persist actor num: 1000; throughput: 4000
Case 4: hardware: 4c8g (4 nodes, no profiler); persist actor num: 4000; throughput: 4000

In case 3, the profiler CPU hotspot report tells us that #592 can solve this problem!
And my application monitoring proves the same conclusion.
"Slice Executor" duration metrics in all tests (no more peaks under load):
After fixing the batch insertion problem, my application throughput increased from 1000 to 4000 (Kryo serialization is the next bottleneck now).
Finally, this is my change based on #592. I am not familiar with Scala, so maybe I am missing some blind spot. One more thing: in my tests I did not use projections.
In my opinion, foreach-insert performance is bad; we should avoid using anything like insert-and-return.
PR showing my change: #731
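To make the direction of the change concrete, here is a minimal sketch of the idea as I understand it (an assumed schema of my own, not the actual #731 patch): if the tag table is keyed by (persistence_id, sequence_number) instead of the auto-generated ordering, no generated key has to be read back, and both tables can be written with plain batched inserts in one transaction.

```scala
import slick.jdbc.MySQLProfile.api._
import scala.concurrent.{ExecutionContext, Future}

final case class EventRow(persistenceId: String, sequenceNumber: Long, payload: Array[Byte])
final case class TagRow(persistenceId: String, sequenceNumber: Long, tag: String)

class Events(t: Tag) extends Table[EventRow](t, "event_journal") {
  def persistenceId  = column[String]("persistence_id")
  def sequenceNumber = column[Long]("sequence_number")
  def payload        = column[Array[Byte]]("event_payload")
  def pk = primaryKey("event_journal_pk", (persistenceId, sequenceNumber))
  def * = (persistenceId, sequenceNumber, payload).mapTo[EventRow]
}

class EventTags(t: Tag) extends Table[TagRow](t, "event_tag") {
  def persistenceId  = column[String]("persistence_id")
  def sequenceNumber = column[Long]("sequence_number")
  def tag            = column[String]("tag")
  def * = (persistenceId, sequenceNumber, tag).mapTo[TagRow]
}

object BatchedWrite {
  val events = TableQuery[Events]
  val tags   = TableQuery[EventTags]

  // Both inserts are batched; no per-row insert-returning round trips.
  def writeAll(eventRows: Seq[EventRow], tagRows: Seq[TagRow], db: Database)(
      implicit ec: ExecutionContext): Future[Unit] =
    db.run((for {
      _ <- events ++= eventRows
      _ <- tags   ++= tagRows
    } yield ()).transactionally)
}
```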
Recently I have been doing load testing on my application. It does not perform well, and cannot even reach the bottleneck of the database.
I have some questions about the insert model:
In akka.persistence.jdbc.journal.dao.DefaultJournalDao, it uses the multi-insert action ++=:
https://github.com/akka/akka-persistence-jdbc/blob/bba266cc0b0d398e0b16fa0d1347311f51116b48/core/src/main/scala/akka/persistence/jdbc/journal/dao/DefaultJournalDao.scala#L45
I then took some time to find the code that actually executes the SQL. The multi-insert action is implemented by MultiInsertAction in slick.jdbc.JdbcActionComponent, and it basically just uses java.sql.Statement#executeBatch. As far as I know, this method does not optimize multi-row inserts much (most JDBC drivers do not): they simply loop over every parameter set, fill it into the statement, and then loop again to execute it, which costs a huge amount of network IO (the bottleneck).
The MySQL documentation points out how to optimize this: https://dev.mysql.com/doc/refman/8.0/en/insert-optimization.html
In general, most databases support inserting a list of values in one statement, which is just one network round trip.
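To make the difference concrete, here is a rough plain-JDBC sketch (assuming a table named event_journal with a single payload column; not akka-persistence-jdbc code). executeBatch is typically run by the driver as one statement per parameter set, while a single multi-row VALUES insert is one round trip. For MySQL Connector/J, setting rewriteBatchedStatements=true on the connection URL also makes the driver rewrite batches into the multi-row form.

```scala
import java.sql.Connection

object InsertStyles {
  // Batched prepared statement: most drivers still execute one INSERT per
  // parameter set, so this costs roughly one round trip per row.
  def perRowBatch(conn: Connection, payloads: Seq[Array[Byte]]): Unit = {
    val ps = conn.prepareStatement("INSERT INTO event_journal (event_payload) VALUES (?)")
    try {
      payloads.foreach { p => ps.setBytes(1, p); ps.addBatch() }
      ps.executeBatch()
    } finally ps.close()
  }

  // Multi-row VALUES list: one statement, one round trip (assumes at least one row).
  def multiRowInsert(conn: Connection, payloads: Seq[Array[Byte]]): Unit = {
    require(payloads.nonEmpty, "need at least one row")
    val placeholders = payloads.map(_ => "(?)").mkString(", ")
    val ps = conn.prepareStatement(s"INSERT INTO event_journal (event_payload) VALUES $placeholders")
    try {
      payloads.zipWithIndex.foreach { case (p, i) => ps.setBytes(i + 1, p) }
      ps.executeUpdate()
    } finally ps.close()
  }
}
```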
I believe this is a performance issue, and it breaks the server when the network degrades (it shows up as many EventSourcedBehavior instances complaining about the circuit breaker failing fast).
In the end, while writing down this issue (collecting information), I found the same issue already exists, but on the Scala Slick side:
https://github.com/slick/slick/pull/2398 and https://github.com/slick/slick/issues/1272