Roiocam commented 1 year ago

I have recently been load testing my application. Its performance is poor, and it cannot even reach the database's bottleneck.

I have some questions about the insert model.

akka.persistence.jdbc.journal.dao.DefaultJournalDao uses the multi-insert action ++=:

https://github.com/akka/akka-persistence-jdbc/blob/bba266cc0b0d398e0b16fa0d1347311f51116b48/core/src/main/scala/akka/persistence/jdbc/journal/dao/DefaultJournalDao.scala#L45

I then took some time to find the code that actually executes the SQL. The multi-insert action is implemented by MultiInsertAction in slick.jdbc.JdbcActionComponent.

It basically just calls java.sql.Statement#executeBatch. As far as I know, this method does not do much to optimize a multi-row insert (most JDBC drivers do not).

The driver simply loops over the parameter sets, fills each one into the statement, and then loops again to execute them one by one. That costs a lot of network I/O, which is the bottleneck.

The MySQL documentation describes how to optimize this: https://dev.mysql.com/doc/refman/8.0/en/insert-optimization.html

In general, most databases support an INSERT with a list of VALUES tuples, which sends all rows in a single network round trip.
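To make the VALUES-list shape concrete, here is a minimal sketch in plain Java that only builds the SQL text and needs no database. The class and method names are hypothetical, and the table and column names are used purely for illustration; this is not how Slick or akka-persistence-jdbc generates SQL.

```java
import java.util.Collections;
import java.util.List;

public class MultiRowInsert {

    /** Builds one INSERT whose VALUES list holds rowCount placeholder tuples. */
    public static String multiRowInsert(String table, List<String> columns, int rowCount) {
        if (columns.isEmpty() || rowCount < 1) {
            throw new IllegalArgumentException("need at least one column and one row");
        }
        // One "(?, ?, ...)" tuple with a placeholder per column.
        String tuple = "(" + String.join(", ", Collections.nCopies(columns.size(), "?")) + ")";
        // Repeat the tuple once per row, so every row travels in the same statement.
        return "INSERT INTO " + table + " (" + String.join(", ", columns) + ") VALUES "
                + String.join(", ", Collections.nCopies(rowCount, tuple));
    }

    public static void main(String[] args) {
        // Illustrative table and column names only.
        System.out.println(multiRowInsert(
                "event_journal", List.of("persistence_id", "sequence_number"), 3));
    }
}
```

The resulting string can be passed to a PreparedStatement and executed once, instead of executing a single-row INSERT once per row. Some drivers can do a similar rewrite themselves; MySQL Connector/J, for example, has a rewriteBatchedStatements connection property, but whether that helps here depends on how the batch is issued.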

I believe this is a performance issue, and it can bring the server down when the network is degraded. (It shows up as many EventSourcedBehavior instances complaining that the circuit breaker is failing fast.)

Finally, while writing up this issue and collecting information, I found that the same problem has already been reported against Slick:

https://github.com/slick/slick/pull/2398 and https://github.com/slick/slick/issues/1272

Roiocam commented 1 year ago

Guesswork did not get me far. After a thorough investigation, I found that akka-persistence-jdbc/Slick does not even use the driver's executeBatch, and that the oldest node unexpectedly has longer I/O times (all nodes run in Kubernetes as Pods of the same Deployment).

normal/leader

Screenshot 2023-04-12 at 10 03 11

degradation/oldest

Screenshot 2023-04-12 at 10 04 17

reduce active pods to single

active pods

inactive pods

Roiocam commented 1 year ago

Conclusion

After investigating, I found that once the events of a persistent actor are tagged, the latency of persisting to the database increases dramatically. On some nodes it can be 10x.

The DAO degrades because, when an event has a tag, it falls back to inserting rows one by one in a loop instead of using a batch insert.

Slick does not support insertAndReturn as a batch.
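A rough way to see why this matters is to count statements. The sketch below is a simplified cost model in plain Java, not Slick's real behavior: it assumes the insert-and-return path issues one statement per event to read back a generated key plus one batch for the tag rows, and that a tag table keyed by (persistence_id, sequence_number) would let events and tags each go out as a single batch. That keying is my assumption about the kind of change #592 describes. All class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.List;

public class TaggedInsertCost {

    private static boolean anyTagged(List<List<String>> tagsPerEvent) {
        return tagsPerEvent.stream().anyMatch(tags -> !tags.isEmpty());
    }

    /** Assumed row-by-row path: one statement per event (to read back its key), plus one tag batch. */
    public static int statementsWithInsertAndReturn(List<List<String>> tagsPerEvent) {
        if (tagsPerEvent.isEmpty()) return 0;
        if (!anyTagged(tagsPerEvent)) return 1; // untagged events can still go out as one batch
        return tagsPerEvent.size() + 1;
    }

    /** Assumed batched path: tag rows use a key known before the insert, so nothing is read back. */
    public static int statementsWithNaturalKey(List<List<String>> tagsPerEvent) {
        if (tagsPerEvent.isEmpty()) return 0;
        return anyTagged(tagsPerEvent) ? 2 : 1; // one event batch, one tag batch
    }

    public static void main(String[] args) {
        // 100 events, each carrying a single tag.
        List<List<String>> batch = Collections.nCopies(100, List.of("some-tag"));
        System.out.println(statementsWithInsertAndReturn(batch)); // prints 101
        System.out.println(statementsWithNaturalKey(batch));      // prints 2
    }
}
```

Under these assumptions the cost of a tagged batch grows with the number of events on the first path and stays constant on the second, which matches the direction of the latency difference reported here.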

What happens in my case

According to the source code, batch-inserting events with tags will definitely cause performance degradation.

In my case it happened randomly, and I verified the problem in both cluster and stand-alone deployments.

Generally speaking, if the traffic sent to akka-persistence-jdbc exceeds its instantaneous processing capacity, writes start to be batched, and akka-persistence-jdbc starts to degrade.

In my application, probably because of sharding (my guess), some traffic is delayed before it reaches other nodes. A node then receives both HTTP load-balanced traffic and traffic forwarded by cluster sharding, which produces traffic peaks and triggers the degradation in akka-persistence-jdbc.

Roiocam commented 1 year ago

I think this is related to #592.

octonato commented 1 year ago

@Roiocam, thanks for reporting this.

I'm curious to see the results of the same load test after adding the PK/FK as described in #592.

Roiocam commented 1 year ago

> @Roiocam, thanks for reporting this.
>
> I'm curious to see the results of the same load test after adding the PK/FK as described in #592.

Yep, I'm working on it.

I forked this repository, changed the source code based on #592, packaged it, and uploaded it to our Maven repository. (I am not familiar with Scala; sbt publish had me stuck most of the time, and I finally solved it with maven deploy-file.)

I am using the same load test, but with more cases:

CASE 1
- hardware: 4c8g (2 nodes, with profiler)
- persistent actor num: 1000
- throughput: 750
CASE 2
- hardware: 4c8g (2 nodes, with profiler)
- persistent actor num: 1000
- concurrency: 750
CASE 3
- hardware: 4c8g (4 nodes, 2 of them with profiler)
- persistent actor num: 1000
- throughput: 4000
CASE 4
- hardware: 4c8g (4 nodes, no profiler)
- persistent actor num: 4000
- throughput: 4000

In case 3, the profiler CPU hotspot report tells us that #592 can solve this problem!

Screenshot 2023-04-13 at 15 33 19

My application monitoring confirms the same conclusion.

"Slice Executor" duration metrics in all tests (no more peaks under load):

After fixing the batch insertion problem, my application's throughput increased from 1000 to 4000. (Kryo serialization is the next bottleneck for now.)

Finally, this is my change for #592. I am not familiar with Scala, so I may have blind spots. One more thing: I did not use projections in my tests.

In my opinion, row-by-row insert performance is bad, and we should avoid anything like insert-and-return.

The PR with my change: #731

akka / akka-persistence-jdbc

batch insert event and event_tag performance. #710
