Particular / ServiceControl

Backend for ServiceInsight and ServicePulse
https://docs.particular.net/servicecontrol/

Performance comparison between V3 and V2 #1334

Closed SzymonPobiega closed 6 years ago

SzymonPobiega commented 6 years ago

Complete results available here


Recommendations

SzymonPobiega commented 6 years ago

We ran the V3+MSMQ setup for almost 24 hours to make sure the throughput does not decrease and there are no memory leaks. The memory consumption of SC oscillated between 8 and 15 GB. No memory leaks were detected.

We measured a sustainable audit processing throughput of 165 msg/s, compared to 115 msg/s on similar hardware in V2, which is an increase of over 40%.

SzymonPobiega commented 6 years ago

We ran a setup similar to the one above, but with the data disk created by striping two 7500 IOPS disks. The measured throughput was the same, which suggests that disk latency, not disk throughput, is the limitation.

SzymonPobiega commented 6 years ago

We ran ASB tests on V2 and V3. The measured throughput was 70 and 155 msg/s respectively, an increase of more than 100%. @danielmarbach was able to get to 800 msg/s on his beefy hardware, which suggests that the V3 result of 155 msg/s is limited, again, by disk latency, not by the transport.

SzymonPobiega commented 6 years ago

The weekend V3 SQL run showed a stable throughput of 160 msg/s.

SzymonPobiega commented 6 years ago

[image: cleanup]

The graph above shows the disk behavior when processing 93 KB messages. In this case we might actually be hitting the disk throughput limits, with ~300 writes/s and 30 MB of written data per second during ingestion, and much more during peaks (cleanup).

SzymonPobiega commented 6 years ago

The large message size tests showed that the size of the message seems to affect the ingestion throughput significantly, and the effect is not linear due to the threshold used by SC.

The 85 KB test run showed much larger memory usage, likely caused by managing in-document message bodies. We need to repeat this run for a longer period of time (overnight) to make sure the memory consumption is large but stable.
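
To make the non-linear effect concrete, here is a minimal sketch of a body-size cutoff like the one being described, assuming bodies below the threshold are embedded in the audit document and larger ones are stored separately. All type and member names here are hypothetical, not taken from the ServiceControl codebase:

```csharp
// Hypothetical sketch: bodies below a cutoff are embedded in the audit
// document, larger ones are stored separately. Crossing the cutoff changes
// the write pattern, which is why throughput does not scale linearly with
// body size.
public class AuditDocument
{
    public string Id { get; set; }
    public byte[] Body { get; set; }          // set only for small bodies
    public string BodyReference { get; set; } // set only for large bodies
}

public static class BodyStorage
{
    // Mirrors the .NET large-object threshold of 85,000 bytes.
    public const int BodyStorageThreshold = 85_000;

    public static AuditDocument Create(string id, byte[] body) =>
        body.Length < BodyStorageThreshold
            ? new AuditDocument { Id = id, Body = body }
            : new AuditDocument { Id = id, BodyReference = StoreSeparately(id, body) };

    static string StoreSeparately(string id, byte[] body)
    {
        // Placeholder for a separate body store (e.g. attachments or blobs).
        System.IO.File.WriteAllBytes(id + ".body", body);
        return id + ".body";
    }
}
```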

SzymonPobiega commented 6 years ago

I've observed super high memory consumption in the 80 KB message run -- up to all of the machine's memory (60 GB). After SC consumed all the memory it went through a very long GC pause (1 minute) and recovered at around 10 GB of memory consumption.

mikeminutillo commented 6 years ago

Is 85 KB too high a threshold maybe? We were aiming to avoid garbage collection when we implemented that limit. It's possible that some overhead somewhere has pushed us back over that threshold.
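
For reference, the .NET runtime places allocations of 85,000 bytes or more on the Large Object Heap, so even a small amount of per-message overhead can push a nominally 85 KB body over the limit. A tiny sketch of that cutoff (the sizes reflect the documented runtime behavior, not ServiceControl configuration):

```csharp
using System;

class LohThresholdDemo
{
    static void Main()
    {
        // The CLR allocates arrays of 85,000 bytes or more on the Large Object Heap.
        var smallBody = new byte[84_999]; // small object heap
        var largeBody = new byte[85_000]; // Large Object Heap

        // The LOH is collected together with generation 2, so a freshly
        // allocated large object already reports generation 2.
        Console.WriteLine(GC.GetGeneration(smallBody)); // 0
        Console.WriteLine(GC.GetGeneration(largeBody)); // 2
    }
}
```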

SzymonPobiega commented 6 years ago

Here's the memory usage of SC in the mode described above. I don't see any memory leaks, only steady allocation in the LOH and long GC pauses.

[image: gc]

mikeminutillo commented 6 years ago

It was specifically the LOH we were trying to avoid.

SzymonPobiega commented 6 years ago

The largest sub-LOH-size messages (40 KB) show an interesting memory allocation graph:

[image: 42kb]

The allocations start at 8, which is 30 minutes after SC started processing messages. The audit retention period is set to 30 minutes, so it seems like the allocations start when the cleanup job starts evicting messages. This is quite surprising and can't be explained by looking at the SC code. I suspect the allocations are happening inside RavenDB.
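
As a rough sketch of the kind of retention-driven cleanup being described, assuming a periodic job that evicts audit documents older than the retention period; the names and the in-memory store are illustrative, not ServiceControl's actual implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

class RetentionCleanup
{
    // Illustrative values; adjust to the configured retention period.
    static readonly TimeSpan RetentionPeriod = TimeSpan.FromMinutes(30);
    static readonly TimeSpan CleanupInterval = TimeSpan.FromMinutes(1);

    static readonly List<(string Id, DateTime ProcessedAt)> auditStore = new();

    static void CleanupLoop(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            var cutoff = DateTime.UtcNow - RetentionPeriod;

            // Nothing is evicted until the oldest documents cross the retention
            // boundary; from that point on, deletes compete with ingestion for
            // disk I/O and allocate on every cleanup pass.
            var expired = auditStore.Where(d => d.ProcessedAt < cutoff).ToList();
            foreach (var doc in expired)
            {
                auditStore.Remove(doc);
            }

            Thread.Sleep(CleanupInterval);
        }
    }
}
```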

SzymonPobiega commented 6 years ago

The curve seems to flatten...

[image: 42kb-2]

SzymonPobiega commented 6 years ago

It seems that the cleanup cost goes up with the retention period/database size. This is something we need to measure further and keep in mind when talking to customers. It is likely that above a certain DB size (or number of stored documents) the bottleneck moves from ingestion to cleanup.

udidahan commented 6 years ago

Has the relevant information from this issue been captured in an appropriate MD file somewhere?

SzymonPobiega commented 6 years ago

@udidahan nope. I've added it to the PoA.

SzymonPobiega commented 6 years ago

Created a doco pull here https://github.com/Particular/ServiceControl/pull/1344