We ran the V3+MSMQ setup for almost 24 hours to make sure the throughput does not decrease and there are no memory leaks. The memory consumption of SC oscillated between 8 and 15 GB. No memory leaks were detected.
We measured a sustained audit processing throughput of 165 msg/s, compared to 115 msg/s on similar hardware in V2, which is an increase of over 40%.
We ran a similar setup to the one above, but with the data disk created by striping two 7500 IOPS disks. The measured throughput was the same, which suggests that disk latency, not disk throughput, is the limiting factor.
We ran the ASB tests on V2 and V3. The measured throughput was 70 and 155 msg/s respectively, an increase of more than 100%. @danielmarbach was able to reach 800 msg/s on his beefy hardware, which suggests that the V3 result of 155 msg/s is again limited by disk latency, not by the transport.
The weekend V3 SQL run showed a stable throughput of 160 msg/s.
This shows the disk behavior when processing 93 KB messages. In this case we might actually be hitting the disk throughput limits, with ~300 writes/s and ~30 MB of data written per second during ingestion (300 writes/s × 93 KB ≈ 27 MB/s) and much more during peaks (cleanup).
The large message size tests showed that the message size seems to affect the ingestion throughput significantly, and that the effect is not linear due to the body size threshold used by SC:
The 85 KB test run showed much higher memory usage, likely caused by managing in-document message bodies. We need to repeat this run for a longer period of time (overnight) to confirm that the memory consumption, while high, is stable.
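To illustrate why the effect of message size can be non-linear around a body-size threshold, here is a minimal sketch of how a cutoff splits ingestion into two differently priced paths. The names, the in-memory stores, and the exact threshold value are illustrative assumptions, not ServiceControl's actual code or RavenDB's API.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only -- not ServiceControl's actual code and not RavenDB's API.
// It shows how a body-size threshold splits ingestion into two paths with different costs,
// which is one way the effect of message size on throughput can become non-linear.
class AuditIngestionSketch
{
    const int BodySizeThreshold = 85 * 1024; // assumed cutoff, matching the 85 KB discussed below

    class AuditDocument
    {
        public string Id;
        public byte[] EmbeddedBody; // only set for bodies below the threshold
    }

    static readonly List<AuditDocument> documents = new List<AuditDocument>();
    static readonly Dictionary<string, byte[]> externalBodies = new Dictionary<string, byte[]>();

    static void Ingest(string id, byte[] body)
    {
        var doc = new AuditDocument { Id = id };

        if (body.Length < BodySizeThreshold)
        {
            // Small bodies travel inside the document: one write, but the body is now part
            // of everything that touches the document (indexing, cleanup).
            doc.EmbeddedBody = body;
        }
        else
        {
            // Large bodies are kept outside the document: the document stays small,
            // but every message costs an additional write to the body store.
            externalBodies[id] = body;
        }

        documents.Add(doc);
    }

    static void Main()
    {
        Ingest("msg/1", new byte[40 * 1024]); // embedded
        Ingest("msg/2", new byte[93 * 1024]); // stored externally
        Console.WriteLine($"documents: {documents.Count}, external bodies: {externalBodies.Count}");
    }
}
```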
I've observed very high memory consumption in the 80 KB message run -- up to all of the machine's memory (60 GB). After SC consumed all the memory, it went through a very long GC pause (about 1 minute) and recovered at around 10 GB of memory consumption.
Is 85 KB too high a threshold maybe? We were aiming to avoid garbage collection when we implemented that limit. It's possible that some overhead somewhere has pushed us back over that threshold.
Here's the memory usage of SC in the mode described above. I don't see any memory leaks, only steady allocation in the LOH and long GC pauses.
It was specifically the LOH we were trying to avoid
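For reference, the 85 KB figure lines up with the .NET large object heap threshold of 85,000 bytes. A minimal standalone snippet (not SC code) shows where arrays land relative to that cutoff:

```csharp
using System;

// Standalone demo (not SC code): the .NET large object heap threshold is 85,000 bytes.
// Freshly allocated small-object-heap arrays report generation 0, while arrays at or above
// the threshold land on the LOH, which the GC reports as generation 2.
class LohThresholdDemo
{
    static void Main()
    {
        var belowThreshold = new byte[84000]; // ~84 KB object -> small object heap
        var aboveThreshold = new byte[85000]; // >= 85,000 bytes -> large object heap

        Console.WriteLine($"84 KB array generation: {GC.GetGeneration(belowThreshold)}"); // 0
        Console.WriteLine($"85 KB array generation: {GC.GetGeneration(aboveThreshold)}"); // 2
    }
}
```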
The largest sub-LOH-size messages tested (40 KB) show an interesting memory allocation graph
The allocations start at 8, which is 30 minutes after SC started processing messages. The audit retention period is set to 30 seconds, so it seems like the allocations start when the cleanup job begins evicting messages. This is quite surprising and can't be explained by looking at the SC code. I suspect the allocations are happening inside RavenDB.
The curve seems to flatten...
It seems that the cleanup cost goes up with the retention period/database size. This is something we need to measure further and keep in mind when talking to customers. It is likely that over a certain DB size (or number of stored documents) the bottleneck moves from ingestion to cleanup.
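As a rough back-of-the-envelope sketch of how the retained document count (and therefore the cleanup workload) grows with the retention period, assuming the ~160 msg/s ingestion rate measured above (the retention periods are arbitrary examples):

```csharp
using System;

// Back-of-the-envelope only: how many audit documents are retained at a steady ingestion
// rate, i.e. how the cleanup workload grows with the retention period. The 160 msg/s figure
// is the rate measured above; the retention periods are arbitrary examples.
class RetentionEstimate
{
    static void Main()
    {
        const double ingestionRate = 160; // msg/s, from the V3 SQL run above

        foreach (var retention in new[] { TimeSpan.FromHours(1), TimeSpan.FromDays(1), TimeSpan.FromDays(7) })
        {
            var retainedDocs = ingestionRate * retention.TotalSeconds;
            Console.WriteLine($"{retention}: ~{retainedDocs:N0} audit documents to keep and eventually evict");
        }
    }
}
```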
Has the relevant information from this issue been captured in an appropriate MD file somewhere?
@udidahan nope. I've added it to the PoA.
Created a doco pull here https://github.com/Particular/ServiceControl/pull/1344
Complete results available here
Recommendations