We ran the V3+MSMQ setup for almost 24 hours to make sure the throughput does not decrease and there are no memory leaks. The memory consumption of SC oscillated between 8 and 15 GB. No memory leaks were detected.
We measured a sustained audit processing throughput of 165 msg/s, compared to 115 msg/s on similar hardware in V2, which is an increase of over 40%.
We ran a similar setup to the one above, but with the data disk created by striping two 7500 IOPS disks. The measured throughput was the same, which suggests that disk latency, not disk throughput, is the limiting factor.
We ran the ASB tests on V2 and V3. The measured throughput was 70 and 155 msg/s respectively, an increase of more than 100%. @danielmarbach was able to reach 800 msg/s on his beefy hardware, which suggests that the V3 result of 155 msg/s is again limited by disk latency, not by the transport.
The weekend V3 SQL run showed a stable throughput of 160 msg/s.
This shows the disk behavior when processing 93 KB messages. In this case we might actually be hitting the disk throughput limits, with ~300 writes/s and ~30 MB of data written per second during ingestion (300 writes/s × 93 KB ≈ 27 MB/s) and much more during peaks (cleanup).
The large message size tests showed that the message size seems to affect the ingestion throughput significantly, and that the effect is not linear due to the body size threshold used by SC:
The 85 KB test run showed much higher memory usage, likely caused by managing in-document message bodies. We need to repeat this run for a longer period of time (overnight) to confirm that the memory consumption, while high, is stable.
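To illustrate why the effect of message size can be non-linear around a body-size threshold, here is a minimal sketch of how a cutoff splits ingestion into two differently priced paths. The names, the in-memory stores, and the exact threshold value are illustrative assumptions, not ServiceControl's actual code or RavenDB's API.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only -- not ServiceControl's actual code and not RavenDB's API.
// It shows how a body-size threshold splits ingestion into two paths with different costs,
// which is one way the effect of message size on throughput can become non-linear.
class AuditIngestionSketch
{
    const int BodySizeThreshold = 85 * 1024; // assumed cutoff, matching the 85 KB discussed below

    class AuditDocument
    {
        public string Id;
        public byte[] EmbeddedBody; // only set for bodies below the threshold
    }

    static readonly List<AuditDocument> documents = new List<AuditDocument>();
    static readonly Dictionary<string, byte[]> externalBodies = new Dictionary<string, byte[]>();

    static void Ingest(string id, byte[] body)
    {
        var doc = new AuditDocument { Id = id };

        if (body.Length < BodySizeThreshold)
        {
            // Small bodies travel inside the document: one write, but the body is now part
            // of everything that touches the document (indexing, cleanup).
            doc.EmbeddedBody = body;
        }
        else
        {
            // Large bodies are kept outside the document: the document stays small,
            // but every message costs an additional write to the body store.
            externalBodies[id] = body;
        }

        documents.Add(doc);
    }

    static void Main()
    {
        Ingest("msg/1", new byte[40 * 1024]); // embedded
        Ingest("msg/2", new byte[93 * 1024]); // stored externally
        Console.WriteLine($"documents: {documents.Count}, external bodies: {externalBodies.Count}");
    }
}
```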
I've observed very high memory consumption in the 80 KB message run -- up to all of the machine's memory (60 GB). After SC consumed all the memory, it went through a very long GC pause (about 1 minute) and recovered at around 10 GB of memory consumption.
Is 85 KB too high a threshold maybe? We were aiming to avoid garbage collection when we implemented that limit. It's possible that some overhead somewhere has pushed us back over that threshold.
Here's the memory usage of SC in the mode described above. I don't see any memory leaks, only steady allocation in the LOH and long GC pauses.
It was specifically the LOH we were trying to avoid
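For reference, the 85 KB figure lines up with the .NET large object heap threshold of 85,000 bytes. A minimal standalone snippet (not SC code) shows where arrays land relative to that cutoff:

```csharp
using System;

// Standalone demo (not SC code): the .NET large object heap threshold is 85,000 bytes.
// Freshly allocated small-object-heap arrays report generation 0, while arrays at or above
// the threshold land on the LOH, which the GC reports as generation 2.
class LohThresholdDemo
{
    static void Main()
    {
        var belowThreshold = new byte[84000]; // ~84 KB object -> small object heap
        var aboveThreshold = new byte[85000]; // >= 85,000 bytes -> large object heap

        Console.WriteLine($"84 KB array generation: {GC.GetGeneration(belowThreshold)}"); // 0
        Console.WriteLine($"85 KB array generation: {GC.GetGeneration(aboveThreshold)}"); // 2
    }
}
```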
The largest sub-LOH-size messages tested (40 KB) show an interesting memory allocation graph
The allocations start at 8, which is 30 minutes after SC started processing messages. The audit retention period is set to 30 seconds, so it seems like the allocations start when the cleanup job begins evicting messages. This is quite surprising and can't be explained by looking at the SC code. I suspect the allocations are happening inside RavenDB.
The curve seems to flatten...
It seems that the cleanup cost goes up with the retention period/database size. This is something we need to measure further and keep in mind when talking to customers. It is likely that over a certain DB size (or number of stored documents) the bottleneck moves from ingestion to cleanup.
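As a rough back-of-the-envelope sketch of how the retained document count (and therefore the cleanup workload) grows with the retention period, assuming the ~160 msg/s ingestion rate measured above (the retention periods are arbitrary examples):

```csharp
using System;

// Back-of-the-envelope only: how many audit documents are retained at a steady ingestion
// rate, i.e. how the cleanup workload grows with the retention period. The 160 msg/s figure
// is the rate measured above; the retention periods are arbitrary examples.
class RetentionEstimate
{
    static void Main()
    {
        const double ingestionRate = 160; // msg/s, from the V3 SQL run above

        foreach (var retention in new[] { TimeSpan.FromHours(1), TimeSpan.FromDays(1), TimeSpan.FromDays(7) })
        {
            var retainedDocs = ingestionRate * retention.TotalSeconds;
            Console.WriteLine($"{retention}: ~{retainedDocs:N0} audit documents to keep and eventually evict");
        }
    }
}
```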
Has the relevant information from this issue been captured in an appropriate MD file somewhere?
@udidahan nope. I've added it to the PoA.
Created a doco pull here https://github.com/Particular/ServiceControl/pull/1344
Complete results available here
Recommendations