dotnet / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/

Improve performance of persistent write-ahead log #57

Closed sakno closed 3 years ago

sakno commented 3 years ago

See suggestions in #56. Tasks:

potrusil-osi commented 3 years ago

Hi @sakno, I like both optimizations! Hopefully I will have time to take advantage of them before the research project this is part of is finished.

sakno commented 3 years ago
Here is a benchmark of various PersistentState caching modes on my NVMe SSD:

| Method | Mean | Error | StdDev |
|---|---|---|---|
| ReadLogEntriesWithFullCacheAsync | 4.230 us | 0.0179 us | 0.0139 us |
| ReadLogEntriesWithMetadataCacheAsync | 22.512 us | 0.4412 us | 0.8605 us |
| ReadLogEntriesWithoutMetadataCacheAsync | 42.370 us | 0.8241 us | 1.4214 us |

You can try this benchmark on your machine. Run the DotNext.Benchmarks project with `dotnet run -c Bench` and select `PersistentStateBenchmark`.

potrusil-osi commented 3 years ago
My results are not as good as yours:

| Method | Mean | Error | StdDev |
|---|---|---|---|
| ReadLogEntriesWithFullCacheAsync | 8.268 us | 0.0766 us | 0.0716 us |
| ReadLogEntriesWithMetadataCacheAsync | 213.049 us | 8.0940 us | 23.8652 us |
| ReadLogEntriesWithoutMetadataCacheAsync | 227.656 us | 4.4002 us | 3.9007 us |

sakno commented 3 years ago

The actual result depends on the hardware. In the case of the ReadLogEntriesWithFullCacheAsync benchmark, the result depends on the CPU because the log entry is fully cached in memory and disk I/O is not needed. The last two benchmarks depend on disk I/O. In my case, it's an NVMe SSD that uses PCIe as the bus. In your case, it's probably an SSD with a SATA interface or even an HDD. The ReadLogEntriesWithMetadataCacheAsync result looks very suspicious because of the high standard deviation. It seems like some other task was running at the same time as the benchmark.

sakno commented 3 years ago

@potrusil-osi , do you have an estimate of the time needed to complete your research work? I'm ready to publish a release.

potrusil-osi commented 3 years ago

We have two more weeks. After the last two optimizations you provided, the local results on my computer are pretty impressive. In our testing cluster the results are not as great, so we are digging deeper to find out where the bottlenecks are.

potrusil-osi commented 3 years ago

@sakno, back to your analysis of the benchmark results: I also have an NVMe SSD drive (Samsung PM981 1TB). Any thoughts on why my results are so much worse than yours? Are you running the benchmark on Windows?

sakno commented 3 years ago

No, I ran these tests on Linux. But I have a machine with Windows 10:

| Method | Mean | Error | StdDev |
|---|---|---|---|
| ReadLogEntriesWithFullCacheAsync | 9.568 us | 0.1045 us | 0.0872 us |
| ReadLogEntriesWithMetadataCacheAsync | 226.568 us | 4.5002 us | 3.9893 us |
| ReadLogEntriesWithoutMetadataCacheAsync | 284.843 us | 5.6811 us | 5.3141 us |

The results are close to yours. But the Windows 10 machine has a SATA SSD instead of an NVMe SSD on PCIe. I think your laptop has two drives: a system partition on NVMe and a user partition on SATA. The benchmark creates the WAL in the temp directory, and in a two-drive setup that directory is usually located on the SATA SSD by default.
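
If you want to double-check which drive backs the temp directory on your machine, a small diagnostic like the one below (plain .NET, not part of the benchmark) prints the temp path and the drive it resides on:

```csharp
using System;
using System.IO;

// Diagnostic sketch: print the temp directory used for the benchmark WAL
// and the drive hosting it, so you can confirm whether it is the NVMe
// or the SATA volume.
class TempDriveCheck
{
    static void Main()
    {
        string tempPath = Path.GetTempPath();
        var drive = new DriveInfo(Path.GetPathRoot(tempPath)!);
        Console.WriteLine($"Temp directory: {tempPath}");
        Console.WriteLine($"Drive: {drive.Name}, format: {drive.DriveFormat}, free: {drive.AvailableFreeSpace / (1 << 30)} GiB");
    }
}
```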

Also, in your case a StdDev of 23.8652 us is abnormally high. Try running the benchmark again without any background tasks.

potrusil-osi commented 3 years ago
There is only one drive. But after disabling half the services the results are a bit better:

| Method | Mean | Error | StdDev |
|---|---|---|---|
| ReadLogEntriesWithFullCacheAsync | 6.396 us | 0.0646 us | 0.0604 us |
| ReadLogEntriesWithMetadataCacheAsync | 106.715 us | 0.7805 us | 0.6517 us |
| ReadLogEntriesWithoutMetadataCacheAsync | 140.691 us | 0.6169 us | 0.5771 us |

sakno commented 3 years ago

In the benchmark I didn't try to choose optimal configuration parameters. However, I changed the default buffer size to 4096 bytes, which is equal to the filesystem cluster size, and now the results are much better on a SATA SSD but remain the same for NVMe. Pull the latest changes from the develop branch and try again.
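
If you want to experiment with the same knob in your own configuration, it would look roughly like the sketch below. Note that `BufferSize` is my shorthand for the relevant option, so verify the exact property name on `PersistentState.Options` before relying on it:

```csharp
using DotNext.Net.Cluster.Consensus.Raft;

// Sketch only: BufferSize is assumed to be the relevant PersistentState.Options member;
// check the actual options type for the exact property name.
var options = new PersistentState.Options
{
    BufferSize = 4096   // align the I/O buffer with the common 4 KiB filesystem cluster size
};
```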

potrusil-osi commented 3 years ago
After getting the latest develop:

| Method | Mean | Error | StdDev |
|---|---|---|---|
| ReadLogEntriesWithFullCacheAsync | 6.208 us | 0.0295 us | 0.0247 us |
| ReadLogEntriesWithMetadataCacheAsync | 59.048 us | 0.2609 us | 0.2313 us |
| ReadLogEntriesWithoutMetadataCacheAsync | 89.919 us | 0.5084 us | 0.4507 us |

sakno commented 3 years ago

Thanks for sharing the results! Also, I managed to cut off a few microseconds by optimizing hot paths within the UnsafeReadAsync method, which is the workhorse for all read operations.

afink-osi commented 3 years ago

Hi @sakno, I am a colleague of @potrusil-osi, who is out of the office this week. We have deployed our application to a cluster on dedicated hardware to continue testing. Unfortunately, we are not seeing the same degree of benefit from the optimizations as we saw when testing locally; our overall throughput is only about 80% of what was achieved in the local testing. In particular, we are seeing two issues that we would like to get your thoughts on.

  1. Write Lock Acquisition

We are seeing that the write lock acquisition in AppendCached() is taking significantly more time (~36ms) as compared to the writing of the entry to disk in UnsafeAppend() which takes on average (~1ms). With the compaction improvements, we are seeing an average compaction time of ~9ms. Do you have any thoughts on why acquiring the write lock in AppendCached() takes so long?

  2. Read Lock Acquisition

In the ReadBuffered() and ReadUnbuffered() methods, we are seeing that a read lock is being acquired, which on average takes ~4ms. We have configured PersistentState.Options with CopyOnReadOptions, which, based on the previous discussion, indicates that this operation shouldn't require a read lock at all. Is it expected that a lock is still required for this operation?

sakno commented 3 years ago

Hi @afink-osi , How many parallel writes do you have? Are reads/writes intermixed? What's the average size of log entry record?

sakno commented 3 years ago

> We are seeing that the write lock acquisition in AppendCached() is taking significantly more time (~36ms) as compared to the writing of the entry to disk in UnsafeAppend() which takes on average (~1ms).

It depends on the number of parallel writes. For instance, if you have 36 parallel writes and each write operation takes ~1 ms, then the last writer needs to wait ~36 ms.
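
To make the arithmetic concrete, here is a small self-contained simulation (using a plain SemaphoreSlim instead of the WAL's internal lock) of 36 writers queued behind one exclusive lock, each holding it for about 1 ms:

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Illustration of the queuing effect: 36 writers contend for one exclusive
// lock and each holds it for ~1 ms, so the unluckiest writer waits ~36 ms.
class LockQueueDemo
{
    static void HoldFor(TimeSpan duration)
    {
        // Busy-wait to emulate the ~1 ms disk write performed while the lock is held.
        var sw = Stopwatch.StartNew();
        while (sw.Elapsed < duration) { }
    }

    static async Task Main()
    {
        using var writeLock = new SemaphoreSlim(1, 1);
        var waits = await Task.WhenAll(Enumerable.Range(0, 36).Select(async _ =>
        {
            var sw = Stopwatch.StartNew();
            await writeLock.WaitAsync();     // time spent here is the lock acquisition cost
            sw.Stop();
            try { HoldFor(TimeSpan.FromMilliseconds(1)); }
            finally { writeLock.Release(); }
            return sw.Elapsed.TotalMilliseconds;
        }));

        Console.WriteLine($"max lock wait: {waits.Max():F1} ms"); // roughly 35-36 ms
    }
}
```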

sakno commented 3 years ago

Additionally, there is another project, FASTER, which offers a general-purpose, high-performance WAL. It might be a good candidate for Raft but requires a lot of research.

sakno commented 3 years ago

@afink-osi , @potrusil-osi , let's try again. What I've done: a cached log entry is not persisted until commit. As a result, AppendCachedAsync completes synchronously in most cases and lock contention stays low.
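
From the caller's perspective the flow looks roughly like the sketch below; the exact AppendAsync/CommitAsync signatures may differ, so treat it as an illustration of the idea rather than the precise API:

```csharp
// Usage sketch only: assumes an AppendAsync overload exposing the addToCache
// parameter mentioned later in this thread and a CommitAsync(index) method;
// `wal` stands for a PersistentState instance and `entry` for a log entry.
long index = await wal.AppendAsync(entry, addToCache: true); // expected to complete synchronously in most cases now
// ...append more entries...
await wal.CommitAsync(index); // cached entries are persisted (and flushed) as part of the commit
```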

afink-osi commented 3 years ago

Hi @sakno, thanks for the quick response and for the clarification! Yesterday when running tests, we were seeing replication throughput at ~28% of the throughput of our baseline (the application running without replication). Today, we were able to run some tests using the same storage volume as before but with your latest changes, and the results are very impressive! We are now running at about 65% of the baseline, which is much more in line with the local tests that @potrusil-osi was running before, and we are seeing minimal time waiting for locks. We also tested using memory instead of the SSD (which may be acceptable for our use case) for log storage and pushed this further to ~72% of the baseline.

One more question: In the new method PersistCachedEntryAsync(), there is a call to WriteAsync() but we do not see any call to a Flush() method. Are these entries being flushed somewhere else?

sakno commented 3 years ago

I missed the call to FlushAsync, my bad. Now it's added. Flushing happens only when the partition is switched or the end of the log is reached, which gives better I/O performance, especially when cached log entries are small.

sakno commented 3 years ago

Also, I added a special TryGetBufferWriter method to the IAsyncBinaryWriter interface that enables fast synchronous serialization. When you append log entries with the addToCache parameter, the writer returns a non-null result. This method is supported only if the IAsyncBinaryWriter is backed by a pipe or a dynamic in-memory buffer; in other cases, it returns null.
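
For illustration, a custom log entry serializer could use it roughly like this; the fallback WriteAsync overload shown is an assumption, so adjust it to the actual interface:

```csharp
using System;
using System.Buffers;
using System.Threading;
using System.Threading.Tasks;
using DotNext.IO;

static class LogEntrySerialization
{
    // Sketch: prefer the synchronous path when the writer is backed by a pipe or a
    // dynamic in-memory buffer (TryGetBufferWriter returns non-null); otherwise fall
    // back to asynchronous writing. The fallback overload shown here is an assumption.
    public static async ValueTask SerializeAsync(ReadOnlyMemory<byte> payload, IAsyncBinaryWriter writer, CancellationToken token)
    {
        IBufferWriter<byte>? buffer = writer.TryGetBufferWriter();
        if (buffer is not null)
            buffer.Write(payload.Span);                    // fast path: no async state machine
        else
            await writer.WriteAsync(payload, null, token); // slow path: asynchronous write (overload assumed)
    }
}
```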

sakno commented 3 years ago

@potrusil-osi , @afink-osi , I've added support for diagnostic events to the persistent WAL. You can configure them via the PersistentState.Options type:

You can use the dotnet-counters tool from the .NET SDK to track the counters or submit the values to a monitoring tool.
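
The original configuration snippet isn't shown here, so the following is only a hypothetical sketch of the idea, wiring event counters into the options object and then watching them with dotnet-counters; the counter property names are placeholders, not the documented options members:

```csharp
using System.Diagnostics.Tracing;
using DotNext.Net.Cluster.Consensus.Raft;

// Hypothetical sketch: the counter property names below are placeholders and not
// necessarily the documented members of PersistentState.Options.
var eventSource = new EventSource("Example.Raft.WAL");   // placeholder event source name
var options = new PersistentState.Options
{
    LockContentionCounter = new EventCounter("lock-contention", eventSource),
    ReadCounter = new EventCounter("read", eventSource),
    WriteCounter = new EventCounter("write", eventSource),
};

// The counters can then be observed at runtime, e.g.:
//   dotnet-counters monitor --process-id <pid>
```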

sakno commented 3 years ago

@potrusil-osi , @afink-osi , I would like to ask you guys to check the develop branch again. It contains another optimization of writes based on memory-mapped files. Read performance remains the same. Also, the replication process is now optimized to reduce the number of reads from the WAL. Thanks in advance!

afink-osi commented 3 years ago

@sakno The new counters look very helpful. I didn't get a chance to try them yet but will hopefully look at them tomorrow. We were able to test the new changes to the develop branch, and we are seeing an additional 11% increase in throughput in our application compared to before these latest optimizations!

sakno commented 3 years ago

Nice to hear that, @afink-osi ! Let me know when you check these counters.

sakno commented 3 years ago

@afink-osi , @potrusil-osi , did you have a chance to finalize your research project? If so, I would like to release a new version of .NEXT.

afink-osi commented 3 years ago

@sakno We have completed our research project. Unfortunately, we did not have time to dig into the counters you added before our time window closed. Thank you for your quick and thorough responses and improvements throughout this work! You may proceed with your release.

potrusil-osi commented 3 years ago

Hi @sakno, I also wanted to thank you for all the work. It's been a great experience to work with you on improving the WAL performance!

sakno commented 3 years ago

.NEXT 3.1.0 has been released. Closing issue.