Closed sakno closed 3 years ago
Hi @sakno, I like both optimizations! Hopefully I will have time to take advantage of them before the research project this is part of is finished.
Here is a benchmark of various PersistentState caching modes on my NVMe SSD:

| Method | Mean | Error | StdDev |
|---|---|---|---|
| ReadLogEntriesWithFullCacheAsync | 4.230 us | 0.0179 us | 0.0139 us |
| ReadLogEntriesWithMetadataCacheAsync | 22.512 us | 0.4412 us | 0.8605 us |
| ReadLogEntriesWithoutMetadataCacheAsync | 42.370 us | 0.8241 us | 1.4214 us |
You can try this benchmark on your machine. Run the DotNext.Benchmarks project with `dotnet run -c Bench` and select `PersistentStateBenchmark`.
My results are not as good as yours...

| Method | Mean | Error | StdDev |
|---|---|---|---|
| ReadLogEntriesWithFullCacheAsync | 8.268 us | 0.0766 us | 0.0716 us |
| ReadLogEntriesWithMetadataCacheAsync | 213.049 us | 8.0940 us | 23.8652 us |
| ReadLogEntriesWithoutMetadataCacheAsync | 227.656 us | 4.4002 us | 3.9007 us |
The actual result depends on the hardware. In the case of the `ReadLogEntriesWithFullCacheAsync` benchmark, the result depends on the CPU because the log entry is fully cached in memory and disk I/O is not needed. The last two benchmarks depend on disk I/O. In my case, it's an NVMe SSD that uses PCIe as a bus. In your case, it's an SSD with a SATA interface or even an HDD. The `ReadLogEntriesWithMetadataCacheAsync` result looks very suspicious because of the high standard deviation. It seems like some other task was running at the same time as the benchmark.
@potrusil-osi , do you have an estimate of the time needed to complete your research work? I'm ready to publish a release.
We have two more weeks. After the last two optimizations you provided, the local results on my computer are pretty impressive. In our testing cluster the results are not as great, so we are digging deeper into what the bottlenecks are.
@sakno, back to your analysis of the benchmark results: I also have an NVMe SSD drive (Samsung PM981 1TB). Any thoughts on why my results are so much worse than yours? Are you running the benchmark on Windows?
No, I ran these tests on Linux. But I have a machine with Windows 10:

| Method | Mean | Error | StdDev |
|---|---|---|---|
| ReadLogEntriesWithFullCacheAsync | 9.568 us | 0.1045 us | 0.0872 us |
| ReadLogEntriesWithMetadataCacheAsync | 226.568 us | 4.5002 us | 3.9893 us |
| ReadLogEntriesWithoutMetadataCacheAsync | 284.843 us | 5.6811 us | 5.3141 us |
The results are close to yours. But the Windows 10 machine has a SATA SSD instead of an NVMe SSD on PCIe. I think your laptop has two drives: a system partition on NVMe and a user partition on SATA. The benchmark creates the WAL in the temp directory. With two drives, this directory is usually mounted on the SATA SSD by default.
Also, in your case `23.8652 us` is a very abnormal StdDev. Try running this benchmark again without any background tasks.
There is only one drive. But after disabling half the services, the results are a bit better:

| Method | Mean | Error | StdDev |
|---|---|---|---|
| ReadLogEntriesWithFullCacheAsync | 6.396 us | 0.0646 us | 0.0604 us |
| ReadLogEntriesWithMetadataCacheAsync | 106.715 us | 0.7805 us | 0.6517 us |
| ReadLogEntriesWithoutMetadataCacheAsync | 140.691 us | 0.6169 us | 0.5771 us |
In the benchmark I didn't try to choose optimal configuration parameters. However, I changed the default buffer size to 4096 bytes, which is equal to the filesystem cluster size, and now the results are much better on a SATA SSD but remain the same on NVMe. Pull the latest changes from the `develop` branch and try again.
After getting the latest `develop`:

| Method | Mean | Error | StdDev |
|---|---|---|---|
| ReadLogEntriesWithFullCacheAsync | 6.208 us | 0.0295 us | 0.0247 us |
| ReadLogEntriesWithMetadataCacheAsync | 59.048 us | 0.2609 us | 0.2313 us |
| ReadLogEntriesWithoutMetadataCacheAsync | 89.919 us | 0.5084 us | 0.4507 us |
Thanks for sharing the results! Also, I managed to cut off a few microseconds by optimizing hot paths in the `UnsafeReadAsync` method, which is the workhorse for all reading operations.
Hi @sakno, I am a colleague of @potrusil-osi who is out of the office this week. We have deployed our application to a cluster on dedicated hardware to continue testing. Unfortunately, we are not seeing the same degree of benefit from the optimizations that have been seen when testing locally, as our overall throughput numbers are only about 80% of what was achieved in the local testing. In particular, we are seeing two issues that we would like to get your thoughts on.
We are seeing that the write lock acquisition in `AppendCached()` is taking significantly more time (~36 ms) than the writing of the entry to disk in `UnsafeAppend()`, which takes ~1 ms on average. With the compaction improvements, we are seeing an average compaction time of ~9 ms. Do you have any thoughts that might explain why acquiring the write lock in `AppendCached()` takes so long?
In the `ReadBuffered()` and `ReadUnbuffered()` methods, we are seeing that a read lock is being acquired, which takes ~4 ms on average. We have configured `PersistentState.Options` with `CopyOnReadOptions`, which, from the previous discussion, indicates that this operation shouldn't require a read lock at all. Is it expected that a lock is still required for this operation?
Hi @afink-osi , how many parallel writes do you have? Are reads/writes intermixed? What's the average size of a log entry record?
> We are seeing that the write lock acquisition in AppendCached() is taking significantly more time (~36ms) as compared to the writing of the entry to disk in UnsafeAppend() which takes on average (~1ms).
It depends on the number of parallel writes. For instance, if you have 36 parallel writes and each write operation takes ~1 ms, then the last writer needs to wait ~36 ms.
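That arithmetic can be sketched with a tiny illustrative model (plain Python, not the library's code):

```python
# Illustrative model: N writers serialized behind a single write lock.
# Writer i (0-based) must wait for all earlier writers to finish their writes.
def wait_before_acquire_ms(writer_index: int, write_time_ms: float = 1.0) -> float:
    return writer_index * write_time_ms

# With 36 parallel writers and ~1 ms per write, the last writer
# waits roughly 35 ms before it can even acquire the lock.
print(wait_before_acquire_ms(35))  # 35.0
```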
Additionally, there is another project, FASTER, that offers a general-purpose and high-performance WAL. It might be a good candidate for Raft but would require a lot of research.
@afink-osi , @potrusil-osi , let's try again. What I've done: a cached log entry will not be persisted until commit. As a result, `AppendCachedAsync` will complete synchronously in most cases and lock contention will be low.
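The deferred-persistence idea can be shown with a toy model (illustrative Python; the class and method names here are made up and do not match the dotNext API):

```python
class CachedWal:
    """Toy model of deferred persistence: appends go to an in-memory cache,
    and disk I/O happens only at commit time."""

    def __init__(self):
        self.cache = []  # uncommitted entries held in memory
        self.disk = []   # stand-in for the on-disk partition files

    def append_cached(self, entry):
        # No disk I/O here, so the call completes synchronously
        # and the write lock is held only briefly.
        self.cache.append(entry)

    def commit(self):
        # All cached entries are persisted in one batch.
        self.disk.extend(self.cache)
        self.cache.clear()

wal = CachedWal()
wal.append_cached(b"entry-1")
wal.append_cached(b"entry-2")
assert wal.disk == []  # nothing has hit the disk yet
wal.commit()
assert wal.disk == [b"entry-1", b"entry-2"]
```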
Hi @sakno, thanks for the quick response and for the clarification! Yesterday when running tests, we were seeing replication throughput at ~28% the throughput of our baseline (the application running without replication). Today, we were able to run some tests using the same storage volume as before but with your latest changes, and the results are very impressive! We are now running at about 65% of the baseline which is much more in line with the local tests that @potrusil-osi was running before and we are seeing minimal time waiting for locks. We also tested using memory instead of the SSD disk (may be acceptable for our use case) for log storage and pushed this further to ~72% of the baseline.
One more question: in the new method `PersistCachedEntryAsync()`, there is a call to `WriteAsync()` but we do not see any call to a `Flush()` method. Are these entries being flushed somewhere else?
I missed the call to `FlushAsync`, my bad. Now it's added. Flushing happens only when the partition is switched or the end of the log is reached, for better I/O performance, especially if cached log entries are small.
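The flush policy described here can be modeled roughly as follows (illustrative Python; function and parameter names are hypothetical):

```python
def persist_entries(entries, partition_size, write, flush):
    """Write every entry, but flush only when a partition boundary is crossed
    or the end of the log is reached, instead of flushing after every write."""
    flush_count = 0
    for i, entry in enumerate(entries):
        write(entry)
        partition_switched = (i + 1) % partition_size == 0
        end_of_log = i == len(entries) - 1
        if partition_switched or end_of_log:
            flush()
            flush_count += 1
    return flush_count

written = []
# 10 small entries with 4 entries per partition: flush after entries 4 and 8,
# plus one final flush at the end of the log -- 3 flushes instead of 10.
flushes = persist_entries([b"x"] * 10, 4, written.append, lambda: None)
print(flushes)  # 3
```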
Also, I added a special `TryGetBufferWriter` method to the `IAsyncBinaryWriter` interface that allows enabling fast synchronous serialization. When you append log entries with the `addToCache` parameter, the writer will return a non-null result. This method is supported only if `IAsyncBinaryWriter` is backed by a pipe or a dynamic in-memory buffer. In other cases, it returns null.
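The fast-path pattern behind `TryGetBufferWriter` can be illustrated with a Python analog (the destination classes below are invented for the sketch and are not the dotNext types):

```python
class InMemoryDestination:
    """Analog of a writer backed by a dynamic in-memory buffer."""
    def __init__(self):
        self.buffer = bytearray()

    def try_get_buffer(self):
        return self.buffer  # buffer is available: fast path is possible


class FileDestination:
    """Analog of a writer backed directly by a file: no in-memory buffer."""
    def try_get_buffer(self):
        return None


def write_entry(destination, payload):
    # Probe for the fast path first, like TryGetBufferWriter returning non-null.
    buf = destination.try_get_buffer()
    if buf is not None:
        buf.extend(payload)  # fast, fully synchronous serialization
        return "sync"
    return "async"  # fall back to the asynchronous write path

dest = InMemoryDestination()
print(write_entry(dest, b"log-entry"))            # sync
print(write_entry(FileDestination(), b"log-entry"))  # async
```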
@potrusil-osi , @afink-osi , I've added support for diagnostic events to the persistent WAL. You can configure them via the `PersistentState.Options` type:

- `LockContentionCounter` measures the rate of lock contention
- `LockDurationCounter` measures the duration of the lock in the acquired state, in ms
- `ReadCounter` measures the rate of reads, in the number of log entries
- `WriteCounter` measures the rate of writes, in the number of log entries
- `CompactionCounter` measures the rate of log compaction, in the number of log entries
- `CommitCounter` measures the rate of commits, in the number of log entries

You can use the dotnet-counters tool from the .NET SDK to track the counters or submit values to a monitoring tool.
@potrusil-osi , @afink-osi , I would like to ask you guys to check the `develop` branch again. It contains another optimization of writes based on memory-mapped files. Read performance remains the same. Also, the replication process is now optimized to reduce the number of reads from the WAL. Thanks in advance!
@sakno The new counters look very helpful; I didn't get a chance to try them yet but will hopefully look at them tomorrow. We were able to test the new changes in the `develop` branch, and we are seeing an additional 11% increase in throughput in our application compared to before these latest optimizations!
Nice to hear that, @afink-osi ! Let me know when you check these counters.
@afink-osi , @potrusil-osi , did you have a chance to finalize your research project? If so, I would like to release a new version of .NEXT.
@sakno We have completed our research project; unfortunately, we did not have time to dig into the counters that you added before our time window closed. Thank you for your quick and thorough responses and improvements throughout this work! You may proceed with your release.
Hi @sakno, I also wanted to thank you for all the work. It's been a great experience to work with you on improving the WAL performance!
.NEXT 3.1.0 has been released. Closing issue.
See suggestions in #56 . Tasks:

- `SkipAsync` to `IAsyncBinaryReader`
- `PersistentState` in a way that allows doing the compaction in parallel with writes or reads
- `fsync` for every write must be disabled by default (`FileOptions.WriteThrough` flag)