cppalliance / NuDB

NuDB: A fast key/value insert-only database for SSD drives in C++11
Boost Software License 1.0

Spurious excessive write latency #83

Closed: JoelKatz closed this issue 3 years ago

JoelKatz commented 3 years ago

I've been analyzing some massive outliers in write latency in rippled's use of NuDB and tracked the problem down to two things that appear to be deficiencies in NuDB. I have a quick and dirty patch that makes the problem go away, but it's not yet suitable for inclusion in NuDB. I'm opening this issue for input on whether my thinking makes sense, whether it's worth improving my changes for submission, or whether there's a better way to resolve my problem.

The first issue comes from this code in basic_store/insert:

     auto const rate = static_cast<std::size_t>(
         std::ceil(work / elapsed.count()));
     auto const sleep =
         s_->rate && rate > s_->rate;

This code measures the rate at which data has been inserted into the store since the last flush and imposes a delay on a writing thread if that rate exceeds the sustained flush rate measured at the last flush. The problem is that this measurement is extremely inaccurate for the first few writes after a flush. During the flush, a mutex is held; when it's released, backed-up writes flow into the store. The apparent rate of the first write will almost always exceed the sustained flush rate, so that thread is almost always made to sleep.
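
To make the spike concrete (the numbers below are hypothetical, not measurements from rippled), plugging a small first write and a sub-millisecond elapsed time into the same formula produces an apparent rate that dwarfs any realistic flush rate:

#include <cstdio>

int main()
{
    // Hypothetical first insert after a flush: 64 KB written while only
    // 0.5 ms has elapsed since the rate window was reset.
    double const work = 64.0 * 1024;     // bytes accumulated so far
    double const elapsed = 0.0005;       // seconds since the flush
    double const rate = work / elapsed;  // apparent insert rate in bytes/sec
    std::printf("apparent rate: %.0f MB/s\n", rate / (1024 * 1024));
    // Prints roughly 125 MB/s; any sustained flush rate below that
    // triggers the 25 ms sleep, even though almost nothing is backed up.
}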

My proposed quick and dirty fix for this is to set a minimum amount of write data that must be backed up before we impose a sleep on the writer.

The second issue comes from this code in basic_store/flush:

            auto const now = clock_type::now();
            auto const elapsed = duration_cast<duration<float>>(
                now > s_->when ? now - s_->when : clock_type::duration{1});
            s_->rate = static_cast<std::size_t>(
                std::ceil(work / elapsed.count()));

This works well if the batch happens to be large. But if the batch happens to be very small, the sustained flush rate will be severely underestimated because of fixed per-flush overhead. For example, writing 100 objects does not take 100 times as long as writing 1 object; it takes much less. So small batches yield a severely underestimated flush rate, which keeps the next batch small and leads to long recovery times.
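
A toy calculation makes the overhead effect visible (the per-flush overhead and device throughput here are assumptions for illustration, not NuDB measurements):

#include <cstdio>

int main()
{
    double const overhead_s = 0.005;       // assumed fixed per-flush overhead
    double const stream_mb_per_s = 500.0;  // assumed sustained device throughput
    for (double batch_mb : {0.1, 1.0, 10.0, 100.0})
    {
        double const elapsed = overhead_s + batch_mb / stream_mb_per_s;
        std::printf("%6.1f MB batch -> measured rate %6.1f MB/s\n",
                    batch_mb, batch_mb / elapsed);
    }
    // The 0.1 MB batch measures around 19 MB/s, while the 100 MB batch
    // measures close to 490 MB/s on the same assumed device.
}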

My proposed quick and dirty fix for this is to allow the estimated write rate to increase quickly but force it to decay slowly.
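
A minimal sketch of that idea, using the same 15/16 weighting the patch below ends up with (the function is illustrative, not part of NuDB):

#include <cstddef>

// Adopt a higher measurement immediately; blend a lower measurement in
// slowly so that one small batch cannot collapse the estimate.
std::size_t smooth_rate(std::size_t current, std::size_t measured)
{
    if (measured >= current)
        return measured;
    return (current * 15 + measured) / 16;
}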

The net effect of these two changes is a massive decrease in latency spikes (on writes) coming from NuDB's write throttling code.

Here is the quick and dirty patch:

diff --git a/include/nudb/impl/basic_store.ipp b/include/nudb/impl/basic_store.ipp
index 4ac79e4..17d39ed 100644
--- a/include/nudb/impl/basic_store.ipp
+++ b/include/nudb/impl/basic_store.ipp
@@ -342,12 +342,13 @@ insert(
     auto const rate = static_cast<std::size_t>(
         std::ceil(work / elapsed.count()));
     auto const sleep =
-        s_->rate && rate > s_->rate;
+        s_->rate && rate > s_->rate && work > 32*1024*1024;
     m.unlock();

     // The caller of insert must be blocked when the rate of insertion
     // (measured in approximate bytes per second) exceeds the maximum rate
-    // that can be flushed. The precise sleep duration is not important.
+    // that can be flushed and lots of data is already cached.
+    // The precise sleep duration is not important.
     if(sleep)
         std::this_thread::sleep_for(milliseconds{25});
 }
@@ -768,6 +769,12 @@ flush()
             auto const now = clock_type::now();
             auto const elapsed = duration_cast<duration<float>>(
                 now > s_->when ? now - s_->when : clock_type::duration{1});
+            auto const rate = std::ceil(work / elapsed.count());
+
+            // raise the rate quickly, drop it slowly
+            if (s_->rate <= rate)
+                s_->rate = rate;
+            else
+                s_->rate = (s_->rate * 16 - s_->rate + rate) / 16;
+
-            s_->rate = static_cast<std::size_t>(
-                std::ceil(work / elapsed.count()));
         #if NUDB_DEBUG_LOG

Should I try to finalize these for inclusion? Does it sound like I've correctly understood the cause of the latency spikes? Are there better solutions?

Thanks in advance.

vinniefalco commented 3 years ago

The current algorithm for determining when to flush is definitely not perfect

miguelportilla commented 3 years ago

Additionally, a possible improvement to the sleep might be...

auto const sleep = [&]() -> milliseconds::rep {
    if (s_->rate == 0 || rate <= s_->rate || work <= 256*1024*1024)
        return 0;
    return static_cast<milliseconds::rep>(
        (std::min(2.0, static_cast<double>(rate) / s_->rate) * 100) - 100);
}();
m.unlock();
if(sleep)
    std::this_thread::sleep_for(milliseconds{sleep});
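
For what it's worth, that expression scales the pause linearly with how far the insert rate overshoots the flush rate, from 0 ms at parity up to a 100 ms cap at twice the flush rate. A quick check with hypothetical rates (standalone, not NuDB code):

#include <algorithm>
#include <cstdio>

int main()
{
    double const flush_rate = 100.0;  // hypothetical sustained flush rate, MB/s
    for (double insert_rate : {100.0, 125.0, 150.0, 200.0, 400.0})
    {
        double const ms =
            (std::min(2.0, insert_rate / flush_rate) * 100) - 100;
        std::printf("insert %3.0f MB/s -> sleep %3.0f ms\n", insert_rate, ms);
    }
}
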
vinniefalco commented 3 years ago

Would it be better if this was configurable, for example to specify a bytes per second maximum?

vinniefalco commented 3 years ago

If a fix is urgent, we can merge your patch immediately while we think about a possibly more robust solution.

mellery451 commented 3 years ago

rippled could easily be configured to use an alternate fork/branch if desired: https://github.com/ripple/rippled/blob/develop/Builds/CMake/deps/Nudb.cmake#L12-L31

JoelKatz commented 3 years ago

It's definitely not urgent. I thought about a configurable bytes per second maximum, but I think that would impose a tricky burden on people to figure out what the right setting is.

I think a configurable minimum backed up data size before throttling is probably needed. I can't think of any way to get that to autotune. Applications that know how much memory they're supposed to be using can set that. (The purpose of the cache size before throttling is not to set the actual size of any cache but to ensure that overhead is amortized over the batch and not charged to the first few writes. So it probably won't need tuning. It's more like an I/O batch size than a cache size.)

I think we really want the data rate to auto-tune. My change works, at least under the conditions in which I tested it. But it does feel a bit ugly. We could take the batch size into account when deciding whether to update the flush rate.

Maybe we can take advantage of the way the two changes interact. We aren't going to delay any threads if the batch doesn't exceed the cache, so the flush rate will be ignored for "small" batches anyway. So maybe update the flush rate only if the batch size exceeds the cache size or is larger than the largest batch we've ever flushed?
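
A hypothetical sketch of that rule (the function and the largest_flush value are illustrative only, not NuDB members):

#include <algorithm>
#include <cstddef>

// Accept a flush measurement into the rate estimate only when the batch
// exceeded the burst/cache size or was the largest batch seen so far.
void maybe_update_rate(std::size_t& rate, std::size_t& largest_flush,
                       std::size_t work, std::size_t burst,
                       std::size_t measured_rate)
{
    if (work > burst || work > largest_flush)
    {
        rate = measured_rate;
        largest_flush = std::max(largest_flush, work);
    }
}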

That would get us down to one knob -- the cache size before we impose delays. Everything else would autotune without weird embedded constants like my patch has. The only choice then would be the default for that cache size. I'd nominate 32MB, just as a "seems about right" number.

By the way, without these changes, 5% of the time rippled's critical "build a new ledger" path takes over 500ms. With these changes, 99.5% of the time, the path takes under 200ms. So the difference is dramatic.

JoelKatz commented 3 years ago

This sets a tunable "burst size". We do not stall threads until the amount of pending data exceeds the burst size and we do not measure the speed of write batches below the burst size. This completely eliminated performance outliers in my test run. The burst size defaults to a mere 4MB, which seems to be enough. This suggests that the problem is purely due to very pathological behavior of the existing code with very small batches and not due to any need for large batches (and thus large RAM consumption) to get good performance.

There is really no need to ever tune this, at least as far as my testing has revealed. I left it tunable in case other people's applications are different from mine. Setting it to zero restores the original behavior.
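
For completeness, a hypothetical usage sketch of the set_burst setter added in the diff below (store creation and opening are elided; only set_burst itself comes from the patch):

#include <nudb/nudb.hpp>

int main()
{
    nudb::store db;
    nudb::error_code ec;
    // ... create and open the store in the usual way (elided) ...
    db.set_burst(8 * 1024 * 1024);  // tolerate 8 MB of pending data before throttling
    db.set_burst(0);                // zero restores the original throttling behavior
}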

diff --git a/include/nudb/basic_store.hpp b/include/nudb/basic_store.hpp
index 445e2e6..732cf2d 100644
--- a/include/nudb/basic_store.hpp
+++ b/include/nudb/basic_store.hpp
@@ -78,6 +78,7 @@ private:
         detail::key_file_header kh;

         std::size_t rate = 0;
+        std::size_t burst = 4 * 1024 * 1024;
         time_point when = clock_type::now();

         state(state const&) = delete;
@@ -426,6 +427,22 @@ public:
     insert(void const* key, void const* data,
         nsize_t bytes, error_code& ec);

+    /** Set the burst size
+
+        This function sets the amount of data that can be
+        cached before writing threads are throttled if the
+        sustained flush rate is exceeded.
+
+        @par Thread safety
+
+        Safe to call concurrently with any function except
+        @ref close.
+
+        @param burst_size The number of bytes before throttling
+    */
+    void
+    set_burst(std::size_t burst_size);
+
 private:
     template<class Callback>
     void
diff --git a/include/nudb/impl/basic_store.ipp b/include/nudb/impl/basic_store.ipp
index 4ac79e4..4b530ba 100644
--- a/include/nudb/impl/basic_store.ipp
+++ b/include/nudb/impl/basic_store.ipp
@@ -342,12 +342,13 @@ insert(
     auto const rate = static_cast<std::size_t>(
         std::ceil(work / elapsed.count()));
     auto const sleep =
-        s_->rate && rate > s_->rate;
+        s_->rate && rate > s_->rate && work > s_->burst;
     m.unlock();

     // The caller of insert must be blocked when the rate of insertion
     // (measured in approximate bytes per second) exceeds the maximum rate
-    // that can be flushed. The precise sleep duration is not important.
+    // that can be flushed and a burst of data is already in memory.
+    // The precise sleep duration is not important.
     if(sleep)
         std::this_thread::sleep_for(milliseconds{25});
 }
@@ -452,6 +453,18 @@ exists(
     return false;
 }

+//  Set the burst size
+//
+template<class Hasher, class File>
+void
+basic_store<Hasher, File>::
+set_burst(
+    std::size_t burst_size)
+{
+    detail::unique_lock_type m{m_};
+    s_->burst = burst_size;
+}
+
 //  Split the bucket in b1 to b2
 //  b1 must be loaded
 //  tmp is used as a temporary buffer
@@ -768,8 +781,11 @@ flush()
             auto const now = clock_type::now();
             auto const elapsed = duration_cast<duration<float>>(
                 now > s_->when ? now - s_->when : clock_type::duration{1});
-            s_->rate = static_cast<std::size_t>(
-                std::ceil(work / elapsed.count()));
+            auto const rate = std::ceil(work / elapsed.count());
+
+            if (work > s_->burst)
+                s_->rate = rate;
+
         #if NUDB_DEBUG_LOG
             dout <<
                 "work=" << work <<
vinniefalco commented 3 years ago

> By the way, without these changes, 5% of the time rippled's critical "build a new ledger" path takes over 500ms. With these changes, 99.5% of the time, the path takes under 200ms. So the difference is dramatic.

Because my algorithm was sort of arbitrary, any solution which works for your use-case is an improvement. I don't know anyone else who is using NuDB so we should optimize for the known case.

JoelKatz commented 3 years ago

If nobody has any better ideas or objections, I'll prepare a pull request identical to the diff above.

I think it might be a good idea to make the sleep time autotune. I think there's a benefit to not making the sleep any longer than absolutely necessary at first, but also to ensuring that even if several threads are querying NuDB, the memory footprint can't grow without bound. But I haven't proven there's a problem nor tested any solutions to it, so I'm leaning towards leaving it alone. (This would mainly be an issue on machines with slow I/O that have little business running NuDB anyway. ;)

JoelKatz commented 3 years ago

This might make some sense too. Instead of timing from when the flush that's in progress began, time from when the last flush began until we finish flushing. This prevents the tolerated write rate from needlessly starting from zero. Effectively, until we're finished flushing, we treat new inserts as if they were being added onto the previous set.

diff --git a/include/nudb/impl/basic_store.ipp b/include/nudb/impl/basic_store.ipp
index 876cce4..6d4a8f8 100644
--- a/include/nudb/impl/basic_store.ipp
+++ b/include/nudb/impl/basic_store.ipp
@@ -767,7 +767,7 @@ flush()
 #endif
     {
         unique_lock_type m{m_};
-        s_->when = clock_type::now();
+        auto when = clock_type::now();
         if(! s_->p1.empty())
         {
             std::size_t work;
@@ -785,6 +785,7 @@ flush()

             // Writes below the burst size may be dominated by
             // overhead and give an artificially low write rate
+            s_->when = when;
             if (work > s_->burst)
                 s_->rate = rate;
miguelportilla commented 3 years ago

Resolved in https://github.com/CPPAlliance/NuDB/pull/84

JoelKatz commented 3 years ago

This may only be meaningful on my machine, but it's a data point. The graphs show the write rate (in MB/s) versus the write size (in MB). The second is a zoomed-in view focusing on small write batch sizes.

[Graph: write rate (MB/s) versus write size (MB)]

[Graph: zoomed-in view of small write batch sizes]