facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org

RFC: Pluggable WriteController #3271

Open matthewvon opened 6 years ago

matthewvon commented 6 years ago

I presented an alternate write throttle for slower-class servers at the annual RocksDB Meetup on December 4, 2017. The current implementation is completely external to RocksDB. During the Q&A session, I was asked how this could be better/more cleanly integrated with RocksDB. My answer was to make the WriteController object pluggable, like the bloom filters and rate limiters. It was suggested that I post an RFC here first, then potentially submit a PR with said change, incorporating comments posted against the issue.

My suggested change is:

  1. add a shared_ptr object to the Options structure where a user can supply a custom WriteController. This is a shared_ptr, not a unique_ptr, in case the user has multiple database instances that wish to share the same WriteController so as to have a "system-wide" view (see the sketch after this list).

  2. the shared_ptr from the Options structure is assigned to the DB_Impl object in place of the current WriteController member object. If the user does not supply a custom WriteController, an instance of the existing WriteController is automatically added.

  3. write_controller.h moves to the include/rocksdb directory.

  4. add "virtual" to most of the existing WriteController methods. This simplifies creating derived implementations.

  5. optional: I have considered making some of the existing WriteController routines thread-safe. They currently assume the caller holds the database mutex. This could be awkward if a derived instance is actually shared across multiple database instances. Otherwise, any user-derived object will have to include the necessary protections.
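
A minimal sketch of items 1 and 2, assuming the option is named write_controller (illustrative) and that write_controller.h has already moved under include/rocksdb per item 3:

```cpp
#include <memory>

#include "rocksdb/write_controller.h"  // per item 3; not a public header today

struct DBOptions {
  // ... existing options ...

  // If non-null, replaces the DB's internal WriteController. A shared_ptr
  // so that multiple database instances can share one controller and keep
  // a "system-wide" view.
  std::shared_ptr<rocksdb::WriteController> write_controller;
};

// Sketch of item 2, inside DBImpl construction: fall back to the stock
// controller when the user supplies none.
//   write_controller_ = options.write_controller != nullptr
//                           ? options.write_controller
//                           : std::make_shared<rocksdb::WriteController>();
```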

siying commented 6 years ago

How about adding a simple API, DB::SetDelayedWriteRate(), that overrides the internally calculated delayed write rate?
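
For concreteness, the suggested API might look like this (a hypothetical signature, not an existing RocksDB method):

```cpp
class DB {
 public:
  // Override the internally calculated delayed write rate (bytes/sec).
  virtual void SetDelayedWriteRate(uint64_t delayed_write_rate) = 0;

  // ... existing DB interface ...
};
```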

ajkr commented 6 years ago

I have some questions/suggestions about your experience tuning RocksDB for slow storage, as we prefer RocksDB to natively support those environments if possible.

  1. have you tried reducing level0_slowdown_writes_trigger? That'll cause delays to user writes to be applied earlier, giving more time for compaction to happen before writes are stopped (items 1-3 are sketched after this list).
  2. have you tried reducing delayed_write_rate? Our delay by default starts at 16MB/s, which may be higher than the throughput EBS (or whatever you're using) is providing.
  3. have you tried using our rate limiter's fairness parameter? That can control the ratio of background I/O devoted to flush vs compaction, in case you see one blocking the other.
  4. how many background threads were you using for flushes/compactions?
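
For anyone following along, suggestions 1-3 map onto options roughly like this (values are illustrative, not recommendations):

```cpp
#include "rocksdb/options.h"
#include "rocksdb/rate_limiter.h"

rocksdb::Options options;
// 1. Delay user writes earlier, leaving more headroom before the stop
//    trigger (level0_stop_writes_trigger) is reached.
options.level0_slowdown_writes_trigger = 8;  // default is 20
// 2. Start the delay below the 16MB/s default for slower storage.
options.delayed_write_rate = 4 << 20;  // 4MB/s
// 3. fairness controls how often low-pri (compaction) requests may be
//    served ahead of high-pri (flush) requests inside the rate limiter.
options.rate_limiter.reset(rocksdb::NewGenericRateLimiter(
    32 << 20 /* rate_bytes_per_sec */, 100 * 1000 /* refill_period_us */,
    10 /* fairness */));
```
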
ajkr commented 6 years ago

BTW, we looked into your logs and noticed the same trend you mentioned in your talk, where delay is reduced as we approach the stop limit. It's not intended behavior so we're planning to fix that. Thanks for raising it :).

matthewvon commented 6 years ago

@ajkr Responding to your suggestions:

  1. With rocksdb, no. My previous experience is with Google's leveldb, where level0_slowdown_writes_trigger is 8 and the ceiling is 12. That setting does not help here because it has no feedback relating to drive performance / total drive activity. It also does not help if the problem is too many pending write buffers.

  2. delayed_write_rate is implicitly the number my code is changing in a dynamic fashion. @siying's suggestion of an API to set it at any time has strong merit. I only resist because the code also needs to activate the use of the delayed_write_rate by calling WriteController::GetDelayToken(). The goal here is a "one size fits all" algorithm that adjusts to the user's environment instead of having to be tuned to each customer's disk system, and that also adjusts to the disk I/O overhead of different read patterns and to external applications' usage of the disk drive.

  3. based upon my read of the code (which might be lacking), your rate limiter is used in two places: the compaction write loop and the write controller. The write controller usage is hard-coded to the rocksdb rate limiter without any way to change it externally. Making the WriteController object available for user replacement implies making the default rate limiter changeable within the WriteController. Also, there is a huge philosophical difference between my code and the rate limiter for compaction. My code assumes every write operation for flush and compaction is essential, and therefore is NOT to be controlled ... because we are already behind and only going to get worse. That said, my thread prioritization routine does a world of good balancing foreground operations' needs against background work. You really should try it. Given the high CPU content of compactions, processing prioritization is far more effective than I/O prioritization (which is available on only a limited number of Linux I/O schedulers).

  4. the four-core demo was using a total of 4 threads: one for flush and three for compactions (I believe that is a rocksdb default). What is really needed is a third pool for level 0 compactions alone, and/or better compaction logic that knows to only schedule level 0 compactions if the level 0 file count is above level0_slowdown_writes_trigger. This is the second place where the code is working against two highly inconsistent goals: the count of level 0 files decides when to block writes, while compaction selection is based upon the number of bytes in a level. My quick fix to that mistake is the thread priority logic, which effectively backs down the non-level-0 compactions, giving flushes and level 0 compactions priority ... and preventing hard stops of the write API calls. (A rough approximation with stock RocksDB is sketched below.)
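
One way to approximate the thread-prioritization idea from points 3 and 4 with stock RocksDB, assuming a release where Env exposes LowerThreadPoolCPUPriority:

```cpp
#include "rocksdb/env.h"

rocksdb::Env* env = rocksdb::Env::Default();
// Compactions run in the LOW thread pool and flushes in the HIGH pool, so
// dropping the LOW pool's CPU priority lets flushes and the foreground
// application win CPU whenever compactions pile up.
env->LowerThreadPoolCPUPriority(rocksdb::Env::LOW);
```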

matthewvon commented 6 years ago

@siying: I like your suggestion. I could have performed 90% of my change via your solution alone. The extra 10% that I prefer relates to applying the throttle rate over longer periods, once the algorithm has determined the overall flush and compaction load from the last hour's activity. This produces smoother latency across all activity. That 10% requires access to WriteController::GetDelayToken() to extend the use of the delayed_write_rate.

ajkr commented 6 years ago

Thanks for the additional details, we'll discuss and try to get back to you by early next week.

ajkr commented 6 years ago

Sorry for the delay. We finally talked about it today. One thing we needed more help understanding is why WriteController is preferred for this purpose rather than a custom RateLimiter.

The RateLimiter specified in options already gets called for flush and compaction writes. You could manually feed user writes to the same object so that it can make decisions based on a global view of writes (sketched below). It would not have the same problem as WriteController, namely that RocksDB changes it dynamically based on how much compaction debt exists. WriteController can then be effectively disabled by setting the stall triggers very high, and you'd get full control this way.
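
A sketch of that suggestion, assuming the application shares one RateLimiter between Options and its own write path (the wrapper function and priority choice are illustrative):

```cpp
#include <memory>

#include "rocksdb/db.h"
#include "rocksdb/env.h"
#include "rocksdb/rate_limiter.h"

// RocksDB charges flush/compaction writes against options.rate_limiter;
// the application charges user writes against the same object by hand.
std::shared_ptr<rocksdb::RateLimiter> limiter(
    rocksdb::NewGenericRateLimiter(16 << 20 /* 16MB/s */));
// ... set options.rate_limiter = limiter; open the DB as usual ...

rocksdb::Status RateLimitedPut(rocksdb::DB* db, const rocksdb::Slice& key,
                               const rocksdb::Slice& value) {
  // Blocks here when over the limit, before the write reaches RocksDB.
  limiter->Request(key.size() + value.size(), rocksdb::Env::IO_HIGH);
  return db->Put(rocksdb::WriteOptions(), key, value);
}
```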

matthewvon commented 6 years ago

First, you realize that the RateLimiter object within DBImpl is a dead object? There is no code that uses it. I have commented out its declaration in db_impl.h and its initialization in db_impl.cc. Nothing broke in the compile.

Second, my initial code that takes over the WriteController object is also watching the compaction debt and responding dynamically. It is just responding in a way that is based upon the hardware throughput, not random constants. The RateLimiter that applies to flushes and compactions is of zero interest to me. If the system is behind on compactions, there is no benefit from rate limiting the already-behind work. My alternative is to use CPU prioritization, not I/O channel prioritization, to keep foreground operations lively while giving compactions every remaining CPU cycle to process (to catch up).

My initial WriteController object does not account for many variables, yet. But I need the foundation work in place before beginning heavy construction.

Your approach is about 4 years behind my prior work.

So, my request is still to make the WriteController pluggable so that alternative algorithms for all the problems you mentioned may be explored and shared.

ajkr commented 6 years ago

  1. I said explicitly "The RateLimiter specified in options". It is definitely used for all flush/compaction writes, just like I said. Maybe take a look at the code. Who said anything about the one in "db_impl.h"?

  2. You don't understand how WriteController works. It throttles user writes, not flushes or compactions. Again, take a look at the code.

That's great your ideas are four years ahead of ours. I am sure they're just as valuable as you think they are, and your ego isn't involved at all in such claims.

matthewvon commented 6 years ago

My wording was poor. I have code to contribute that would be optional. But it needs a proper place to plug in, or not plug in as is appropriate per user. I seriously doubt it will fit Facebook's environment, because you would have written it if your environment required it. This code is not intended for flushes or compactions. It is intended only for throttling incoming write rate to smooth user latencies against current and likely compactions ... but mostly to prevent all halts of user write calls. The halts are deadly.

See, my goals are completely different:

I am not asking that Facebook ever adopt the exact code I write. I am asking that you allow a pluggable object, to make it possible for me and others to fulfill our needs.

ajkr commented 6 years ago

Ok, we talked more about it and will expose the WriteController publicly. I do still think the logic could be implemented in RateLimiter with some changes, like we add a new IOPriority to indicate which writes are user writes so you can choose to only throttle them. But it ends up not making a huge difference to us so we'll go your way. My wording above was poor, too :/.

matthewvon commented 6 years ago

Is the next step for me to create a PR, or for me to await a Facebook PR? I am happy to do the work to keep the ball rolling.

matthewvon commented 6 years ago

Me creating a PR for exposing the WriteController was the next step discussed at the RocksDB meetup after an RFC.

ajkr commented 6 years ago

Oh, I just did it, so we can go with mine if you want.

edit: just reread this issue's description for the first time in a month and realized you want to override its functions, not just call them. That'll need more discussion, unfortunately.

ajkr commented 6 years ago

While we're waiting on others' advice, I would be interested to see if you'd reconsider putting your logic in RateLimiter. We can make sure it's called for every user read/write, and you'd have the ability to throttle only those. You can simply disregard flush/compaction writes - they won't be throttled. It is much easier for us to make that change than to expose WriteController for subclassing, which is tightly coupled to the DB (as you noticed, its functions require holding the DB mutex).

matthewvon commented 6 years ago

I am currently hacking into three calls: set_max_delayed_write_rate(), max_delayed_write_rate(), and GetDelayToken(). GetDelayToken() lets me determine when the throttle applies. set_max_delayed_write_rate() implies set_delayed_write_rate().

My limited review of the code leads me to believe that the RateLimiter object within the WriteController object is independent of the flush and compaction RateLimiter, and therefore outside my consideration.

I only call set_max_delayed_write_rate() and GetDelayToken() during EventListener callbacks for flush and compaction. The global db mutex is already held during the callback, so this seems safe. I would like to call those routines every 60 seconds as well, but disabled that code since it did not appear thread-safe.

I was proposing a pluggable WriteController so that the same instance could be used across multiple open databases. That is not a requirement in our code at this time. Just thinking for the future. I could accomplish the same with existing EventListener based logic.

All that said, public access to the routines I just mentioned would be sufficient at this time. I assume you would add checks in the interface to see whether the global db mutex is already held, locking when it is not. Our next step would be to take more variables into consideration for the throttle.
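
For concreteness, the hack described above has roughly this shape, assuming patched-in access to the DB's WriteController (not public API); ComputeRateFromThroughput() is a hypothetical stand-in for the hardware-throughput-based algorithm:

```cpp
#include <memory>

#include "db/write_controller.h"  // internal header; patched-in access
#include "rocksdb/listener.h"

class ThrottleListener : public rocksdb::EventListener {
 public:
  explicit ThrottleListener(rocksdb::WriteController* wc) : wc_(wc) {}

  // As noted above, the global db mutex is held during these callbacks,
  // so calling the unsynchronized WriteController methods here is safe.
  void OnFlushCompleted(rocksdb::DB* /*db*/,
                        const rocksdb::FlushJobInfo& /*info*/) override {
    Adjust();
  }
  void OnCompactionCompleted(
      rocksdb::DB* /*db*/,
      const rocksdb::CompactionJobInfo& /*info*/) override {
    Adjust();
  }

 private:
  void Adjust() {
    uint64_t rate = ComputeRateFromThroughput();  // hypothetical
    wc_->set_max_delayed_write_rate(rate);
    // Holding the token keeps the delayed write rate in effect.
    token_ = wc_->GetDelayToken(rate);
  }
  uint64_t ComputeRateFromThroughput();  // hypothetical

  rocksdb::WriteController* wc_;
  std::unique_ptr<rocksdb::WriteControllerToken> token_;
};
```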

ajkr commented 6 years ago

To clarify, I've been talking about the RateLimiter specified in options, which is here: https://github.com/facebook/rocksdb/blob/master/include/rocksdb/options.h#L355. Please ignore other instances of RateLimiter; I very much agree they won't help your cause.

What I am proposing is changing where that RateLimiter is called such that it's called for (1) flushes, (2) compactions, and (3) user writes. It already is called for (1) and (2), and we are willing to make the change for (3). The type of write will be available to you in each call to RateLimiter::Request so you'd be able to prioritize them however you wish: https://github.com/facebook/rocksdb/blob/master/include/rocksdb/rate_limiter.h#L19. We will happily add a user-write type to that enum.
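
A minimal sketch of the dispatch a custom RateLimiter could then perform (IO_USER names the proposed user-write entry; a bool stands in for the priority check to keep the sketch self-contained):

```cpp
#include <memory>

#include "rocksdb/env.h"
#include "rocksdb/rate_limiter.h"

class UserWriteThrottle {
 public:
  explicit UserWriteThrottle(int64_t bytes_per_sec)
      : inner_(rocksdb::NewGenericRateLimiter(bytes_per_sec)) {}

  // is_user_write stands in for `pri == Env::IO_USER` in a real subclass.
  void Request(int64_t bytes, bool is_user_write) {
    if (!is_user_write) {
      return;  // flush/compaction writes are never delayed
    }
    inner_->Request(bytes, rocksdb::Env::IO_HIGH);  // delay the user write
  }

 private:
  std::unique_ptr<rocksdb::RateLimiter> inner_;
};
```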

Can you please double-check whether this fits your use case? It is a lot easier to give control of user writes (and compaction, and flush, so everything) in that way. Synchronization would be easy, and you wouldn't need to worry about DBImpl setting your rate limit in a strange way, or grabbing the delay token when you don't want it.

If not (which I do understand is a reasonable case), I actually think we should give access to WriteController internals in a constrained way. Returning the whole object, like I did in my PR, is probably a bad idea, as there's no way for users to grab the DB mutex, and there really shouldn't be. Additionally, some of the functions, like set_delayed_write_rate, would be called by both the user and DBImpl, so whichever calls later would override the other.

We could handle your current use case by having two dynamic options: delayed_write_rate and force_delayed_writes. You could set them via SetOptions and we'd take care of grabbing the lock and invoking the corresponding WriteController functions (set_max_delayed_write_rate and GetDelayToken). We can expand these if needed in the future and give you read-only access via GetProperty for the info you need.
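
Under that proposal, usage might look like this (force_delayed_writes is the proposed option and does not exist today; delayed_write_rate is real, and the value is illustrative):

```cpp
// Both options would be set through the existing SetOptions mechanism.
rocksdb::Status s = db->SetOptions({
    {"delayed_write_rate", "8388608"},  // 8MB/s
    {"force_delayed_writes", "true"},   // proposed: hold a delay token
});
```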

I don't really like the idea of sharing a WriteController across multiple instances. We call it in DBImpl in a way meant to prevent stall triggers from being reached, e.g., when the number of L0 files grows too high. It wouldn't be fair to penalize a DB instance with a low L0 count because another instance had a high L0 count, just because they share the same WriteController.

matthewvon commented 6 years ago

This seems like an extremely reasonable compromise. Please give me two or three days to validate.

matthewvon commented 6 years ago

Where do you propose to add a call to RateLimiter for the user write path?

(Unrelated: a WriteController shared across database instances might want to slow an instance with a low L0 count simply because it knows the full flush and compaction load of the system is high. Smoothing all instances reduces latency fluctuations in both read and write operations. But that debate is not relevant to your RateLimiter proposal.)

ajkr commented 6 years ago

I think we can put it near the top of DBImpl::WriteImpl. That way RateLimiter will be called in the same thread calling Put, Merge, etc.
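
Roughly where that call might land (a sketch against DBImpl internals, not actual RocksDB code; IO_USER is the proposed user-write priority from earlier):

```cpp
// db/db_impl.cc (sketch)
Status DBImpl::WriteImpl(const WriteOptions& write_options,
                         WriteBatch* my_batch /* , ... */) {
  if (immutable_db_options_.rate_limiter != nullptr) {
    // Charge the batch in the caller's thread, before joining the write
    // group, so Put/Merge/etc. block here when over the limit.
    immutable_db_options_.rate_limiter->Request(
        static_cast<int64_t>(my_batch->GetDataSize()),
        Env::IO_USER /* proposed */);
  }
  // ... existing write path ...
}
```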

matthewvon commented 6 years ago

Ping ... just letting you know I have not forgotten ... just distracted. Should be active on this again Monday or Tuesday (2/19 or 2/20).