389ds / 389-ds-base

The enterprise-class Open Source LDAP server for Linux
https://www.port389.org/

Improve the cache behaviour of DS #2155

Open 389-ds-bot opened 4 years ago

389-ds-bot commented 4 years ago

Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/49096


http://www.port389.org/docs/389ds/design/cache_redesign.html

Once it's written, use the cache from libsds to improve the performance of the Entry, DN, and NDN caches.

This is a long term goal, so we can mark it for 1.3.7 or later.

droideck commented 1 year ago

@Firstyear what do you think about the RFE these days?

Firstyear commented 1 year ago

Yeah, this is desperately needed. LMDB is based on copy-on-write, which means you have a single writer and many small, fine-grained readers. Having a cache and data structures that match this model is critical to performance.
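
As a concrete illustration of that model (a minimal sketch, not 389-ds code - the environment path and key are placeholders, and error handling is omitted), the stock LMDB C API shows the single-writer/many-readers split directly:

```c
/*
 * Sketch of LMDB's MVCC model: a reader pins a copy-on-write
 * snapshot, so the single concurrent writer never disturbs it.
 */
#include <lmdb.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    MDB_env *env;
    MDB_txn *rtxn, *wtxn;
    MDB_dbi dbi;
    MDB_val key = { 3, "uid" }, data = { 5, "admin" }, out;

    mkdir("/tmp/lmdb-demo", 0700);            /* env dir must exist */
    mdb_env_create(&env);
    mdb_env_open(env, "/tmp/lmdb-demo", 0, 0600);

    /* Open the unnamed database once so the handle is env-wide. */
    mdb_txn_begin(env, NULL, 0, &wtxn);
    mdb_dbi_open(wtxn, NULL, 0, &dbi);
    mdb_txn_commit(wtxn);

    /* The reader begins first and pins the pre-write snapshot. */
    mdb_txn_begin(env, NULL, MDB_RDONLY, &rtxn);

    /* The single writer commits while the reader is open; pages
     * are never modified in place, so the reader is unaffected. */
    mdb_txn_begin(env, NULL, 0, &wtxn);
    mdb_put(wtxn, dbi, &key, &data, 0);
    mdb_txn_commit(wtxn);

    /* The reader still sees its snapshot: the key is absent. */
    if (mdb_get(rtxn, dbi, &key, &out) == MDB_NOTFOUND)
        printf("reader snapshot predates the write\n");
    mdb_txn_abort(rtxn);
    mdb_env_close(env);
    return 0;
}
```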

Not only that, but so many bugs in 389-ds are caused by the lack of proper read transaction isolation within operations - data can change under you at any second, meaning a ton of subtle inconsistencies can exist. Look at the recent effort around cache pinning, where an entry needs to be locked into the cache and then unlocked - none of this would be needed with proper transactions.

Another classic example here is the loss of read throughput once a write occurs - isolating reads from writes would completely eliminate this.

At the same time, though, this would be a huge piece of work, with huge scope and investment required. As you've probably guessed, outside of some small maintenance patches and code review, my priorities within SUSE are elsewhere for the future. So I can't contribute significantly to this beyond the ideas and the justification.

Firstyear commented 1 year ago

Also, an update re the cache - see: https://github.com/kanidm/concread/blob/master/CACHE.md

We're already using it in the NDN cache, and it would be perfect for the entry and DN caches too. You can then build transactions out on top.
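
concread itself is Rust, so purely as a rough C-flavoured sketch (nothing like the real ARCache internals, which use much more sophisticated lock-free structures and memory reclamation), the transactional pattern it gives a cache looks like this: readers hold a reference to an immutable snapshot, while the single writer clones, mutates, and atomically publishes.

```c
/* Toy copy-on-write snapshot "cache" - illustrative only. A real
 * multi-threaded version needs safe memory reclamation (epochs,
 * hazard pointers, ...), which is exactly what concread provides. */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SLOTS 8

typedef struct snapshot {
    atomic_int refs;
    const char *vals[SLOTS];              /* toy payload */
} snapshot;

static _Atomic(snapshot *) current;

static snapshot *read_begin(void) {       /* "read transaction" */
    snapshot *s = atomic_load(&current);
    atomic_fetch_add(&s->refs, 1);
    return s;
}

static void read_end(snapshot *s) {
    if (atomic_fetch_sub(&s->refs, 1) == 1)
        free(s);
}

/* Single writer: copy, modify, publish. Readers holding the old
 * snapshot keep a consistent view until they drop it. */
static void write_commit(int slot, const char *val) {
    snapshot *old = atomic_load(&current);
    snapshot *new = malloc(sizeof(*new));
    memcpy(new->vals, old->vals, sizeof(new->vals));
    atomic_init(&new->refs, 1);           /* the cache's own ref */
    new->vals[slot] = val;
    atomic_store(&current, new);
    read_end(old);                        /* drop the old cache ref */
}

int main(void) {
    snapshot *init = calloc(1, sizeof(*init));
    atomic_init(&init->refs, 1);
    atomic_store(&current, init);

    snapshot *r = read_begin();           /* reader opens its "txn" */
    write_commit(3, "cn=demo");           /* writer commits meanwhile */
    printf("old reader sees: %s\n", r->vals[3] ? r->vals[3] : "(unset)");
    read_end(r);

    snapshot *r2 = read_begin();          /* new reader sees the write */
    printf("new reader sees: %s\n", r2->vals[3]);
    read_end(r2);
    return 0;
}
```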

tbordaz commented 1 year ago

I agree with you @Firstyear that caching is critical for performance. As you mentioned, caches are complex and deeply embedded in the server. Evaluating the benefit of a change requires some test cases (dataset, loads, diag tools, measurements) to identify the bottleneck and to verify that the fix reduces it. IMHO this ticket should start with the identification of those test cases.

progier389 commented 1 year ago

I agree with @Firstyear: there is a very large piece of work ahead of us. And IMHO it will be better to wait until bdb is dropped to do it, because things will be a lot simpler if we do not have to take care of bdb. My feeling is that we should think about these changes globally:

  1. DB entry format. Maybe we should have a format that is easier to parse (i.e., closer to the entry cache format, but with relative indexes instead of offsets).
  2. Entry cache. Knowing that db entries are already in memory, if the entry format is similar to the entry cache format, do we still need an entry cache? (The cache may be overkill.)
    • If yes, do we need it for all entries or only for special ones (like large groups)?
    • How do we implement it? (This issue is providing some interesting thoughts.)
  3. Transaction model. We cannot afford anymore to have long write txns, as can happen with the current design. But we do not want to break our plugin model either.

About the test cases, IMHO we could start:
    • with a generic benchmark with standard entries
    • with a benchmark with acis involving large-group handling

Then, if we notice an issue, we will have the scenario and can work to improve things.
Firstyear commented 1 year ago

> I agree with @Firstyear: there is a very large piece of work ahead of us. And IMHO it will be better to wait until bdb is dropped to do it, because things will be a lot simpler if we do not have to take care of bdb. My feeling is that we should think about these changes globally:
>
> 1. DB entry format. Maybe we should have a format that is easier to parse (i.e., closer to the entry cache format, but with relative indexes instead of offsets).

This isn't super needed if we have an entry cache, though. The point of a DB format is to store things in a way that is good for the DB, that represents the data, and that allows changes and versioning of the data within that DB.

The entry format that we deserialise into should be optimised for our runtime lookups and how we access things internally.

These are different priorities, and conflating them means you end up with the worst of both. For example, a hashmap to store attributes is going to be great for lookups in the entry cache, but it adds overhead in the db and complexity to serialisation. Similarly, storing things in the db in a compact way means we lose the ability to use better data structures for in-memory operations.

What we actually need is an IDL cache. Currently we always go to disk for these, and it would be much faster if we had them ready to go in memory.
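
No such cache exists in 389-ds-base today, so purely as a hypothetical sketch of the idea (every name below is made up): decoded ID lists would sit in memory keyed by the index key, and the database would only be consulted on a miss. A real version would also need invalidation whenever the underlying index changes.

```c
/* Hypothetical direct-mapped IDL cache - a sketch of the idea,
 * not an existing 389-ds-base API. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define IDL_CACHE_SLOTS 256

typedef struct {
    char *key;                 /* e.g. "uid=admin" (index key) */
    size_t n_ids;
    unsigned int *ids;         /* entry IDs matching the key */
} idl_slot;

static idl_slot idl_cache[IDL_CACHE_SLOTS];

static size_t idl_hash(const char *s) {
    size_t h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % IDL_CACHE_SLOTS;
}

/* Returns the cached IDL, or NULL on a miss (the caller would then
 * read the index from the database and call idl_cache_put). */
static unsigned int *idl_cache_get(const char *key, size_t *n_out) {
    idl_slot *s = &idl_cache[idl_hash(key)];
    if (s->key && strcmp(s->key, key) == 0) {
        *n_out = s->n_ids;
        return s->ids;
    }
    return NULL;
}

static void idl_cache_put(const char *key, unsigned int *ids, size_t n) {
    idl_slot *s = &idl_cache[idl_hash(key)];
    free(s->key);              /* evict whatever occupied the slot */
    free(s->ids);
    s->key = strdup(key);
    s->ids = ids;
    s->n_ids = n;
}

int main(void) {
    unsigned int *ids = malloc(2 * sizeof(*ids));
    ids[0] = 42; ids[1] = 77;
    idl_cache_put("uid=admin", ids, 2);

    size_t n;
    unsigned int *hit = idl_cache_get("uid=admin", &n);
    printf("hit: %zu ids, first id %u\n", n, hit[0]);
    return 0;
}
```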

> 2. Entry cache. Knowing that db entries are already in memory, if the entry format is similar to the entry cache format, do we still need an entry cache? (The cache may be overkill.) If yes, do we need it for all entries or only for special ones (like large groups)? How do we implement it?

This is one place where I really disagree with the authors of lmdb. lmdb relies on mmap and the Linux VFS page cache for caching its pages, but those are the first pages to be evicted under memory pressure. Additionally, the kernel page cache is quite slow and uses only simplistic LRU structures, meaning it's vulnerable to various cache invalidation attacks and access patterns. Finally, in something like docker we can't actually guarantee how much page cache is available to the container.

Reserving memory for ourselves avoids this and guarantees that we have entries in memory.

Second, lmdb is optimised for disk storage. It has large 4k page nodes, which is excellent for disk writes and reads, but inside those 4k pages key lookups are done with a binary search, because the nodes are wide. Binary search is awful on a modern CPU, because cache lines are only 64 bytes wide. Take one lmdb page and assume half of it (2k) is keys and the other half is links to other nodes in the lmdb tree: that leaves a 2k chunk (256 keys) to binary search. The chunk spans 32 cache lines, and a binary search over it requires 4 to 5 CPU cache-line loads. That puts far higher pressure on the l1/l2 than a native, CPU-cache-aware structure would - and that's just one lmdb page, let alone descending deeper into the tree. Were this a native memory structure, the lookup would need about 3 cache lines, and it's extremely likely the top and second levels would already be resident in the l1/l2. The difference is staggering.
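
To make the arithmetic concrete, here is a tiny program reproducing those numbers. The 8-byte key size is an assumption for the sake of the example (LMDB keys are variable length):

```c
/* Reproduces the cache-line arithmetic above, assuming 8-byte keys. */
#include <stdio.h>

int main(void)
{
    const int key_bytes = 8;               /* assumed key size */
    const int line_bytes = 64;             /* typical cache line */
    const int region_bytes = 2048;         /* half of a 4k page */
    const int keys = region_bytes / key_bytes;        /* 256 */
    const int lines = region_bytes / line_bytes;      /* 32 */
    const int keys_per_line = line_bytes / key_bytes; /* 8 */

    /* Each binary search step touches a fresh cache line until the
     * remaining interval fits within a single 64-byte line. */
    int loads = 0;
    for (int span = keys; span > keys_per_line; span /= 2)
        loads++;

    printf("%d keys over %d cache lines: ~%d distinct line loads\n",
           keys, lines, loads);
    return 0;
}
```

This prints "256 keys over 32 cache lines: ~5 distinct line loads", matching the 4 to 5 estimate above.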

> 3. Transaction model. We cannot afford anymore to have long write txns, as can happen with the current design. But we do not want to break our plugin model either.

We can't even afford short writes in our current design, let alone long ones. We don't really have a transaction model at all - we just take small locks all over the place. So if we want to be serious about improving performance, we need to isolate reads from writes so that lmdb and transactional structures can properly shine in our codebase. Otherwise we're just hindering ourselves.
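
As a sketch of what "isolating reads" could look like, the pattern is one read transaction per operation, with every lookup routed through it. The op_* wrappers below are hypothetical (nothing like this exists in the codebase today); only the mdb_* calls are real LMDB API:

```c
#include <lmdb.h>
#include <string.h>

/* Hypothetical operation context: one read snapshot per LDAP
 * operation, so data cannot change underneath it mid-flight. */
typedef struct {
    MDB_txn *txn;
    MDB_dbi dbi;
} op_ctx;

static int op_begin(MDB_env *env, MDB_dbi dbi, op_ctx *op) {
    op->dbi = dbi;
    return mdb_txn_begin(env, NULL, MDB_RDONLY, &op->txn);
}

/* Every entry/index lookup in the operation goes through the same
 * snapshot - no per-lookup locks, no entry pinning. */
static int op_get(op_ctx *op, const char *ndn, MDB_val *out) {
    MDB_val key = { strlen(ndn), (void *)ndn };
    return mdb_get(op->txn, op->dbi, &key, out);
}

static void op_end(op_ctx *op) {
    mdb_txn_abort(op->txn);   /* read-only: abort just releases it */
}
```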