basho / leveldb

Clone of http://code.google.com/p/leveldb/
BSD 3-Clause "New" or "Revised" License

LevelDB Threads #236

Open nsaadouni opened 4 years ago

nsaadouni commented 4 years ago

Hi @matthewvon,

The latest version of leveldb, "2.0.35", sees a move away from the Erlang scheduler and instead uses a worker pool of threads created in C++ inside leveldb. In the code I can see that a pool of 71 threads has been decided upon.

From comments in the wiki and the git logs, I saw that the number was chosen to be a prime.

1. Is there a reason behind requiring this number to be a prime number?
2. Is there a reason behind having only 71 threads?
3. Can this be altered to be greater than 71 threads?
4. What will the pros and cons be of increasing the number of threads?

I ask because we are seeing net_kernel tick timeouts (stalls) in our test environment. We have set Riak's sysmon long-scheduler monitoring, and the long timeouts are often in riak_kv_vnode, surrounding a gen_call. The only place I have found a gen_call in riak_kv_vnode is riak_kv_index_hashtree:insert/3 (called once all the tokens have been used up). This can at times take > 10 seconds to complete.

matthewvon commented 4 years ago
  1. The thread pool is considered to be a circular list. Each scheduler thread's ID is divided by the number of threads (71) to create a starting index into the circular list. A prime number for the thread count reduces the possibility of two scheduler threads starting their hunt for an available worker at the same index. (A rough sketch of this hunt appears after this list.)

  2. Each POSIX thread defaults to a stack size of 8 MB, so with 71 threads there is a default allocation of 568 MB of RAM. Not all of that space is ever used, but it is something to keep in mind when trying to manage swap and disk cache space. A "developer mode" option exists that drops the thread count to 17 so that our Erlang programmers could run 5 instances of the server on one laptop for testing.

  3. This is open source code. Change it however you please. Just note that a given vnode typically has only one write request allowed at a time. Raising the thread count may therefore only help if your server is running more than 64 vnodes.

  4. The easiest solution is to try raising the thread count and see if it helps your scenario. My guess is that it will not, but there is no harm in trying. You should post your actual symptoms against riak_core to get thoughts from the Erlang programmers. I believe you are looking at the active anti-entropy (AAE) area of Riak. It may be "functioning as designed". You need Erlang eyes to give you better insight.
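
As a rough illustration of the hunt described in point 1, here is a minimal sketch. The worker structure, the availability flag, and the exact index computation are placeholders, not the actual basho/leveldb pool code; it only shows why a prime pool size spreads out the starting slots.

```cpp
// Hypothetical sketch of the circular "hunt" for a free worker.
// kNumWorkers is prime (71) so scheduler IDs that share a common factor
// do not all map to the same starting slot.
#include <array>
#include <atomic>
#include <cstddef>

constexpr std::size_t kNumWorkers = 71;   // prime thread-pool size

struct Worker {
    std::atomic<bool> busy{false};        // placeholder availability flag
};

std::array<Worker, kNumWorkers> workers;

// Returns the index of the first free worker, starting from a slot derived
// from the caller's (Erlang scheduler) ID and wrapping circularly.
// Returns kNumWorkers if every worker is busy.
std::size_t FindWorker(std::size_t scheduler_id) {
    const std::size_t start = scheduler_id % kNumWorkers;
    for (std::size_t i = 0; i < kNumWorkers; ++i) {
        const std::size_t idx = (start + i) % kNumWorkers;
        bool expected = false;
        if (workers[idx].busy.compare_exchange_strong(expected, true)) {
            return idx;   // claimed this worker
        }
    }
    return kNumWorkers;   // no worker available
}
```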

nsaadouni commented 4 years ago

We actually work on Riak, and have gone through the AAE code several times to make sure the bottleneck is not coming from its design or from the AAE code itself. That is why I was asking how those threads work, since that is one of the major changes (I believe) to the leveldb dependency between Riak's older stable versions (2.0.X) and the 2.2.5 releases.

I had also read some of your posts on the leveldb wiki in the past where changes to leveldb had caused customers using it as their storage backend to experience these net_kernel tick timeouts.

We will keep on digging into it, thanks for taking the time to explain all of the above.

matthewvon commented 4 years ago

The leveldb code intentionally slows down the rate of user write operations when the volume of compactions gets too high. AAE will slam 1,000 or more keys into the system at once. There is some code to reduce the impact of the intentional write slowdowns for AAE, but when leveldb is behind, the user needs to wait.
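
To make the slow-down idea concrete, here is a minimal sketch of a compaction-based write throttle. The function name, thresholds, and delays are invented for illustration; basho/leveldb's actual throttle logic is more elaborate.

```cpp
// Hypothetical sketch: the caller passes in the current compaction backlog
// (bytes waiting to be compacted) and this write is delayed accordingly,
// trading user write latency for compaction headroom.
#include <chrono>
#include <cstddef>
#include <thread>

void MaybeThrottleWrite(std::size_t backlog_bytes) {
    const std::size_t kSoftLimit = 512ull << 20;    // 512 MB, illustrative only
    const std::size_t kHardLimit = 2048ull << 20;   // 2 GB, illustrative only

    if (backlog_bytes > kHardLimit) {
        // Far behind: stall this write so compactions can catch up.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    } else if (backlog_bytes > kSoftLimit) {
        // Mildly behind: add a small per-write delay.
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    // Otherwise the write proceeds at full speed.
}
```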

I suggest you look at disk write activity relative to the times you see the net_kernel tick timeouts ... which, by the way, I do not know as a net_kernel statistic ... at least not by that name.

matthewvon commented 4 years ago

Oh, and there is a known bug in the Erlang AAE code where two processes share the same leveldb iterator token. This is a really bad thing; it has led to segfault crashes and was not fixed before Basho closed. The eleveldb code has mutexes to help defend against this bad Erlang code, and sitting on one of those mutexes could result in the waits you are seeing. I believe the impact was typically seen during an AAE tree rebuild ... which I think posts status to one of Riak's command line tools. Again, there is heavy disk activity during the tree rebuild.
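
The defensive pattern described above amounts to serializing access to a shared iterator behind a mutex. A minimal sketch of the idea, assuming a wrapper class (the class and member names are invented and this is not the actual eleveldb code):

```cpp
// Hypothetical sketch of guarding a shared leveldb::Iterator so that two
// callers holding the same handle serialize instead of racing and
// corrupting iterator state.
#include <mutex>
#include <string>

#include "leveldb/iterator.h"

class GuardedIterator {
 public:
    explicit GuardedIterator(leveldb::Iterator* it) : it_(it) {}

    // Each operation takes the mutex, so a second caller blocks here
    // (the "sitting on a mutex" waits described above) instead of
    // advancing the iterator underneath the first caller.
    void Next() {
        std::lock_guard<std::mutex> guard(mutex_);
        it_->Next();
    }

    std::string Key() {
        std::lock_guard<std::mutex> guard(mutex_);
        return it_->key().ToString();
    }

 private:
    std::mutex mutex_;
    leveldb::Iterator* it_;   // owned elsewhere in this sketch
};
```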