Closed sjmueller closed 3 years ago
Hey @sjmueller , first of all, thanks for the very detailed explanation and information, it will be very helpful to reproduce the scenario and see what could be happening. I'll dig into it!
On the other hand, yes, the Redis adapter is not compatible with v2 yet, but it is my top priority now. I'm aiming to push the fixes so NebulexRedisAdapter can be compatible with v2 as soon as possible; most likely it will be ready by the end of next week (maybe before 🤞).
I'd suggest two quick tests: 1) change the backend to `:ets`, to rule out anything related to `:shards`; 2) stay with `:shards` but increase the partitions, for example by letting Nebulex resolve them with the default value `System.schedulers_online()`, or just set a higher number.
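A minimal sketch of the two suggested variants, assuming a cache configured under a hypothetical `:my_app` / `MyApp.Cache` (names are illustrative, not from the reporter's config):

```elixir
# Test 1: switch the primary backend to :ets, to rule out :shards
config :my_app, MyApp.Cache,
  backend: :ets

# Test 2: keep :shards but raise the partition count
config :my_app, MyApp.Cache,
  backend: :shards,
  partitions: System.schedulers_online() * 2
```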
BTW, out of curiosity, did you have this issue with the previous version 1.2 or 1.1, or is it something new with v2?
Hi @cabol, thanks for the quick response. We switched directly from mnesia to nebulex 2.0.0-rc.2, so there's no comparison against v1.2. However, what we **just now** decided to try is Nebulex v1.2.2 with the redis adapter; it's going to production shortly, so we can tell you how that goes.
Ok, we are now using an ElastiCache Redis instance in AWS with Nebulex v1.2.2 and the redis adapter, and the results were much better. Under peak load only some calls went above 50ms, and nothing went over 125ms. While we'd love it if nothing went over 50ms, these are within acceptable limits for us and certainly much better than what was happening with v2.0.0-rc.0 and the `Partitioned` adapter.
Thanks for the feedback! I was checking the partitioned adapter implementation in v2 and v1.2, and there are no big differences implementation-wise: both use the same `Nebulex.RPC` util for distributed tasks, same approach. So I think this situation will be the same regardless of the version, but I'll confirm it anyway. I'll continue with the Redis adapter for v2 and keep you posted!
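For intuition, the partitioned adapter's write path conceptually groups entries by owner node before fanning out RPC calls. A simplified sketch of that idea (module and helper names hypothetical; this is not the actual `Nebulex.RPC` API):

```elixir
defmodule PartitionedSketch do
  # Hypothetical sketch of how a partitioned put_all fans out:
  # 1. pick an owner node for each key (the real adapter uses a keyslot/hash scheme),
  # 2. group the entries per node,
  # 3. issue one remote call per node with its slice of entries.
  def put_all(entries, nodes) do
    entries
    |> Enum.group_by(fn {key, _value} ->
      Enum.at(nodes, :erlang.phash2(key, length(nodes)))
    end)
    |> Enum.map(fn {node, node_entries} ->
      # One remote call per node instead of one per entry
      :rpc.call(node, MyApp.Cache, :put_all, [node_entries])
    end)
  end
end
```

This is why latency grows with the number of entries: there is a grouping pass plus one RPC round trip per involved node.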
Hey! I did several benchmark tests with the partitioned cache (mostly a first attempt to identify any kind of issue with the partitioned adapter). Using benchee, I ran the following test scenarios:
- Partitioned cache, 3 nodes (running locally), `:shards` as the backend for the primary store, and 16 partitions. `put_all` with 10 entries:
- Partitioned cache, 3 nodes (running locally), `:shards` as the backend for the primary store, and 16 partitions. `put_all` with 100 entries:
- Partitioned cache, 3 nodes (running locally), `:shards` as the backend for the primary store, and 16 partitions. `put_all` with 1000 entries:
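A minimal benchee sketch along those lines (cache module and key names are assumptions, not the exact scripts used for the results above):

```elixir
# Benchmarks put_all with different batch sizes against a partitioned cache.
# Assumes MyApp.PartitionedCache is already started and the nodes are connected.
entries = fn n ->
  Map.new(1..n, fn i -> {"key:#{i}", %{id: i}} end)
end

e10 = entries.(10)
e100 = entries.(100)
e1000 = entries.(1000)

Benchee.run(
  %{
    "put_all 10 entries" => fn -> MyApp.PartitionedCache.put_all(e10) end,
    "put_all 100 entries" => fn -> MyApp.PartitionedCache.put_all(e100) end,
    "put_all 1000 entries" => fn -> MyApp.PartitionedCache.put_all(e1000) end
  },
  time: 10
)
```

Building the entry maps outside the benchmarked functions keeps the map-construction cost out of the measured latency.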
Some remarks:

- For `c:put_all/2`, what the partitioned adapter does internally is traverse the given entries, group them by node based on the key, and then perform the action on the different nodes. Hence, the larger the number of entries, the longer the latency or execution time will be (the partitioned adapter does some extra logic before performing the insert itself against the primary store).
- The latency of `c:put_all/2` increases "significantly" depending on the number of entries to store, but even so, with 1000 entries the average latency is still below 20 ms and the max below 50 ms.
- I limited the bench tests to `c:get/2` and `c:put_all/2` because those are the ones we are interested in; besides, `c:put_all/2` uses `c:put_all/3` under the hood. However, I also ran the bench tests for the other functions, and the latencies were all below 20 ms.
- I ran load tests with `basho_bench` as well, and still the latencies were below 20 ms (again only for `get` and `put_all`, as I commented above).

I was thinking about your use case; you have:
```elixir
result =
  Enum.reduce(Domain.Repo.all(members_count_query), %{}, fn item, map ->
    Map.put_new(map, "ConversationMembersCount:#{item.conversation_id}", item)
  end)

NebulexCache.put_all(result, on_conflict: :override)
```
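As a side note, that reduce can be written with `Map.new/2` when the keys are known to be unique (a sketch, not the original code; note that `Map.new` keeps the last value for a duplicate key, while `Map.put_new` keeps the first):

```elixir
# Equivalent map construction via a pipeline, assuming one row per conversation_id
result =
  members_count_query
  |> Domain.Repo.all()
  |> Map.new(fn item ->
    {"ConversationMembersCount:#{item.conversation_id}", item}
  end)
```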
And reads like:

```elixir
NebulexCache.get("ConversationMembers:#{conversation_id}")
```

How does the result of `Domain.Repo.all(members_count_query)` change with the number of users? For example, how many entries (on average) are you inserting with `put_all/2` when you have 10, 100, 500 users, and so on? Overall, what is the number of entries you are trying to insert when you have 500 users (or more)?

@sjmueller, any feedback on this? As I explained in my previous comment, I did several bench tests but couldn't reproduce the issue. Maybe you can give me more details about your scenario (check my questions in the previous comment)?
Hi @cabol, circling back here. It turns out there were some areas where we were caching full serialized objects, and doing so in sequential fashion. For example, we might loop through and write 100 user objects to the cache for each API request, and this added up under simultaneous load. For some reason this performed much better with the redis adapter. Furthermore, we've optimized these scenarios by using redis pipelines (via the nebulex adapter), so things are much more efficient now. Hope this helps.
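For reference, the sequential pattern described above can usually be batched into a single call (a sketch with hypothetical names, not the reporter's actual code):

```elixir
# Sequential: one cache round trip per user object (adds up under load)
Enum.each(users, fn user ->
  NebulexCache.put("User:#{user.id}", user)
end)

# Batched: a single put_all call; with the Redis adapter, batched writes
# can be served by a pipeline instead of one round trip per key
users
|> Map.new(fn user -> {"User:#{user.id}", user} end)
|> NebulexCache.put_all()
```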
Absolutely, it helps a lot; thanks for the feedback. I'm glad to hear you were able to sort it out with the Redis adapter. That is precisely the idea of Nebulex: being able to choose the adapter and topology that best fit your needs, as in this case. In fact, I remember running some benchmark tests with the Redis adapter against a 5-node Redis Cluster, and with the partitioned adapter on the same nodes connected via Distributed Erlang/Elixir, and I got better results with the Redis one. Anyway, thanks again; this is very helpful because it gives me a better idea of the scenario, and I will check whether the performance can be improved.
Honestly, I love what you've built here with nebulex, because it models exactly the way I think about caching, i.e. the ability to annotate functions so that caching does its job but not at the expense of the original contract. All this with flexibility and no lock-in! We're currently using nebulex in a more manual, centralized fashion, but can't wait to set aside time and refactor to the idiomatic approach.
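The annotation approach referred to here is Nebulex's declarative caching decorators. A minimal sketch of that pattern (module names are hypothetical):

```elixir
defmodule MyApp.Accounts do
  use Nebulex.Caching

  alias MyApp.{Cache, Repo, User}

  # The result is cached under the computed key; callers see the
  # original function contract unchanged.
  @decorate cacheable(cache: Cache, key: {User, id})
  def get_user(id) do
    Repo.get(User, id)
  end

  # Evicts the cached entry when the user is deleted.
  @decorate cache_evict(cache: Cache, key: {User, user.id})
  def delete_user(%User{} = user) do
    Repo.delete(user)
  end
end
```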
All the work you've done is greatly appreciated 🙏 keep it up!
Great to hear that 😄 ! And of course, there is a long TODO list yet!
Closing this issue for now. Once I have more information about it, if it can be improved somehow, I will create a separate issue for the enhancement.
We have two API nodes in our cluster and have Nebulex v2.0.0-rc.0 set up with `Nebulex.Adapters.Partitioned`. Under regular circumstances, accessing the cache is decently fast, under 50ms. But under semi-heavy load, we had put/delete transactions taking 2+ seconds. Originally we thought adding keys to the transactions would help, but the performance continued to be subpar. So we removed the transactions entirely and still have the same problem!

Some details about our setup:
- `Nebulex.Adapters.Partitioned` with this configuration (part of the snippet was lost in extraction):

```elixir
config :domain, Domain.NebulexCache,
  primary: [
    => 1 day
  ]
```
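For context, a v2 partitioned-cache configuration with a `:shards`-backed primary usually looks roughly like the following. All option values here are illustrative assumptions, not the reporter's actual settings, and the elided "1 day" option above is left as-is:

```elixir
# Illustrative only: typical primary-store options for a v2 partitioned cache.
config :domain, Domain.NebulexCache,
  primary: [
    backend: :shards,
    partitions: System.schedulers_online()
  ]
```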
- Under a small load of <100 simultaneous users, almost all cache actions execute in <1ms, with some outliers up to 20ms, which is the performance we would expect.
- When we have sudden semi-heavy load (e.g. after a mass push notification where 500 people open the app at the same time), the cache gets incredibly slow, resulting in data not returning to everyone's app for up to 1 minute (!!!)
- You can see how all cache calls start to balloon here to beyond 1s; we've even seen longer, pushing 3-5s and higher, even without using transactions:
- We have checked CPU utilization on the API nodes; even under the heaviest load the peak is less than 38%.
As you can imagine, this is really hampering our ability to scale with our app's growth! We have tried to move to a simpler, single-node Redis setup that avoids partitioning/replication using the official adapter, but [v2.0.0-rc.0 compatibility has stopped us](https://github.com/cabol/nebulex_redis_adapter/issues/21). Any help would be appreciated!