Closed jlledom closed 1 month ago
For some reason, the call to Storage.instance.get(stats_key) in https://github.com/jlledom/apisonator/blob/THREESCALE-8404-redis-acl-tls/spec/integration/worker_async_spec.rb#L63 never returns. It happens after a few calls to it, usually after the second or third call.
It needs to be fixed.
@eguzki This fixed it: https://github.com/3scale/apisonator/pull/350/commits/c94bd275d67261aff1cd27ed42bf12e3a3bff7ef
What do you think?
You are saying that multiple threads (were) are using the same async redis client. Which threads are those?
@eguzki
You are saying that multiple threads (were) are using the same async redis client. Which threads are those?
For the failing test, there are three threads accessing the storage in parallel:
What changed in your PR to break it? It makes all the sense that an async client is not shared between threads. Is it shared in the (current) master branch? I understand that in the worker, one thread is reading the queue and another one is processing the tasks. It works with one async client shared among threads. If it is not supported, how is it working?
If the multithreading is added in the tests, then we should change the tests, not the base code of apisonator.
What changed in your PR to break it?
I don't know.
Is it shared in the (current) master branch?
Yes
If it is not supported, how is it working?
I don't know, maybe I'm wrong. Ideas?
If the multithreading is added in the tests, then we should change the tests, not the base code of apisonator.
The tests are the same as before, if they passed before, they should pass now to ensure the behavior is the same.
@akostadinov I created this script to prove async_redis doesn't allow sharing the connection between threads:
# frozen_string_literal: true

require "bundler/inline"

gemfile(true) do
  source "https://rubygems.org"
  git_source(:github) { |repo| "https://github.com/#{repo}.git" }

  gem 'concurrent-ruby', '~> 1.1.6'
  gem 'hiredis', '~> 0.6.1'
  gem 'async-redis', '~> 0.7.0'
  gem 'redis', git: 'https://github.com/3scale/redis-rb', branch: 'apisonator'
end

require 'concurrent'
require 'async/io'
require 'async/redis/client'
require 'hiredis'
require 'redis'

# One async client per thread, stored in a thread-local variable.
def async_thread_isolate
  client = Concurrent::ThreadLocalVar.new do
    endpoint = Async::IO::Endpoint.tcp('localhost', 22121)
    Async::Redis::Client.new(endpoint, limit: 5)
  end

  Async { client.value.set("random_key", "a_key") }.wait

  result = []
  5.times.map { Thread.new {
    Async { result.push client.value.get("random_key") }
  } }.each(&:join)
  puts result
end

# A single async client shared among all threads.
def async_thread_share
  endpoint = Async::IO::Endpoint.tcp('localhost', 22121)
  client = Async::Redis::Client.new(endpoint, limit: 5)

  Async { client.set("random_key", "a_key") }.wait

  result = []
  5.times.map { Thread.new {
    Async { result.push client.get("random_key") }
  } }.each(&:join)
  puts result
end

# A single sync (redis-rb) client shared among all threads.
def sync_thread_share
  client = Redis.new(url: 'redis://localhost:22121')
  client.set("random_key", "a_key")

  result = []
  5.times.map { Thread.new {
    result.push client.get("random_key")
  } }.each(&:join)
  puts result
end

method = ARGV.first || "async_thread_isolate"
send(method)
I ran it from inside the apisonator container, like this (I saved the script as tmp/async_tests.rb under the apisonator source folder):

[jlledom@fedora apisonator:master=]$ make dev
[ruby@apisonator-dev apisonator]$ bundle exec script/services start
[ruby@apisonator-dev apisonator]$ ruby tmp/async_tests.rb sync_thread_share
Fetching https://github.com/3scale/redis-rb
Fetching gem metadata from https://rubygems.org/.....
Resolving dependencies...
Using nio4r 2.6.1 (was 2.5.9)
Using protocol-redis 0.6.1
Using bundler 2.3.5
Using hiredis 0.6.3
Using redis 4.1.3 from https://github.com/3scale/redis-rb (at apisonator@7210a9d)
Using fiber-annotation 0.2.0
Using fiber-local 1.0.0
Using timers 4.3.5
Using concurrent-ruby 1.1.10 (was 1.1.6)
Using console 1.23.2 (was 1.16.2)
Using async 1.31.0
Using async-pool 0.4.0 (was 0.3.12)
Using async-io 1.38.0 (was 1.34.3)
Using async-redis 0.7.0
a_key
a_key
a_key
a_key
a_key
[ruby@apisonator-dev apisonator]$ ruby tmp/async_tests.rb async_thread_isolate
Fetching https://github.com/3scale/redis-rb
Fetching gem metadata from https://rubygems.org/.....
Resolving dependencies...
Using fiber-annotation 0.2.0
Using fiber-local 1.0.0
Using timers 4.3.5
Using bundler 2.3.5
Using redis 4.1.3 from https://github.com/3scale/redis-rb (at apisonator@7210a9d)
Using nio4r 2.6.1 (was 2.5.9)
Using protocol-redis 0.6.1
Using concurrent-ruby 1.1.10 (was 1.1.6)
Using hiredis 0.6.3
Using console 1.23.2 (was 1.16.2)
Using async 1.31.0
Using async-io 1.38.0 (was 1.34.3)
Using async-pool 0.4.0 (was 0.3.12)
Using async-redis 0.7.0
a_key
a_key
a_key
a_key
a_key
[ruby@apisonator-dev apisonator]$ ruby tmp/async_tests.rb async_thread_share
Fetching https://github.com/3scale/redis-rb
Fetching gem metadata from https://rubygems.org/.....
Resolving dependencies...
Using bundler 2.3.5
Using fiber-local 1.0.0
Using nio4r 2.6.1 (was 2.5.9)
Using timers 4.3.5
Using protocol-redis 0.6.1
Using hiredis 0.6.3
Using redis 4.1.3 from https://github.com/3scale/redis-rb (at apisonator@7210a9d)
Using concurrent-ruby 1.1.10 (was 1.1.6)
Using fiber-annotation 0.2.0
Using console 1.23.2 (was 1.16.2)
Using async 1.31.0
Using async-pool 0.4.0 (was 0.3.12)
Using async-io 1.38.0 (was 1.34.3)
Using async-redis 0.7.0
0.7s warn: Async::Pool::Controller [oid=0xab4] [ec=0xac8] [pid=263] [2023-11-28 16:19:03 +0000]
| Closing resource while still in use!
| {"resource":"#<Async::Redis::Protocol::RESP2::Connection:0x00007fd8d0072ac8>","usage":1}
0.7s warn: Async::Task [oid=0xadc] [ec=0xaf0] [pid=263] [2023-11-28 16:19:03 +0000]
| Task may have ended with unhandled exception.
| NoMethodError: undefined method `zero?' for nil:NilClass
| → /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-pool-0.4.0/lib/async/pool/controller.rb:236 in `reuse'
| /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-pool-0.4.0/lib/async/pool/controller.rb:112 in `release'
| /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-pool-0.4.0/lib/async/pool/controller.rb:102 in `ensure in acquire'
| /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-pool-0.4.0/lib/async/pool/controller.rb:102 in `acquire'
| /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-redis-0.7.0/lib/async/redis/client.rb:119 in `call'
| /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/protocol-redis-0.6.1/lib/protocol/redis/methods/strings.rb:62 in `get'
| tmp/async_tests.rb:46 in `block (3 levels) in async_thread_share'
| /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-1.31.0/lib/async/task.rb:261 in `block in make_fiber'
a_key
a_key
a_key
a_key
Running ruby tmp/async_tests.rb async_thread_share returns a different problem each time you run it.
I haven't found any mention of thread-safety in the async-redis repo, but async-redis is built on top of async, which according to this issue doesn't allow sharing resources between threads. It seems logical to me: an Async::Redis::Client instance will have its own @pool (https://github.com/socketry/async-redis/blob/v0.8.0/lib/async/redis/client.rb#L38), which we are sharing between threads like in the issue I mentioned above.
I still don't have an answer to why it works in production, where Async::Redis::Client is also shared among threads.
I'm inclined to think that the reason it doesn't fail in production is either that we don't really use threaded workers there, or that load is not high enough, or pure luck. I agree that the async framework is not thread safe, but I could only demonstrate it with some changes to your test bed. Please bear with my long post.
In this part, there is a small issue with the test bed:
5.times.map { Thread.new {
Async { result.push client.get("random_key") }
}}.each(&:join)
puts result
We wait for each thread but don't wait for each Async task. So threads can die and the program may exit before all async tasks have completed, and we may (or may not) see the error message you posted above.
If we add a #wait call to the async task, then things always work. I created a gist so I'm able to link diffs and code lines. Here's the modification I'm talking about:
https://gist.github.com/akostadinov/d7a0b00fabb48ca74f08fb54090993e7/revisions?diff=unified&w=#diff-240e2d0b41b99e3ebf81ae17215a432402bfe29406f7e71df271abee4b1b544fR48
But a concurrency of 5 is not very stressing. So I changed it a little to create 55 async tasks in 55 threads, then wait for all threads and then for all async tasks before printing the results. Like this:
55.times.map { Thread.new {
Async { result.push client.get("random_key") }
}}.map(&:value).map(&:wait)
puts result
And the result was quite telling. First of all, I saw several messages in the console:
| Task may have ended with unhandled exception.
| FiberError: fiber called across threads
Additionally, the process hung indefinitely until I interrupted it with Ctrl+C. And btw, if I run the same code with 5 threads/tasks, it completes alright.
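The "fiber called across threads" failure mode can be reproduced with plain Ruby, no async gems involved. This is a minimal sketch of why sharing fiber-based machinery between threads breaks; the error is the same one quoted above:

```ruby
# A fiber belongs to the thread that created it. Resuming it from a different
# thread raises FiberError, which is what leaks out of the async pool when a
# task's fiber is touched from another thread.
fiber = Fiber.new { Fiber.yield :first; :second }

puts fiber.resume            # same thread: works, prints "first"

error = nil
Thread.new do
  fiber.resume               # different thread: not allowed
rescue FiberError => e
  error = e
end.join

puts error.message           # "fiber called across threads"
```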
Needless to say, the isolated client per thread method worked without issue, see https://gist.github.com/akostadinov/d7a0b00fabb48ca74f08fb54090993e7/revisions#diff-240e2d0b41b99e3ebf81ae17215a432402bfe29406f7e71df271abee4b1b544f
Unfortunately, I'm looking at the async framework for the first time, but it appears to be a known fact that these async tasks are not thread safe, and that tasks and pools shouldn't be shared across threads. While looking at your PR, I spotted an async project issue where the resolution was:
the pool created in one thread may have been memoized and used from another thread. I removed the memoization and it fixed the issue
This reads to me like a known design limitation. So overall the approach you took, creating an async client per thread and avoiding sneaky modifications of instance variables, is the right (if not the only) way to go forward. Thank you for doing a great detective job and for the nice test bed for thread safety.
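The per-thread-client idea can be sketched with nothing but stdlib thread-locals (Concurrent::ThreadLocalVar is the nicer production equivalent; FakeClient below is a hypothetical stand-in for Async::Redis::Client):

```ruby
# Hypothetical stand-in for a client; it records which thread created it.
class FakeClient
  attr_reader :owner

  def initialize
    @owner = Thread.current.object_id
  end
end

# Lazily build one client per thread; no instance ever crosses threads.
def client
  Thread.current[:redis_client] ||= FakeClient.new
end

owners = 5.times.map { Thread.new { client.owner } }.map(&:value)
puts owners.uniq.size   # 5: every thread built its own client
```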
An interesting question still is why the test in the master branch didn't fail. So I think it is worth spending some more time checking whether the updated code is fully sound, or whether it's just a difference in the generated instructions. We can try increasing the concurrency of the original test in master and see if it fails at some point.
P.S. I'm running the script locally, just using Ruby 3.0.4; it is reproducible like that just fine, at least for me.
@akostadinov Thanks for your research. In fact, calling .wait on the tasks makes all tests always pass with 5 threads; only the async_thread_share test fails with 55 threads.
About why it doesn't fail in master, I have a couple of theories.
In worker_async_spec.rb (the failing test) we have the worker running in a separate thread, but that doesn't represent production. In production, the worker runs in its own process. That process has two threads: one to read from the queue (code), another one to process the jobs (code). The thread processing the jobs connects to redis through the shared instance (code). But the thread reading from the queue accesses redis using the JobFetcher redis instance (code), which is initialized once during startup (code), and it creates a new instance (code) instead of taking the shared one. Observe the comment over redis_client (code). So basically there is no client shared among threads in the worker: the fetcher has its own instance, and the processor takes the shared one without sharing it with anybody else, because there are no more threads.
About the listener process, I'm not sure how multithreaded it really is. I guess Puma creates a thread for every request, but as proven in the test script we wrote, it can cope fine with a certain amount of parallelism without failing; maybe we never reach that workload in production. On the other hand, there's that problem where all threads send their commands to the pipeline whenever any thread opens a pipeline anywhere, and those commands are eventually sent to redis one after another from a single thread when the pipeline closes. That would help reduce the number of situations where multiple threads are actually accessing async_redis at the same time.
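That worker layout (one fetching thread, one processing thread, communicating only through a queue, each owning its own client) can be sketched with stdlib primitives. The names here are illustrative, not the actual apisonator classes:

```ruby
# Illustrative sketch: the fetcher owns the queue-reading side, the processor
# owns the job-running side; nothing but the Queue is shared between them.
jobs = Queue.new
processed = []

fetcher = Thread.new do        # stands in for JobFetcher with its own client
  3.times { |i| jobs << "job-#{i}" }
  jobs << :stop                # sentinel to end the processor loop
end

processor = Thread.new do      # stands in for the job-processing thread
  while (job = jobs.pop) != :stop
    processed << job           # stand-in for running the job against storage
  end
end

[fetcher, processor].each(&:join)
puts processed.size            # 3
```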
Additionally, running worker_async_spec.rb with n_reports = 150 instead of the original 10 shows the following error in the console and also deadlocks the whole process:

10.14s warn: Async::Pool::Controller [oid=0x7cec] [ec=0x2184] [pid=1623621] [2023-11-30 00:47:58 +0200]
| Closing resource while still in use!
| {"resource":"#<Async::Redis::Protocol::RESP2::Connection:0x00000000035c2fd0>","usage":1}
To me this shows that the original version also exhibited thread unsafety (as expected). For whatever reason it manifested less prominently, but it's still unacceptable.
I'd like to explore two areas of further development here: validate the new async implementation better, or possibly remove async altogether if we can observe that the sync implementation has similar performance to async.
If the multithreading is added in the tests, then we should change the tests, not the base code of apisonator.
Probably at runtime we actually avoid sharing the client between threads [1]. So the test seems inconsistent with the runtime environment in that regard. I'm concerned, though, about whether a client may leak between threads in future development or runtime configuration changes, and I'm inclined to err on the side of safety, especially while this project is new to us.
@eguzki, do you have confidence that we never access the same storage client from multiple threads?
@akostadinov
Additionally, running worker_async_spec.rb with n_reports = 150 instead of the original 10 shows the following error in the console and also deadlocks the whole process
True. Running worker_async_spec.rb on master with CONFIG_REDIS_ASYNC=true raises the Fiber called across threads error, among others, also inside the make dev container. That's the confirmation that master is broken too.
Probably at runtime we actually avoid sharing the client between threads [1]
Apisonator runs in two separate processes (even different pods in OpenShift): the worker and the listener. That comment is only about the worker. It doesn't affect the listener, where Puma usually opens a new thread for each request, and theoretically any thread could access the async_redis client.
However, it seems we explicitly configure Puma to run in single-threaded mode. If I'm not wrong, the number of workers and threads in Puma is configured here:

The value for server_model is set in the manifest file, here:

The number of workers is taken from the LISTENER_WORKERS env variable, which is set to 1 in production (you can verify this in our OpenShift). The number of threads per worker is set to [1,1] because thread_safe? is set to false:
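A sketch of that logic (the method name and the non-single-threaded range below are hypothetical; only the [1,1] range and the LISTENER_WORKERS variable come from the description above):

```ruby
# Hypothetical sketch: the Puma thread range collapses to [1, 1] whenever
# the backend is considered not thread safe.
def thread_safe?
  false                             # the backend is known not to be thread safe
end

def puma_thread_range
  thread_safe? ? [0, 16] : [1, 1]   # [min, max] threads per worker; 16 is a made-up cap
end

workers = Integer(ENV.fetch('LISTENER_WORKERS', '1'))
puts "workers=#{workers} threads=#{puma_thread_range.join(',')}"
```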
The comment above points out how they knew the entire backend is not thread safe. So everything is clear now:

* Production concurrency is based on async with fibers within a single thread
* worker_async_spec.rb is wrong because it should avoid multithreading, as in production
* We could drop the Concurrent::ThreadLocalVar hack because it probably won't fail in production

Today I've been investigating something I don't understand. When running apisonator locally with CONFIG_REDIS_ASYNC=true, I always get a No async task available! error when receiving the first request from porta.
According to some issues I found, I assume all calls to @async_redis must be done under an Async { ... } block. The async-redis docs seem to confirm that, as all examples wrap client calls in an Async block.
That's probably the reason for this block that wraps every test:
Or this one for specs (nice comment, btw):
It all makes sense, but now I'd like to know why we are not seeing this No async task available! error in production...
¯\_(ツ)_/¯
I'll let you have fun with this @akostadinov
Hi @jlledom @akostadinov, I was just following your thread.
I'm not a ruby expert at all, but I would like to share a few things I know from production. They might not be 100% accurate, but I think they are worth commenting on while you work on this issue:
* A year ago a library update of async-pool introduced a mem leak and also made performance go down like 40%, and the only reason was a change in the redis connection pool management in the library (that apparently was an innocent change), fixed I think at https://github.com/3scale/apisonator/pull/324

@slopezz thanks for your comments.
* AFAIK async mode has some limitations like only permitting 1 single worker https://github.com/3scale/apisonator/blob/master/docs/async.md#limitations
Shit, I would have saved some time if I had known about this doc before 😔.
* However, there are so many advantages to having just 1 worker that it makes a single worker worth it:
I assume by "worker" you mean one process. A process can have many threads in parallel, and each thread can have many concurrent fibers. Right now we stick to 1 process -> 1 thread -> many fibers. So I think there's room for improvement in performance while still keeping a single process, so all your points below still hold.
Anyway, this is all just ideas; in this PR I'll make as few changes as possible to implement TLS.
* we handle up to 10,000 requests/second with 95th percentile latency under 80ms
Did you use the sync mode before? Did you notice a performance boost when migrating from sync to async mode?
* A year ago a library update of `async-pool` introduced a mem leak, and also made performance went down like 40%, and the only reason was a change on redis connection pool management on the library (that apparently was an innocent change), fixed I think at [THREESCALE-7864 async-pool v0.3.12 #324](https://github.com/3scale/apisonator/pull/324)
I upgraded that gem in that PR because "Why not?", but I'm removing the upgrade, it's not needed for this PR.
I assume by "worker" you mean one process. A process can have many threads in parallel, and each thread can have many concurrent fibers. Right now we stick to 1 process -> 1 thread -> many fibers. So I think there's room for improvement in performance while still keeping a single process, so all your points below still hold.
I don't know the internals. What I know is that we change a few things in the containers compared to the default sync mode:
LISTENER_WORKERS = 1
CONFIG_REDIS_ASYNC = true
Did you use the sync mode before? Did you notice a performance boost when migrating from sync to async mode?
We switched from legacy virtual machines to OCP-async directly. However, a few performance tests (handled rps) were done by me and David (ex apisonator maintainer) in the early stages of OCP testing. I have just found a couple of examples:
https://github.com/3scale/infrastructure/issues/698#issuecomment-517296719
https://github.com/3scale/infrastructure/issues/698#issuecomment-542228578
I upgraded that gem in that PR because "Why not?", but I'm removing the upgrade, it's not needed for this PR.
I think it is OK to update libs in general to keep your code up-to-date. However, for the case of apisonator, where high performance is expected, we have seen in the past that innocent updates can break things. So it is important to do good testing on the staging environment before promoting an image to production, like done by Eguz on https://github.com/3scale/apisonator/pull/324, or even use tools like hey to do some simple load testing in staging before and after updating the image, to check whether the handled traffic is similar between the 2 images.
Actually, with our saas-operator we can easily do canary testing on staging or production environments, meaning that we can test 2 different images with live traffic, deciding how many pods of each image run at the same time, and deciding whether we want (or not) to send traffic to these canary deployments.
So when some changes are introduced in the code and we are not 100% sure they are safe, because they cannot be properly tested in staging, we can use that saas-operator canary feature to introduce a limited number of pods with a newer image, and get stats from prometheus... to see if performance is acceptable or not.
Today I've been investigating something I don't understand. When running apisonator locally with CONFIG_REDIS_ASYNC=true, I always get a
No async task available!
error when receiving the first request from porta.
I'll answer myself here. In production, the async mode is launched with Falcon instead of Puma. Falcon is ready for use with the async libraries we use in async mode, like async, async-io and async-redis.
I added some tests for TLS and ACL connections. However, they can't run inside the container because the image (quay.io/3scale/apisonator-ci:20230418-02) installs Redis 5, which supports neither TLS nor ACL.
I created an issue for that, as I'm not sure upgrading the CI image is within the scope of this PR.
IDK how this was implemented, but I see an upstream example protocol wrapper for using TLS and logical databases that easily worked for me. Not sure how easily the same can apply to sentinels though.
See my lazy PoC: https://github.com/3scale/apisonator/issues/135#issuecomment-2020485196
Closing this PR. I created https://github.com/3scale/apisonator/pull/391 to make this easier to review:
This is a work in progress to allow Apisonator to connect to Redis using ACL and TLS credentials.
Jira Issue:
THREESCALE-8404
Summary of changes:

* Stop using the redis 4.3.1 fork and use upstream redis 5.0.7:
  * Methods like :sadd returned a boolean in redis 4.3.1, but now they return what they get from the Redis server. redis 5.0.7 includes boolified versions of those methods, like :sadd?, which behave like the old :sadd.
  * redis 4.6.0 introduced some changes in the pipelined method (check the Changelog).
* The redis and async-redis gems provide different interfaces. We have an async client that reproduces the redis interface and translates it to async-redis calls.
* Avoid sharing the async-redis client (https://github.com/3scale/apisonator/pull/350/commits/c94bd275d67261aff1cd27ed42bf12e3a3bff7ef)
* The sync mode uses the redis gem, the async mode uses the async-redis gem.
* async-redis hasn't been updated, but I modified our client to add support for TLS and ACL.

Connecting to Redis using TLS and ACL is done. Running Apisonator in sync mode works fine and all tests pass in sync and async mode.
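The :sadd change mentioned above can be illustrated with a small stand-in (FakeRedis below is not the gem, just a sketch of the return-value difference between redis-rb 4.3.1 and 5.x):

```ruby
# Sketch: modern redis-rb returns the raw server reply from #sadd (an Integer),
# while #sadd? wraps it into the boolean the old API used to return.
class FakeRedis
  def initialize
    @sets = Hash.new { |h, k| h[k] = {} }
  end

  # Number of members actually added, like the raw Redis reply.
  def sadd(key, member)
    added = @sets[key].key?(member) ? 0 : 1
    @sets[key][member] = true
    added
  end

  # Boolified variant, equivalent to the old :sadd return value.
  def sadd?(key, member)
    sadd(key, member) > 0
  end
end

r = FakeRedis.new
puts r.sadd('langs', 'ruby')    # 1: newly added
puts r.sadd('langs', 'ruby')    # 0: already a member
puts r.sadd?('langs', 'go')     # true
```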
I'm blocked due to a deadlock in the async mode which I couldn't fix yet. To reproduce it, just run the tests with make test. The failing test is spec/integration/worker_async_spec.rb.

Pending tasks:

* Update test/unit/storage_sync_test.rb to add tests for TLS connections.
* Update the .3scale_backend.conf file.
* Support the ca_path ssl param.
* Remove the redis and resque forks and start using the upstream gems.
* Fix the ERR AUTH error.

About the forks mentioned above, the problem is we were using a forked version of the redis gem before: https://github.com/3scale/redis-rb

That fork included 3 PRs which are not upstream:
The fork is stuck in redis-rb 4.1.3 and we need >= 5. Those three PRs are lost in this PR, so we need to investigate what they were for and whether they are still required.

The same happens with resque. We were using a fork before: https://github.com/3scale/resque

The fork included one PR and a few commits not in upstream:
Configure server and client
This requires three steps:
Generate keys and certificates:
The first thing we need is a certification authority. For that, we need to create its key and certificate.
openssl genrsa -out ca-root-key.pem 4096
openssl req -new -x509 -days 365 -key ca-root-key.pem -out ca-root-cert.pem
Then we'll have to create a key and certificate for the server, and sign it with our CA. Make sure to set localhost as the CN, if that's the domain you're going to use to connect to Redis:

openssl genrsa -out redis.key 4096
openssl req -new -key redis.key -out redis.csr
openssl x509 -req -days 365 -in redis.csr -CA ca-root-cert.pem -CAkey ca-root-key.pem -CAcreateserial -out redis.crt
Configure Redis:
First, we need a Redis configuration file which enables ACL and TLS, defines the users, and sets the certificate the server will use and the CA the server will trust. This is the minimal redis.conf for that purpose:

The next step is to launch a container that has access to the certificates, keys and Redis config files created above. We create a volume for each file and modify the container start command to reference the configuration file we created. This is how I configured the redis pod in my docker-compose.yml
, but the container can be launched in other ways:

The :z at the end of the volume definitions is required when using SELinux, for instance in Fedora.

Configure Apisonator:
Apisonator can be configured as a regular TLS client, where only the server needs a certificate, or as a mutual TLS client, where both client and server must provide a certificate.
As TLS Client:
What I did is create a ~/.3scale-backend.conf, based on openshift/3scale_backend.conf, and set these properties:

Note how the ssl_params sections are not always required. We are using them here because the server is using a certificate signed by an unknown authority we've just created, so we need to explicitly tell Apisonator to trust that authority. If the server were using a certificate signed by any of the well-known CAs, the ssl_params sections could be omitted.

Another possible situation is when the server is using a self-signed certificate. When this happens, there's no trusted CA to add to ssl_params, so the only way to go is to skip certificate validation. Unfortunately, we don't support this yet because the redis-client gem still hasn't added support for verify modes (see: https://github.com/redis-rb/redis-client/issues/133), so we can't set SSL_VERIFY_NONE on the client. Anyway, this is only useful for development purposes; we'll have to use our own CA for development for the moment.

As Mutual TLS Client:
Generate a key and a signed certificate for Apisonator:
openssl genrsa -out apisonator.key 4096
openssl req -new -key apisonator.key -out apisonator.csr
openssl x509 -req -days 365 -in apisonator.csr -CA ca-root-cert.pem -CAkey ca-root-key.pem -CAcreateserial -out apisonator.crt
Then update the config file to set the next properties:
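Purely as an illustration of the shape such a config could take (every setting name below is an assumption; the authoritative names are in openshift/3scale_backend.conf and docs/configuration.md):

```ruby
# Hypothetical sketch of the mutual TLS client settings; names are assumptions.
ThreeScale::Backend.configure do |config|
  config.redis.url = 'rediss://localhost:46379/0'  # rediss:// scheme for TLS
  config.redis.ssl_params = {
    ca_file: '/path/to/ca-root-cert.pem',  # our private CA
    cert:    '/path/to/apisonator.crt',    # client certificate
    key:     '/path/to/apisonator.key'     # client private key
  }
end
```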
As mentioned above, the ca_file fields are only required when the server isn't using a certificate signed by one of the well-known CAs.

It could also be configured using the new ENV variables (see docs/configuration.md).