Closed jlledom closed 1 month ago
For some reason, the call to Storage.instance.get(stats_key) in https://github.com/jlledom/apisonator/blob/THREESCALE-8404-redis-acl-tls/spec/integration/worker_async_spec.rb#L63 never returns. It happens after a few calls to it, usually after the second or third call.
It needs to be fixed.
@eguzki This fixed it: https://github.com/3scale/apisonator/pull/350/commits/c94bd275d67261aff1cd27ed42bf12e3a3bff7ef
What do you think?
You are saying that multiple threads (were) are using the same async redis client. Which threads are those?
@eguzki
You are saying that multiple threads (were) are using the same async redis client. Which threads are those?
For the failing test, there are three threads accessing the storage in parallel:
What changed in your PR to break it? It makes all the sense that an async client is not shared between threads. Is it shared in the (current) master branch? I understand that in the worker, one thread is reading the queue and another one is processing the tasks. It works with one async client shared among threads. If it is not supported, how is it working?
If the multithreading is added in the tests, then we should change the tests, not the base code of apisonator.
What changed in your PR to break it?
I don't know.
Is it shared in the (current) master branch?
Yes
If it is not supported, how is it working?
I don't know, maybe I'm wrong. Ideas?
If the multithreading is added in the tests, then we should change the tests, not the base code of apisonator.
The tests are the same as before, if they passed before, they should pass now to ensure the behavior is the same.
@akostadinov I created this script to prove async_redis doesn't allow sharing the connection between threads:
# frozen_string_literal: true

require "bundler/inline"

gemfile(true) do
  source "https://rubygems.org"
  git_source(:github) { |repo| "https://github.com/#{repo}.git" }

  gem 'concurrent-ruby', '~> 1.1.6'
  gem 'hiredis', '~> 0.6.1'
  gem 'async-redis', '~> 0.7.0'
  gem 'redis', git: 'https://github.com/3scale/redis-rb', branch: 'apisonator'
end

require 'concurrent'
require 'async/io'
require 'async/redis/client'
require 'hiredis'
require 'redis'

# One async client per thread, stored in a thread-local variable.
def async_thread_isolate
  client = Concurrent::ThreadLocalVar.new do
    endpoint = Async::IO::Endpoint.tcp('localhost', 22121)
    Async::Redis::Client.new(endpoint, limit: 5)
  end

  Async { client.value.set("random_key", "a_key") }.wait

  result = []
  5.times.map { Thread.new {
    Async { result.push client.value.get("random_key") }
  } }.each(&:join)
  puts result
end

# A single async client shared among all threads.
def async_thread_share
  endpoint = Async::IO::Endpoint.tcp('localhost', 22121)
  client = Async::Redis::Client.new(endpoint, limit: 5)

  Async { client.set("random_key", "a_key") }.wait

  result = []
  5.times.map { Thread.new {
    Async { result.push client.get("random_key") }
  } }.each(&:join)
  puts result
end

# A single sync (redis-rb) client shared among all threads.
def sync_thread_share
  client = Redis.new(url: 'redis://localhost:22121')
  client.set("random_key", "a_key")

  result = []
  5.times.map { Thread.new {
    result.push client.get("random_key")
  } }.each(&:join)
  puts result
end

method = ARGV.first || "async_thread_isolate"
send(method)
I ran it from inside the apisonator container, like this (I saved the script as tmp/async_tests.rb under the apisonator source folder):

[jlledom@fedora apisonator:master=]$ make dev
[ruby@apisonator-dev apisonator]$ bundle exec script/services start
[ruby@apisonator-dev apisonator]$ ruby tmp/async_tests.rb sync_thread_share
Fetching https://github.com/3scale/redis-rb
Fetching gem metadata from https://rubygems.org/.....
Resolving dependencies...
Using nio4r 2.6.1 (was 2.5.9)
Using protocol-redis 0.6.1
Using bundler 2.3.5
Using hiredis 0.6.3
Using redis 4.1.3 from https://github.com/3scale/redis-rb (at apisonator@7210a9d)
Using fiber-annotation 0.2.0
Using fiber-local 1.0.0
Using timers 4.3.5
Using concurrent-ruby 1.1.10 (was 1.1.6)
Using console 1.23.2 (was 1.16.2)
Using async 1.31.0
Using async-pool 0.4.0 (was 0.3.12)
Using async-io 1.38.0 (was 1.34.3)
Using async-redis 0.7.0
a_key
a_key
a_key
a_key
a_key
[ruby@apisonator-dev apisonator]$ ruby tmp/async_tests.rb async_thread_isolate
Fetching https://github.com/3scale/redis-rb
Fetching gem metadata from https://rubygems.org/.....
Resolving dependencies...
Using fiber-annotation 0.2.0
Using fiber-local 1.0.0
Using timers 4.3.5
Using bundler 2.3.5
Using redis 4.1.3 from https://github.com/3scale/redis-rb (at apisonator@7210a9d)
Using nio4r 2.6.1 (was 2.5.9)
Using protocol-redis 0.6.1
Using concurrent-ruby 1.1.10 (was 1.1.6)
Using hiredis 0.6.3
Using console 1.23.2 (was 1.16.2)
Using async 1.31.0
Using async-io 1.38.0 (was 1.34.3)
Using async-pool 0.4.0 (was 0.3.12)
Using async-redis 0.7.0
a_key
a_key
a_key
a_key
a_key
[ruby@apisonator-dev apisonator]$ ruby tmp/async_tests.rb async_thread_share
Fetching https://github.com/3scale/redis-rb
Fetching gem metadata from https://rubygems.org/.....
Resolving dependencies...
Using bundler 2.3.5
Using fiber-local 1.0.0
Using nio4r 2.6.1 (was 2.5.9)
Using timers 4.3.5
Using protocol-redis 0.6.1
Using hiredis 0.6.3
Using redis 4.1.3 from https://github.com/3scale/redis-rb (at apisonator@7210a9d)
Using concurrent-ruby 1.1.10 (was 1.1.6)
Using fiber-annotation 0.2.0
Using console 1.23.2 (was 1.16.2)
Using async 1.31.0
Using async-pool 0.4.0 (was 0.3.12)
Using async-io 1.38.0 (was 1.34.3)
Using async-redis 0.7.0
0.7s warn: Async::Pool::Controller [oid=0xab4] [ec=0xac8] [pid=263] [2023-11-28 16:19:03 +0000]
| Closing resource while still in use!
| {"resource":"#<Async::Redis::Protocol::RESP2::Connection:0x00007fd8d0072ac8>","usage":1}
0.7s warn: Async::Task [oid=0xadc] [ec=0xaf0] [pid=263] [2023-11-28 16:19:03 +0000]
| Task may have ended with unhandled exception.
| NoMethodError: undefined method `zero?' for nil:NilClass
| → /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-pool-0.4.0/lib/async/pool/controller.rb:236 in `reuse'
| /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-pool-0.4.0/lib/async/pool/controller.rb:112 in `release'
| /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-pool-0.4.0/lib/async/pool/controller.rb:102 in `ensure in acquire'
| /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-pool-0.4.0/lib/async/pool/controller.rb:102 in `acquire'
| /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-redis-0.7.0/lib/async/redis/client.rb:119 in `call'
| /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/protocol-redis-0.6.1/lib/protocol/redis/methods/strings.rb:62 in `get'
| tmp/async_tests.rb:46 in `block (3 levels) in async_thread_share'
| /home/ruby/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/async-1.31.0/lib/async/task.rb:261 in `block in make_fiber'
a_key
a_key
a_key
a_key
Running ruby tmp/async_tests.rb async_thread_share returns a different problem each time you run it.
I haven't found any mention of thread-safety in the async-redis repo, but async-redis is built on top of async, which according to this issue doesn't allow sharing resources between threads. It seems logical to me: an Async::Redis::Client instance will have its own @pool (https://github.com/socketry/async-redis/blob/v0.8.0/lib/async/redis/client.rb#L38), which we are sharing between threads like in the issue I mentioned above.
I still don't have an answer to why it works in production, where Async::Redis::Client is also shared among threads.
I'm inclined to think that the reason it doesn't fail in production is either that we don't really use threaded workers there, or that load is not high enough, or pure luck. I agree that the async framework is not thread safe, but I could only demonstrate it with some changes to your test bed. Please bear with my long post.
In this part, there is a small issue with the test bed:
5.times.map { Thread.new {
Async { result.push client.get("random_key") }
}}.each(&:join)
puts result
We wait for each thread but don't wait for each Async task. So threads can die and the program may exit before all async tasks have completed, and we may (or may not) see the error message you posted above.
If we add a #wait call to the async task, then things always work. I created a gist so I'm able to link diffs and code lines. Here's the modification I'm talking about:
https://gist.github.com/akostadinov/d7a0b00fabb48ca74f08fb54090993e7/revisions?diff=unified&w=#diff-240e2d0b41b99e3ebf81ae17215a432402bfe29406f7e71df271abee4b1b544fR48
But a concurrency of 5 is not very stressing. So I changed it a little to create 55 async tasks in 55 threads, then wait for all threads and then for all async tasks before printing the results. Like this:
55.times.map { Thread.new {
Async { result.push client.get("random_key") }
}}.map(&:value).map(&:wait)
puts result
And the result was quite telling. First of all, I saw several messages in the console:
| Task may have ended with unhandled exception.
| FiberError: fiber called across threads
Additionally, the process hung indefinitely until I interrupted it with Ctrl+C. And btw, if I run the same code with 5 threads/tasks, it completes alright.
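The "fiber called across threads" failure mode can be reproduced with plain Ruby, no async gems involved. This is a minimal sketch of why sharing fiber-based machinery between threads breaks; the error is the same one quoted above:

```ruby
# A fiber belongs to the thread that created it. Resuming it from a different
# thread raises FiberError, which is what leaks out of the async pool when a
# task's fiber is touched from another thread.
fiber = Fiber.new { Fiber.yield :first; :second }

puts fiber.resume            # same thread: works, prints "first"

error = nil
Thread.new do
  fiber.resume               # different thread: not allowed
rescue FiberError => e
  error = e
end.join

puts error.message           # "fiber called across threads"
```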
Needless to say, the isolated client per thread method worked without issue, see https://gist.github.com/akostadinov/d7a0b00fabb48ca74f08fb54090993e7/revisions#diff-240e2d0b41b99e3ebf81ae17215a432402bfe29406f7e71df271abee4b1b544f
Unfortunately, I'm looking at the async framework for the first time, but it appears to be a known fact that these async tasks are not thread safe, and that tasks and pools shouldn't be shared across threads. While looking at your PR, I spotted an async project issue where the resolution was:
the pool created in one thread may have been memoized and used from another thread. I removed the memoization and it fixed the issue
This reads to me like a known design limitation. So overall the approach you took, creating an async client per thread and avoiding sneaky modifications of instance variables, is the right (if not the only) way to go forward. Thank you for doing a great detective job and for the nice test bed for thread safety.
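The per-thread-client idea can be sketched with nothing but stdlib thread-locals (Concurrent::ThreadLocalVar is the nicer production equivalent; FakeClient below is a hypothetical stand-in for Async::Redis::Client):

```ruby
# Hypothetical stand-in for a client; it records which thread created it.
class FakeClient
  attr_reader :owner

  def initialize
    @owner = Thread.current.object_id
  end
end

# Lazily build one client per thread; no instance ever crosses threads.
def client
  Thread.current[:redis_client] ||= FakeClient.new
end

owners = 5.times.map { Thread.new { client.owner } }.map(&:value)
puts owners.uniq.size   # 5: every thread built its own client
```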
An interesting question still is why the test in the master branch didn't fail. So I think it is worth spending some more time checking whether the updated code is fully sound, or whether it's just a difference in the generated instructions. We can try increasing the concurrency of the original test in master and see if it fails at some point.
P.S. I'm running the script locally, just using Ruby 3.0.4; it is reproducible like that just fine, at least for me.
@akostadinov Thanks for your research. In fact, calling .wait on the tasks makes all tests always pass with 5 threads; only the async_thread_share test fails with 55 threads.
About why it doesn't fail in master, I have a couple of theories.
In worker_async_spec.rb (the failing test) we have the worker running in a separate thread, but that doesn't represent production. In production, the worker runs in its own process. That process has two threads: one to read from the queue (code), another one to process the jobs (code). The thread processing the jobs connects to redis through the shared instance (code). But the thread reading from the queue accesses redis using the JobFetcher redis instance (code), which is initialized once during startup (code), and it creates a new instance (code) instead of taking the shared one. Observe the comment over redis_client (code). So basically there is no client shared among threads in the worker: the fetcher has its own instance, and the processor takes the shared one without sharing it with anybody else, because there are no more threads.
About the listener process, I'm not sure how multithreaded it really is. I guess Puma creates a thread for every request, but as proven in the test script we wrote, it can cope fine with a certain amount of parallelism without failing; maybe we never reach that workload in production. On the other hand, there's that problem where all threads send their commands to the pipeline whenever any thread opens a pipeline anywhere, and those commands are eventually sent to redis one after another from a single thread when the pipeline closes. That would help reduce the number of situations where multiple threads are actually accessing async_redis at the same time.
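That worker layout (one fetching thread, one processing thread, communicating only through a queue, each owning its own client) can be sketched with stdlib primitives. The names here are illustrative, not the actual apisonator classes:

```ruby
# Illustrative sketch: the fetcher owns the queue-reading side, the processor
# owns the job-running side; nothing but the Queue is shared between them.
jobs = Queue.new
processed = []

fetcher = Thread.new do        # stands in for JobFetcher with its own client
  3.times { |i| jobs << "job-#{i}" }
  jobs << :stop                # sentinel to end the processor loop
end

processor = Thread.new do      # stands in for the job-processing thread
  while (job = jobs.pop) != :stop
    processed << job           # stand-in for running the job against storage
  end
end

[fetcher, processor].each(&:join)
puts processed.size            # 3
```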
Additionally, running worker_async_spec.rb with n_reports = 150 instead of the original 10 shows the following error in the console and also deadlocks the whole process:

10.14s warn: Async::Pool::Controller [oid=0x7cec] [ec=0x2184] [pid=1623621] [2023-11-30 00:47:58 +0200]
| Closing resource while still in use!
| {"resource":"#<Async::Redis::Protocol::RESP2::Connection:0x00000000035c2fd0>","usage":1}
To me this shows that the original version also exhibited thread unsafety (as expected). For whatever reason it manifested less prominently, but it's still unacceptable.
I'd like to explore two areas of further development here: validate the new async implementation better, or possibly remove async altogether if we can observe that the sync implementation has similar performance to async.
If the multithreading is added in the tests, then we should change the tests, not the base code of apisonator.
Probably at runtime we actually avoid sharing the client between threads [1]. So the test seems inconsistent with the runtime environment in that regard. I'm concerned, though, about whether a client may leak between threads in future development or runtime configuration changes, and I'm inclined to err on the side of safety, especially while this project is new to us.
@eguzki, do you have confidence that we never access the same storage client from multiple threads?
@akostadinov
Additionally, running worker_async_spec.rb with n_reports = 150 instead of the original 10 shows the following error in the console and also deadlocks the whole process
True. Running worker_async_spec.rb on master with CONFIG_REDIS_ASYNC=true raises the Fiber called across threads error, among others, also inside the make dev container. That's the confirmation that master is broken too.
Probably at runtime we actually avoid sharing the client between threads [1]
Apisonator runs in two separate processes (even different pods in OpenShift): the worker and the listener. That comment is only about the worker. It doesn't affect the listener, where Puma usually opens a new thread for each request, and theoretically any thread could access the async_redis client.
However, it seems we explicitly configure Puma to run in single-threaded mode. If I'm not wrong, the number of workers and threads in Puma is configured here:

The value for server_model is set in the manifest file, here:

The number of workers is taken from the LISTENER_WORKERS env variable, which is set to 1 in production (you can verify this in our OpenShift). The number of threads per worker is set to [1,1] because thread_safe? is set to false:
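A sketch of that logic (the method name and the non-single-threaded range below are hypothetical; only the [1,1] range and the LISTENER_WORKERS variable come from the description above):

```ruby
# Hypothetical sketch: the Puma thread range collapses to [1, 1] whenever
# the backend is considered not thread safe.
def thread_safe?
  false                             # the backend is known not to be thread safe
end

def puma_thread_range
  thread_safe? ? [0, 16] : [1, 1]   # [min, max] threads per worker; 16 is a made-up cap
end

workers = Integer(ENV.fetch('LISTENER_WORKERS', '1'))
puts "workers=#{workers} threads=#{puma_thread_range.join(',')}"
```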
The comment above points out how they knew the entire backend is not thread safe. So everything is clear now:

* Production concurrency is based on async with fibers within a single thread
* worker_async_spec.rb is wrong because it should avoid multithreading, as in production
* We could drop the Concurrent::ThreadLocalVar hack because it probably won't fail in production

Today I've been investigating something I don't understand. When running apisonator locally with CONFIG_REDIS_ASYNC=true, I always get a No async task available! error when receiving the first request from porta.
According to some issues I found, I assume all calls to @async_redis must be done under an Async { ... } block. The async-redis docs seem to confirm that, as all examples wrap client calls in an Async block.
That's probably the reason for this block that wraps every test:
Or this one for specs (nice comment, btw):
It all makes sense, but now I'd like to know why we are not seeing this No async task available! error in production...
¯\_(ツ)_/¯
I'll let you have fun with this @akostadinov
Hi @jlledom @akostadinov, I was just following your thread.
I'm not a ruby expert at all, but I would like to share a few things I know from production. They might not be 100% accurate, but I think they are worth commenting on while you work on this issue:
* A year ago a library update of async-pool introduced a mem leak and also made performance go down like 40%, and the only reason was a change in the redis connection pool management in the library (that apparently was an innocent change), fixed I think at https://github.com/3scale/apisonator/pull/324

@slopezz thanks for your comments.
* AFAIK async mode has some limitations like only permitting 1 single worker https://github.com/3scale/apisonator/blob/master/docs/async.md#limitations
Shit, I would have saved some time if I had known about this doc before 😔.
* However, there are so many advantages to having just 1 worker that it makes a single worker worth it:
I assume by "worker" you mean one process. A process can have many threads in parallel, and each thread can have many concurrent fibers. Right now we stick to 1 process -> 1 thread -> many fibers. So I think there's room for improvement in performance while still keeping a single process, so all your points below still hold.
Anyway, this is all just ideas; in this PR I'll make as few changes as possible to implement TLS.
* we handle up to 10,000 requests/second with 95th percentile latency under 80ms
Did you use the sync mode before? Did you notice a performance boost when migrating from sync to async mode?
* A year ago a library update of `async-pool` introduced a mem leak, and also made performance went down like 40%, and the only reason was a change on redis connection pool management on the library (that apparently was an innocent change), fixed I think at [THREESCALE-7864 async-pool v0.3.12 #324](https://github.com/3scale/apisonator/pull/324)
I upgraded that gem in that PR because "Why not?", but I'm removing the upgrade, it's not needed for this PR.
I assume by "worker" you mean one process. A process can have many threads in parallel, and each thread can have many concurrent fibers. Right now we stick to 1 process -> 1 thread -> many fibers. So I think there's room for improvement in performance while still keeping a single process, so all your points below still hold.
I don't know the internals. What I know is that we change a few things in the containers compared to the default sync mode:
LISTENER_WORKERS = 1
CONFIG_REDIS_ASYNC = true
Did you use the sync mode before? Did you notice a performance boost when migrating from sync to async mode?
We switched from legacy virtual machines to OCP-async directly. However, a few performance tests (handled rps) were done by me and David (ex apisonator maintainer) in the early stages of OCP testing. I have just found a couple of examples:
https://github.com/3scale/infrastructure/issues/698#issuecomment-517296719
https://github.com/3scale/infrastructure/issues/698#issuecomment-542228578
I upgraded that gem in that PR because "Why not?", but I'm removing the upgrade, it's not needed for this PR.
I think it is OK to update libs in general to keep your code up-to-date. However, for the case of apisonator, where high performance is expected, we have seen in the past that innocent updates can break things. So it is important to do good testing on the staging environment before promoting an image to production, like done by Eguz on https://github.com/3scale/apisonator/pull/324, or even use tools like hey to do some simple load testing in staging before and after updating the image, to check whether the handled traffic is similar between the 2 images.
Actually, with our saas-operator we can easily do canary testing on staging or production environments, meaning that we can test 2 different images with live traffic, deciding how many pods of each image run at the same time, and deciding whether we want (or not) to send traffic to these canary deployments.
So when some changes are introduced in the code and we are not 100% sure they are safe, because they cannot be properly tested in staging, we can use that saas-operator canary feature to introduce a limited number of pods with a newer image, and get stats from prometheus... to see if performance is acceptable or not.
Today I've been investigating something I don't understand. When running apisonator locally with CONFIG_REDIS_ASYNC=true, I always get a
No async task available!
error when receiving the first request from porta.
I'll answer myself here. In production, the async mode is launched with Falcon instead of Puma. Falcon is ready for use with the async libraries we use in async mode, like async, async-io and async-redis.
I added some tests for TLS and ACL connections. However, they can't run inside the container because the image (quay.io/3scale/apisonator-ci:20230418-02) installs Redis 5, which supports neither TLS nor ACL.
I created an issue for that, as I'm not sure upgrading the CI image is within the scope of this PR.
IDK how this was implemented, but I see an upstream example protocol wrapper for using TLS and logical databases that easily worked for me. Not sure how easily the same can apply to sentinels though.
See my lazy PoC: https://github.com/3scale/apisonator/issues/135#issuecomment-2020485196
Closing this PR. I created https://github.com/3scale/apisonator/pull/391 to make this easier to review:
This is a work in progress to allow Apisonator to connect to Redis using ACL and TLS credentials.
Jira Issue:
THREESCALE-8404
Summary of changes:

* Stop using the redis 4.3.1 fork and use upstream redis 5.0.7:
  * Methods like :sadd returned a boolean in redis 4.3.1, but now they return what they get from the Redis server. redis 5.0.7 includes boolified versions of those methods, like :sadd?, which behave like the old :sadd.
  * redis 4.6.0 introduced some changes in the pipelined method (check the Changelog).
* The redis and async-redis gems provide different interfaces. We have an async client that reproduces the redis interface and translates it to async-redis calls.
* Avoid sharing the async-redis client (https://github.com/3scale/apisonator/pull/350/commits/c94bd275d67261aff1cd27ed42bf12e3a3bff7ef)
* The sync mode uses the redis gem, the async mode uses the async-redis gem.
* async-redis hasn't been updated, but I modified our client to add support for TLS and ACL.

Connecting to Redis using TLS and ACL is done. Running Apisonator in sync mode works fine and all tests pass in sync and async mode.
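The :sadd change mentioned above can be illustrated with a small stand-in (FakeRedis below is not the gem, just a sketch of the return-value difference between redis-rb 4.3.1 and 5.x):

```ruby
# Sketch: modern redis-rb returns the raw server reply from #sadd (an Integer),
# while #sadd? wraps it into the boolean the old API used to return.
class FakeRedis
  def initialize
    @sets = Hash.new { |h, k| h[k] = {} }
  end

  # Number of members actually added, like the raw Redis reply.
  def sadd(key, member)
    added = @sets[key].key?(member) ? 0 : 1
    @sets[key][member] = true
    added
  end

  # Boolified variant, equivalent to the old :sadd return value.
  def sadd?(key, member)
    sadd(key, member) > 0
  end
end

r = FakeRedis.new
puts r.sadd('langs', 'ruby')    # 1: newly added
puts r.sadd('langs', 'ruby')    # 0: already a member
puts r.sadd?('langs', 'go')     # true
```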
I'm blocked due to a deadlock in the async mode which I couldn't fix yet. To reproduce it, just run the tests with make test. The failing test is spec/integration/worker_async_spec.rb.

Pending tasks:

* Update test/unit/storage_sync_test.rb to add tests for TLS connections.
* Update the .3scale_backend.conf file.
* Support the ca_path ssl param.
* Remove the redis and resque forks and start using the upstream gems.
* Fix the ERR AUTH error.

About the forks mentioned above, the problem is we were using a forked version of the redis gem before: https://github.com/3scale/redis-rb

That fork included 3 PRs which are not upstream:
The fork is stuck in redis-rb 4.1.3 and we need >= 5. Those three PRs are lost in this PR, so we need to investigate what they were for and whether they are still required.

The same happens with resque. We were using a fork before: https://github.com/3scale/resque

The fork included one PR and a few commits not in upstream:
Configure server and client
This requires three steps:
Generate keys and certificates:
The first thing we need is a certification authority. For that, we need to create its key and certificate.
openssl genrsa -out ca-root-key.pem 4096
openssl req -new -x509 -days 365 -key ca-root-key.pem -out ca-root-cert.pem
Then we'll have to create a key and certificate for the server, and sign it with our CA. Make sure to set localhost as the CN, if that's the domain you're going to use to connect to Redis:

openssl genrsa -out redis.key 4096
openssl req -new -key redis.key -out redis.csr
openssl x509 -req -days 365 -in redis.csr -CA ca-root-cert.pem -CAkey ca-root-key.pem -CAcreateserial -out redis.crt
Configure Redis:
First, we need a Redis configuration file which enables ACL and TLS, defines the users, and sets the certificate the server will use and the CA the server will trust. This is the minimal redis.conf for that purpose:

The next step is to launch a container that has access to the certificates, keys and Redis config files created above. We create a volume for each file and modify the container start command to reference the configuration file we created. This is how I configured the redis pod in my docker-compose.yml
, but the container can be launched in other ways:

The :z at the end of the volume definitions is required when using SELinux, for instance in Fedora.

Configure Apisonator:
Apisonator can be configured as a regular TLS client, where only the server needs a certificate, or as a mutual TLS client, where both client and server must provide a certificate.
As TLS Client:
What I did is create a ~/.3scale-backend.conf, based on openshift/3scale_backend.conf, and set these properties:

Note how the ssl_params sections are not always required. We are using them here because the server is using a certificate signed by an unknown authority we've just created, so we need to explicitly tell Apisonator to trust that authority. If the server were using a certificate signed by any of the well-known CAs, the ssl_params sections could be omitted.

Another possible situation is when the server is using a self-signed certificate. When this happens, there's no trusted CA to add to ssl_params, so the only way to go is to skip certificate validation. Unfortunately, we don't support this yet because the redis-client gem still hasn't added support for verify modes (see: https://github.com/redis-rb/redis-client/issues/133), so we can't set SSL_VERIFY_NONE on the client. Anyway, this is only useful for development purposes; we'll have to use our own CA for development for the moment.

As Mutual TLS Client:
Generate a key and a signed certificate for Apisonator:
openssl genrsa -out apisonator.key 4096
openssl req -new -key apisonator.key -out apisonator.csr
openssl x509 -req -days 365 -in apisonator.csr -CA ca-root-cert.pem -CAkey ca-root-key.pem -CAcreateserial -out apisonator.crt
Then update the config file to set the next properties:
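Purely as an illustration of the shape such a config could take (every setting name below is an assumption; the authoritative names are in openshift/3scale_backend.conf and docs/configuration.md):

```ruby
# Hypothetical sketch of the mutual TLS client settings; names are assumptions.
ThreeScale::Backend.configure do |config|
  config.redis.url = 'rediss://localhost:46379/0'  # rediss:// scheme for TLS
  config.redis.ssl_params = {
    ca_file: '/path/to/ca-root-cert.pem',  # our private CA
    cert:    '/path/to/apisonator.crt',    # client certificate
    key:     '/path/to/apisonator.key'     # client private key
  }
end
```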
As mentioned above, the ca_file fields are only required when the server isn't using a certificate signed by one of the well-known CAs.

It could also be configured using the new ENV variables (see docs/configuration.md).