Unexpected behavior when "ARR affinity" cookies is disabled

MikeAlhayek commented 2 years ago

Describe the bug

I am running OC CMS by using multiple Azure services. For this app, I am using the following services

Storage Account to host "dataprotection", "shell" and media files.
SQL database to host the data
Redis cache to distribute my caching
Azure Web Service for Containers
Other services that may not be relative to this problem like Azure Registry...

My app is running using 3 instances of the web services. so when a user makes a request to the website there are 3 different instances that could potentially handle the request. I am testing out the app scalability and would like to disable the ARR Affinity cookie. When the ARR cookie is disabled, the app seems to not work as expected. It's like the app can't handle the "form post request which is a second request a user would make to submit a web form".

To Reproduce

Steps to reproduce the behavior:

Add the following services to the host project

services.AddOrchardCms()
    .AddAzureShellsConfiguration()
    .AddGlobalFeatures("OrchardCore.DataProtection.Azure");
    .AddSetupFeatures("OrchardCore.Redis.Lock")
    .AddGlobalFeatures("OrchardCore.Redis.Cache", "OrchardCore.Redis.Lock");

Add the following settings in the configuration section in the Azure Web Service a.OrchardCore__ConnectionString=ConnectionStringToAzureDatabase b.OrchardCore__DatabaseProvider=SqlConnection c.OrchardCore__Default__State=Uninitialized d.OrchardCore__Default__TablePrefix=Default e.OrchardCore__OrchardCore_DataProtection_Azure__ConnectionString=ConnectionStringToStorageAccount f.OrchardCore__OrchardCore_DataProtection_Azure__ContainerName=dataprotection g.OrchardCore__OrchardCore_Media_Azure__BasePath=Media h.OrchardCore__OrchardCore_Media_Azure__ConnectionString=ConnectionStringToStorageAccount i.OrchardCore__OrchardCore_Media_Azure__ContainerName=media-{{ ShellSettings.Name }} j.OrchardCore__OrchardCore_Media_Azure__CreateContainer=true k.OrchardCore__OrchardCore_Redis__Configuration=ConnectiontToTheRedisDatabase,allowAdmin=true l.OrchardCore__OrchardCore_Shells_Azure__ConnectionString=ConnectionStringToStorageAccount m.OrchardCore__OrchardCore_Shells_Azure__ContainerName=hostcontainer
Make sure you scale out the app service to at least 3 instances
Publish the app to azure web service using docker containers.
Setup your default site using SaaS recipe. Ensure that the app is working with no know error as it should.
Let's attempt to reproduce the issue now by disabling the ARR Affinity cookie as explained here
With the ARR cookies disabled, go to the "Tenants" section of your dashboard and add 2+ new tenant. The results will be that one or more of the tenants won't be added or show up.
If you now enable the ARR Affinity cookie some of the tenants you added in the previous step my show up. Weird?

Expected behavior

When the ARR Affinity cookie is disabled, everything should work the same since my app is using a single instance of the shell, data-protection and distributing my cache using Redis database.

jtkech commented 2 years ago

To keep in sync tenant states accross instances you have to enable the Distributed Tenants feature

MikeAlhayek commented 2 years ago

To keep in sync tenant states across instances you have to enable the Distributed Tenants feature

@jtkech thank you! I enabled the "Distributed Tenants" on the default tenant where the SaaS recipe is enabled, but that still isn't working as expected.

I added 2 tenants in addition to the default one. But, sometime when I refresh the Tenants page I see only 2 and sometimes I see all 3 and other time I see only the default tenant. So something isn't right where each instance is reading from a different source. I could not even enabled the "Distributed Tenants" while the ARR cookie disabled. I had to turn it back on and then enable the "Distributed Tenants" feature. It's like each instance is caching it's own IDocumentManager<SiteSettings> instead of using a distributed cache.

Do I need to also enable other distributed cache features like OrchardCore.Redis.DataProtection and OrchardCore.Redis.Bus? Also, will the OrchardCore.Redis.Cache feature ensure everything cashed in memory is caches in Redis instead like the app settings?

jtkech commented 2 years ago

Yes, you need a stateless config as most as possible, also for tenant settings/configs.

I saw that you use AddAzureShellsConfiguration, I only tried AddDatabaseShellsConfiguration, but looks okay.

Do I need to also enable other distributed cache features like OrchardCore.Redis.DataProtection

Not required but good to use it.

and OrchardCore.Redis.Bus

No, we don't use it internally, but you could if you use it explicitly.

Also, will the OrchardCore.Redis.Cache feature ensure everything cashed in memory is caches in Redis instead like the app settings?

We use the distributed cache for different documents as site settings, templates, layers, and so on, all that uses the IDocumentManager, also data used by the DynamicCache e.g. for shape caching ... We don't cache shell settings/configs, they should be in shared stores as blob storage or the shared database, we only cache tenant states, so that a given instance is aware if a tenant was created or released so that it will be reloaded on a next demand, this by comparing states in local memory from states in the distributed cache, this is when a tenant is reloaded that settings/configs will be get again from the shared stores.

Hmm, after enabling the Distributed Tenants feature, you need to stop all your instances and then restart.

MikeAlhayek commented 2 years ago

@jtkech That seems to have worked. Here is what I did

Added OrchardCore.Redis.DataProtection feature globally like this .AddGlobalFeatures("OrchardCore.Redis.Cache", "OrchardCore.Redis.Lock","OrchardCore.Redis.DataProtection")
Scaled down/in my app to 1 instance.
Enabled "Distributed Tenants" on the default instance
Disabled the ARR Affinity cookie
Scaled out my app back to 3 instances

jtkech commented 2 years ago

@CrestApps Good to know, thanks!

Yes for sure, forgot to say that to make a given Redis feature working, it should be enable on each tenant, can be done by using setup recipes including it, or as you did from the app, can be also done from configuration.

MikeAlhayek commented 2 years ago

@jtkech Thank you!

I am using .AddGlobalFeatures("OrchardCore.Redis.Cache", "OrchardCore.Redis.Lock","OrchardCore.Redis.DataProtection") to enable these features for every clients. Is there something else I need to do in the recipe?

I am actually experiencing an issue when setting up a new tenant. When I click on "Setup" on a new tenant, the request process for few seconds and then it shows blank while screen and the site never gets setup. Unfortunately, I can seem to be able to view the log stream in Azure Web Service to see what could be erroring. When I scale down to a single instance, the setup works. Not sure how enable logging so logs from all instances is written to the log stream. Also, not sure what else do I need to do to setup new tenant.

The post request on setting up the new client return HTTP Code 400. This only happens when I am using multiple nodes not when I am using a single instance.

jtkech commented 2 years ago

For the logs would need a shared App_Data folder under which you could see them, or the ability to connect to a given VM to see their local files.

You could try to include the related service in all recipes that you are using to setup tenants. But normally for the tenant states themselves you only need to have Redis Cache and Lock on the Default tenant of each instance, having the Cache on other tenants is still important to keep in sync other document data accross instances.

Hmm, is it working if you re-enable ARR’s Instance Affinity, one problem I see is that when a tenant was created we show a link that contains a secret token in the query string, based on the secret of the tenant specific settings

{
  "DatabaseProvider": "Sqlite",
  "RecipeName": "Blog",
  "Secret": "18ef57f8-6875-4f4c-8102-fe2f5e8e426b"
}

Maybe we have to wait a little so that other instances are in sync with these new tenant specific settings, or maybe the generated token is only known by the instance that rendered the Tenant Index page, or not yet shared through the Redis DataProtection

entry.Token = dataProtector.Protect(x["Secret"], _clock.UtcNow.Add(new TimeSpan(24, 0, 0)));

MikeAlhayek commented 2 years ago

@jtkech It seems to be working fine with the ARR Affinity cookie is enabled. When we generate the token in the query, that token should be available to all instances just like the aniforgery token or the session manager. I think everything should be read from a shared resources database, file, memory... but all instances would need to always read from the same source without having to worry about instances to sync.

jtkech commented 2 years ago

that token should be available to all instances just like the aniforgery token

Yes I agree, normally that's the goal of the Redis DataProtection

but all instances would need to always read from the same source without having to worry about instances to sync.

I don't think so, currently they reload data when a tenant state was changed. Here the code where we check from the distributed cache if tenants were created, in which case we load data from shared sources.

https://github.com/OrchardCMS/OrchardCore/blob/d187e857038db766d7fe7fe832f638e6a0bbe426/src/OrchardCore/OrchardCore/Shell/Distributed/DistributedShellHostedService.cs#L139-L152

Hmm, seems that here the only problem is around the DataProtection of the Default tenant on each instance, we use the oob aspnetcore Redis DataProtection.

Can you retry it and wait a little after having created a new tenant, also try to clear all your browser cookies.

MikeAlhayek commented 2 years ago

@jtkech Thank you for the feedback. So here is what I have done

Disabled the ARR cookie
Scaled out to 2 instances
Started a new private browsing session "no cookies"
Added Demo4 tenant. It showed up on the tenants screen right away
Added Demo5 tenant. It showed up on the tenants screen after couple of refreshes until the load balancer placed my request on the server with the most recent shell-info.
Waiting about 5 mins and tried to setup Demo4 tenant. This did not work. I got HTTP Code 400 in response to the post request.
Waiting little over an hour and tried to setup Demo5. Unfortunately, I am still getting HTTP 400.

jtkech commented 2 years ago

@CrestApps

Sorry, would need to try it but quite busy, hmm and with a paid plan to scale out to multiples instances ;) Or still use your plan for testing and if you give me some accesses I could try to debug it when I will have time, as I remember there are Kudu tools but that I didn't use for a while.

Anyway I'm interested to understand what happens, are you using OrchardCore as packages? Would be good to use the source code for testing, e.g. in the setup controller we could comment out where it checks the secret token and see what happens, and so on step by step.

Can you already check what is stored in your shared containers for shell settings/configs, can you find some data protection keys in your Redis instance. When you are logged in to go to the Admin, do you need to login again e.g. when the request is sent to another instance.

In the meantime for info, looking at the code I saw that here we always use the data protection provider of the Default tenant, to generate the token on a given instance, and to check it on a given instance when setting up any tenant. The secret token is valid 24 hours, if it doesn't match we return a 404, if not authorized we return 403.

jtkech commented 2 years ago

In #11526 we saw that the antiforgery cookie name depends on the ContentRootPath that may be different for the same app but accross different instances.

jtkech commented 2 years ago

@CrestApps

Because of the following I'm re-opening this issue.

Okay, with a local single K8s cluster node I installed locally the same kind of configuration, one redis pod, one mssql pod and for now 2 OC pods, and I used AddDatabaseShellsConfiguration() in place of AddAzureShellsConfiguration().

The "good new" is that I can repro locally the exact same behavior, so I will be able to investigate without having to subscribe for a paid plan on Azure during all the tests ;)

The first test I did is to decorate the SetupController with [IgnoreAntiforgeryToken], then all was working fine, no problem with the tenant secrets, tenant identifiers, authentications and so on. This allows us to better localize the source of the problem, we now can focus on Antiforgery.

Yes, maybe because when a tenant is not yet initialized it still uses at the very beginning some host level registrations, I will look at it more in details when I will have time, but for now not so easy to debug in pod containers with tools that I haven't used for a year ;)

https://github.com/OrchardCMS/OrchardCore/blob/4ff4e4b95804e6064daace3ed3e5f7bbce0cb6c0/src/OrchardCore/OrchardCore/Modules/Extensions/ServiceCollectionExtensions.cs#L306-L319

MikeAlhayek commented 2 years ago

I am glad you are able to re-produce this issue locally! This is very good first step! Thanks for recalling this issue

jtkech commented 2 years ago

@CrestApps

So, we need a distributed DP while the tenant is in a setup context. One solution would be to inherit of some host level registrations, I made it working but by also preventing some tenant level registrations while the tenant is not initialized (as we currently do for AFT) and by adding e.g in the app startup.

var redis = ConnectionMultiplexer.Connect("connectionString"); services.AddDataProtection() .PersistKeysToStackExchangeRedis(redis, "DataProtection-Keys");

But this is not the right solution, to enable features while a given tenant is in a setup context, the right solution is to just use AddSetupFeatures() (see below), so that all still happens at the tenant level.

First it was not working because currently we don't add tenant level AFT if the tenant is not initialized, this is related to some decryption errors we had when a cookie was already generated by another instance but with different DP keys. But we now use unique cookie names based on the settings VersionId, so I re-added the tenant level AFT even if in a setup context, and all was working.
So, the solution is to add the needed stateless features through recipe(s) or AddGlobalFeatures() from the app as you did. Then, as we need a distributed DP while a given tenant is in a setup context, we just need to add e.g. the redis DP to the setup features (even if already added to the global features).

services.AddOrchardCms() .AddDatabaseShellsConfiguration() .AddSetupFeatures( "OrchardCore.AutoSetup", "OrchardCore.Redis.DataProtection");

But as said, to make it working we need to re-introduce the tenant level AFT registrations even if the tenant in a setup context is in the Uninitialized state, I will do a PR for this.

MikeAlhayek commented 2 years ago

@jtkech I think the problem goes beyond the setup page. The issue happens with any form submitting. Did you try to perform other actions like creating/editing content?

jtkech commented 2 years ago

@CrestApps

But now we don't have the same config, for the setup concern I added the redis DP (good to try if you were using azure DP) to the AddSetupFeatures(), I updated the code to still register tenant level ATF even if the tenant is uninitialized, so that we don't mix different DPs, e.g. the default one from the host.

We now use unique cookie names, maybe not yet the case for you, also I needed another update, at some point the settings VersionId was null, leading up to an __orchantiforgery_ cookie name which was no more unique, so I updated it and did the tests again after having cleared previous cookies.

Just setup 10 tenants (from different opened browsers) without any issues, and then edited / published content items in different tenants, all is working fine on my side. To try it on your side you will have to wait for the tenant level ATF updates that I mentioned, I will do a PR soon.

MikeAlhayek commented 1 year ago

@jtkech, @ns8482e seems to be experiencing similar issue to this one but he is running on Kubernetes cluster not on AppService.

An example of the issue is when he enables a feature an alert stating that the feature was enabled will be displayed but the status of the feature will show disabled when the page is loaded "post request". After few refreshes he gets the correct state of the feature. It's like the user during the post request is handled by a different node on the cluster that isn't aware of the shell reload.

Do we store the IoC container in a distributed cache? Meaning, if the shell is released do we build one container for all of the nodes? I am guessing not. If not, when we release shell, do we have a way ensure that all the nodes in a cluster are released? I know you worked on a PR that had to do with clustering OC internally. Not sure what is the status in that PR.

sebastienros commented 1 year ago

@MikeAlhayek did you enable any cache/reload synchronization feature like Redis ones? Have you setup a way to share data protection keys too?

MikeAlhayek commented 1 year ago

@sebastienros yes. He is using a shared directory for protection keys, media and shell host. He is using Redis for distributed caching and using the following features

"Redis", "Redis Bus", "Redis Cache", "Redis DataProtection", "Redis Lock"

I think when a shell-release happens, nodes on the same cluster are not aware of this important event which could be causing the disconnect.

jtkech commented 1 year ago

To keep in sync tenants states we have the OC.Tenants.Distributed feature

jtkech commented 1 year ago

Description = "Keeps in sync tenants states, needs a distributed cache e.g. 'Redis Cache' and a stateless configuration, see: https://docs.orchardcore.net/en/latest/docs/reference/core/Shells/index.html",

ns8482e commented 1 year ago

@sebastienros , @MikeAlhayek @jtkech

I have a plain OC 1.7 Project with following program.cs and docker image created that deployed using kubernetes. with connection to redis

using OrchardCore.Logging;
using Surevelox.OrchardCore.K8S.Host;

var builder = WebApplication.CreateBuilder(args);
builder.Host.UseNLogHost();

builder.Services.AddOrchardCms()

    .AddGlobalFeatures("OrchardCore.Redis.Cache")
    .AddGlobalFeatures("OrchardCore.Redis.Lock")
    .AddGlobalFeatures("OrchardCore.Redis.Bus")
    .AddGlobalFeatures("OrchardCore.Tenants.Distributed")
    .AddSetupFeatures("OrchardCore.Redis.Cache")
    .AddSetupFeatures("OrchardCore.Redis.Lock")
    .AddSetupFeatures("OrchardCore.Redis.Bus")
    .AddSetupFeatures("OrchardCore.Tenants.Distributed")

    ;

var app = builder.Build();

app.UseOrchardCore();
app.Run();

jtkech commented 1 year ago

Looks like you are missing OC.Redis.DataProtection.

ns8482e commented 1 year ago

Do I need to have it enabled if that uses shared folder with persisted volume claim?

jtkech commented 1 year ago

Ah okay, yes normally it should work.

jtkech commented 1 year ago

Humm, yes Tenants.Distributed keeps in sync tenant states but the polling time period is 10 seconds.

MikeAlhayek commented 1 year ago

@jtkech Polling for 10 second could explain the odd behavior. Nodes need up to 10 seconds to sync. So the post request will likely to be handled by a tenant that isn't yet reloaded which will render wrong state. I think unless we enable ARR cookie, this issue will always be a problem. Instead of polling, can we create event/listener across nodes using shared queue of some sort?

jtkech commented 1 year ago

At some point we thought about using event bus (we have a RedisBus feature), currently we sync in an hosted service based on identifier states, maybe we could use both.

jtkech commented 1 year ago

Yes on Azure we have ARR affinity, on another server we may need the tenant Clusters I was working on.

MikeAlhayek commented 1 year ago

Yeah RedisBus may do the job. I bet @ns8482e would love to take the lead and implement it to solve this issue 🤪

We could also use Azure Service Bus or Azure Event Hubs for this. If I have time, I may try it out.

ns8482e commented 1 year ago

@jtkech @sebastienros @MikeAlhayek created a bug https://github.com/OrchardCMS/OrchardCore/issues/14373

ns8482e commented 1 year ago

Atleast we should check tenant state in request pipeline and not wait for 10 seconds for hosted service to update

jtkech commented 1 year ago

Okay

For info I was wrong, there is no pooling time, only an idle time and of 1 second.

Maybe worth to try it without idle time or a smaller one

OrchardCMS / OrchardCore