Azure / Azure-Functions

1.11k stars 197 forks source link

Host ID Collision Error #2049

Open mathewc opened 3 years ago

mathewc commented 3 years ago

Motivation

This change is meant to address longstanding issue https://github.com/Azure/azure-functions-host/issues/2015. The underlying issue is caused by the following things being true:

  1. The computed host id is long enough that it requires truncation (app name > 32 characters in length).
  2. The Function app is sharing a storage account with another Function app whose truncated host id is the same.
  3. The Function app is using one or more trigger types that store state under HostID in the storage account

If these are all true, then there is a collision and you can end up with unexpected outcomes as the two apps share state.

When this issue is detected for an app a log will be written. In v4 this will be an Error. In v3 this will be a Warning:

"A collision for Host ID '{hostid}' was detected in the configured storage account. For more information, see https://aka.ms/functions-hostid-collision."

Impact

At the point in time when this issue was create, about 28% of v3 apps have a name subject to truncation. Of those, only apps that are sharing storage accounts would be affected, which is likely a much lower percentage.

Compat-mode support

There will be an app setting FUNCTIONS_HOSTID_CHECK_LEVEL that supports values Warning/Error. In v3 the default will be Warning, and in v4 it will be Error. By default when this issue is detected in v4 it will be a hard failure. As an escape hatch this app setting can be used to change back to warning if needed.

Alternatives

Yes we discussed alternatives, but none of those were acceptable.

Detection

The v3 warning can be used to detect this.

Support

The error will have an aka.ms link with documentation on the issue and the recommended course of action for affected customers to take.

Documentation

Documentation impact

Components impacted

Functions Host

Performance

No.

ijabit commented 2 years ago

In the upgrade docs here it says you should "Use a separate storage account for each function app" and then links to this GitHub Issue. But in the docs here it says, "It's possible for multiple function apps to share the same storage account without any issues". It does go on to say that "... this approach isn't always a good idea in a production environment", but it doesn't say not to do it. Can this be clarified in the docs?

balag0 commented 2 years ago

Its okay from the point of functionality as function apps will not step over each other if multiple apps sharing the same storage account in general (except for the issue mentioned above). But it is not recommended due concerns around best practices like availability (single point of failure), security (better isolation/access controls when each app has its own secret)

Novak-Peter commented 2 years ago

We ran into problems with this, using Functions v4, dotnet-isolated; however, the only symptom was that the job host constantly restarted, there were no errors nor warnings logged anywhere (while the constant stop-start was logged properly) (this was true for multiple functions, separately). The root cause was indeed the too long name, the issue was actually caused by the function and its staging slot. Actually not both the functions were restarting, only one, sometimes it was the staging slot but sometimes the live one.

The workaround we used is a slot-level config override for AzureFunctionsWebHost__hostId.

Could you please check how is it possible that there were no errors logged at all?

That's all the logs we see:

2022-04-26T12:30:59.278 [Information] Host lock lease acquired by instance ID 'eb635ef47d9374f47ee1dde305a3c9fd'.
2022-04-26T12:31:21.059 [Information] FUNCTIONS_WORKER_RUNTIME set to dotnet-isolated. Skipping WorkerConfig for language:java
2022-04-26T12:31:21.063 [Information] FUNCTIONS_WORKER_RUNTIME set to dotnet-isolated. Skipping WorkerConfig for language:node
2022-04-26T12:31:21.063 [Information] FUNCTIONS_WORKER_RUNTIME set to dotnet-isolated. Skipping WorkerConfig for language:powershell
2022-04-26T12:31:21.103 [Warning] The Dashboard setting is no longer supported. See https://aka.ms/functions-dashboard for details.
2022-04-26T12:31:21.190 [Information] Initializing Warmup Extension.
2022-04-26T12:31:21.236 [Information] Initializing Host. OperationId: '1c31c0b9-8022-4828-8d8a-404dea961f31'.
2022-04-26T12:31:21.240 [Information] Host initialization: ConsecutiveErrors=0, StartupCount=1, OperationId=1c31c0b9-8022-4828-8d8a-404dea961f31
2022-04-26T12:31:21.279 [Information] ApplicationInsightsLoggerOptions{"SamplingSettings": {"EvaluationInterval": "00:00:15","InitialSamplingPercentage": 100.0,"MaxSamplingPercentage": 100.0,"MaxTelemetryItemsPerSecond": 20.0,"MinSamplingPercentage": 0.1,"MovingAverageRatio": 0.25,"SamplingPercentageDecreaseTimeout": "00:02:00","SamplingPercentageIncreaseTimeout": "00:15:00"},"SamplingExcludedTypes": "Request","SamplingIncludedTypes": null,"SnapshotConfiguration": null,"EnablePerformanceCountersCollection": true,"HttpAutoCollectionOptions": {"EnableHttpTriggerExtendedInfoCollection": true,"EnableW3CDistributedTracing": true,"EnableResponseHeaderInjection": true},"LiveMetricsInitializationDelay": "00:00:15","EnableLiveMetrics": true,"EnableDependencyTracking": true,"DependencyTrackingOptions": null}
2022-04-26T12:31:21.280 [Information] LoggerFilterOptions{"MinLevel": "None","Rules": [{"ProviderName": null,"CategoryName": null,"LogLevel": null,"Filter": "<AddFilter>b__0"},{"ProviderName": "Microsoft.Azure.WebJobs.Script.WebHost.Diagnostics.SystemLoggerProvider","CategoryName": null,"LogLevel": "None","Filter": null},{"ProviderName": "Microsoft.Azure.WebJobs.Script.WebHost.Diagnostics.SystemLoggerProvider","CategoryName": null,"LogLevel": null,"Filter": "<AddFilter>b__0"},{"ProviderName": "Microsoft.Azure.WebJobs.Logging.ApplicationInsights.ApplicationInsightsLoggerProvider","CategoryName": null,"LogLevel": "Trace","Filter": null}]}
2022-04-26T12:31:21.280 [Information] LoggerFilterOptions{"MinLevel": "None","Rules": [{"ProviderName": null,"CategoryName": null,"LogLevel": null,"Filter": "<AddFilter>b__0"},{"ProviderName": "Microsoft.Azure.WebJobs.Script.WebHost.Diagnostics.SystemLoggerProvider","CategoryName": null,"LogLevel": "None","Filter": null},{"ProviderName": "Microsoft.Azure.WebJobs.Script.WebHost.Diagnostics.SystemLoggerProvider","CategoryName": null,"LogLevel": null,"Filter": "<AddFilter>b__0"},{"ProviderName": "Microsoft.Azure.WebJobs.Logging.ApplicationInsights.ApplicationInsightsLoggerProvider","CategoryName": null,"LogLevel": "Trace","Filter": null}]}
2022-04-26T12:31:21.280 [Information] ConcurrencyOptions{"DynamicConcurrencyEnabled": false,"MaximumFunctionConcurrency": 500,"CPUThreshold": 0.8,"SnapshotPersistenceEnabled": true}
2022-04-26T12:31:21.280 [Information] FunctionResultAggregatorOptions{"BatchSize": 1000,"FlushTimeout": "00:00:30","IsEnabled": true}
2022-04-26T12:31:21.280 [Information] SingletonOptions{"LockPeriod": "00:00:15","ListenerLockPeriod": "00:01:00","LockAcquisitionTimeout": "10675199.02:48:05.4775807","LockAcquisitionPollingInterval": "00:00:05","ListenerLockRecoveryPollingInterval": "00:01:00"}
2022-04-26T12:31:21.280 [Information] ServiceBusOptions{"PrefetchCount": 0,"MessageHandlerOptions": {"AutoComplete": true,"MaxAutoRenewDuration": "00:05:00","MaxConcurrentCalls": 32},"SessionHandlerOptions": {"AutoComplete": true,"MaxAutoRenewDuration": "00:05:00","MaxConcurrentSessions": 2000,"MessageWaitTimeout": "00:01:00"},"BatchOptions": {"MaxMessageCount": 1000,"OperationTimeout": "00:01:00","AutoComplete": true}}
2022-04-26T12:31:21.281 [Information] Starting JobHost
2022-04-26T12:31:21.282 [Information] Starting Host (HostId=[too-long-functionname-trimmed-], InstanceId=8dd736e0-24c3-44b8-958c-073ca23ee788, Version=4.3.0.17971, ProcessId=15204, AppDomainId=1, InDebugMode=True, InDiagnosticMode=False, FunctionsExtensionVersion=~4)
2022-04-26T12:31:21.290 [Information] FUNCTIONS_WORKER_RUNTIME set to dotnet-isolated. Skipping WorkerConfig for language:java
2022-04-26T12:31:21.290 [Information] FUNCTIONS_WORKER_RUNTIME set to dotnet-isolated. Skipping WorkerConfig for language:node
2022-04-26T12:31:21.291 [Information] FUNCTIONS_WORKER_RUNTIME set to dotnet-isolated. Skipping WorkerConfig for language:powershell
2022-04-26T12:31:21.295 [Information] Loading functions metadata
2022-04-26T12:31:21.333 [Information] Reading functions metadata
2022-04-26T12:31:21.384 [Information] 1 functions found
2022-04-26T12:31:21.394 [Information] 1 functions loaded
2022-04-26T12:31:21.408 [Information] Generating 1 job function(s)
2022-04-26T12:31:21.444 [Information] Found the following functions:Host.Functions.HubListenerFunction
2022-04-26T12:31:21.475 [Information] Initializing function HTTP routesNo HTTP routes mapped
2022-04-26T12:31:21.479 [Information] Host initialized (192ms)
2022-04-26T12:31:21.486 [Information] HttpOptions{"DynamicThrottlesEnabled": false,"EnableChunkedRequestBinding": false,"MaxConcurrentRequests": -1,"MaxOutstandingRequests": -1,"RoutePrefix": "api"}
2022-04-26T12:31:21.579 [Information] Host started (296ms)
2022-04-26T12:31:21.579 [Information] Job host started
2022-04-26T12:31:22.277 [Information] Host Status: {"id": "[too-long-functionname-trimmed-]","state": "Running","version": "4.3.0.17971","versionDetails": "4.3.0+64614b7392f36e3871862d22d21e72304df901e9","platformVersion": "97.0.10.675","instanceId": "eb635ef47d9374f47ee1dde305a3c9fdb204691970dda38684f67a86ac3e26fb","computerName": "wn0sdwk0007EC","processUptime": 17534,"functionAppContentEditingState": "Unknown"}
2022-04-26T12:31:22.280 [Information] Host Status: {"id": "[too-long-functionname-trimmed-]","state": "Running","version": "4.3.0.17971","versionDetails": "4.3.0+64614b7392f36e3871862d22d21e72304df901e9","platformVersion": "97.0.10.675","instanceId": "eb635ef47d9374f47ee1dde305a3c9fdb204691970dda38684f67a86ac3e26fb","computerName": "wn0sdwk0007EC","processUptime": 17545,"functionAppContentEditingState": "Unknown"}
2022-04-26T12:31:22.353 [Information] Stopping JobHost
2022-04-26T12:31:22.354 [Information] Stopping the listener 'Microsoft.Azure.WebJobs.ServiceBus.Listeners.ServiceBusListener' for function 'HubListenerFunction'
2022-04-26T12:31:22.438 [Information] Worker process started and initialized.
2022-04-26T12:31:26.317 [Information] Host lock lease acquired by instance ID 'eb635ef47d9374f47ee1dde305a3c9fd'.
kepikoi commented 2 years ago

Our v3 Functions instances are configured to use unique hubNames but share a single storageProvider.trackingStoreNamePrefix in a single storage account. This setup allows each Functions instance to fetch the orchestration statuses of all the other running instances from a single source of truth. Using this technique, it is possible to apply a side-by-side deployment scenario of our Durable Functions based asynchronous API wherein the superseded "legacy" function backends of our API gateway controlled service can complete their running orchestrations while the "current" functions backend version can execute own orchestrations while still being able to retrieve the processing status of the legacy backend.

I understand that this setup would be impossible when using separate storage accounts, correct?

kepikoi commented 1 year ago

We are still facing the same challenge... Any ideas regarding my context?

ulluoink commented 3 months ago

@mathewc

I do not fully understand why we are geting this error from all of a sudden.

we have several functionapps which share one storage account and some of the functionapp names are longer than 32 characters.

But it worked for several years.

Furthermore I do not understand where to find and check the host ids. IN the storage account under azure-webjobs-hosts/ids/usage

there are several entries but only one contains the name of one of our functions apps.

The others are like this:

functionsv3x64placeholdertemplat functionsv4x64dotnetisolated6pla functionsv4x64dotnetisolated7pla functionsv4x64dotnetisolated8pla functionsv4x64node14placeholdert

etc. etc.

So for me it looks like you do not use the functionapp name to generate a host id but some other strange and weird method which really could not work ?

AND why is this suddenly changed?