Azure / azure-functions-python-worker

Python worker for Azure Functions.
http://aka.ms/azurefunctions
MIT License

[BUG] Function Fails to Load with PYTHON_ISOLATE_WORKER_DEPENDENCIES=1 and newrelic dependency #1339

Closed ajstewart closed 5 months ago

ajstewart commented 8 months ago

Investigative information

We have a couple of Azure Functions on the Python Linux consumption plan that have been running well. On October 26th around midnight UTC it appears a new version of the runtime was released, and this caused the functions to stop working completely, to the point where they were no longer being loaded - in one case giving a python exited with code 139 (0x8B) error.

We were also unable to deploy the function successfully using a GitHub Action workflow or through the CLI.

We were able to narrow it down to the import of the newrelic python module - if all code was commented out then the function was able to deploy and run.

We then realised that the setting of PYTHON_ISOLATE_WORKER_DEPENDENCIES=1 was the culprit. If this was removed then the function could load with the newrelic module still in use.

Repro steps

Please see this repo for a minimal example: https://github.com/ajstewart/azure-function-python-isolate-workers-error/tree/main.

If PYTHON_ISOLATE_WORKER_DEPENDENCIES=1 is set then the function fails to deploy through the CLI:

Deployment successful. deployer = Push-Deployer deploymentPath = Functions App ZipDeploy. Extract zip. Remote build.
Remote build succeeded!
[2023-10-31T11:44:55.112Z] Syncing triggers...
[2023-10-31T11:45:05.241Z] Syncing triggers...
[2023-10-31T11:45:24.763Z] Syncing triggers...
[2023-10-31T11:45:31.699Z] Syncing triggers...
[2023-10-31T11:45:37.215Z] Syncing triggers...
[2023-10-31T11:45:41.595Z] Syncing triggers...
Error calling sync triggers (BadRequest). Request ID = '354f9cec-793e-4d74-beaa-070b3dd6e767'.

Workarounds:

  Remove the PYTHON_ISOLATE_WORKER_DEPENDENCIES=1 app setting (the function then loads with the newrelic module still in use)

OR

  Remove/comment out the newrelic import in the function code.
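
For reference, removing the app setting could be done with something like the Azure CLI below (placeholder names; az functionapp config appsettings delete is assumed to be the relevant command for a function app):

```
az functionapp config appsettings delete \
  --name <function-app-name> \
  --resource-group <resource-group> \
  --setting-names PYTHON_ISOLATE_WORKER_DEPENDENCIES
```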

Of course this seems to be specific to the newrelic module for us, but it had been working up until last week, and this caused a complete outage of our functions.

Extra

I was also able to replicate a failing function using the docker function image and an integration test. Though interestingly the function did load when running in the CLI with func start.

KeyvanArj commented 8 months ago

I had an issue like this: when PYTHON_ISOLATE_WORKER_DEPENDENCIES = 1 was set, the Python function deployment failed. So I decided to solve it this way:

az webapp config appsettings delete --name MyWebApp --resource-group MyResourceGroup --setting-names PYTHON_ISOLATE_WORKER_DEPENDENCIES
az webapp config appsettings set -g MyResourceGroup -n MyUniqueApp --settings PYTHON_ISOLATE_WORKER_DEPENDENCIES=1

It works now in CI/CD.

ajstewart commented 8 months ago

Thanks @KeyvanArj. I tried this, however as soon as PYTHON_ISOLATE_WORKER_DEPENDENCIES is set the function fails to load again after the restart.

So with this particular problem the function fails to load at all with that setting in place. Previously it had been working fine for almost a year up to this point.

ajstewart commented 8 months ago

I can add that the log stream shows this when the setting is applied.

2023-11-02T09:26:06Z   [Verbose]   Received request to drain the host
2023-11-02T09:26:06Z   [Information]   DrainMode mode enabled
2023-11-02T09:26:06Z   [Information]   Calling StopAsync on the registered listeners
2023-11-02T09:26:06Z   [Information]   Call to StopAsync complete, registered listeners are now stopped
2023-11-02T09:26:22Z   [Information]   Host lock lease acquired by instance ID '0000000000000000000000007F4883CB'.
2023-11-02T09:26:36Z   [Verbose]   Received request to drain the host
2023-11-02T09:26:36Z   [Information]   DrainMode mode enabled
2023-11-02T09:26:36Z   [Information]   Calling StopAsync on the registered listeners
2023-11-02T09:26:36Z   [Information]   Call to StopAsync complete, registered listeners are now stopped
2023-11-02T09:26:36Z   [Verbose]   Received request to drain the host
2023-11-02T09:26:36Z   [Information]   DrainMode mode enabled
2023-11-02T09:26:36Z   [Information]   Calling StopAsync on the registered listeners
2023-11-02T09:26:36Z   [Information]   Call to StopAsync complete, registered listeners are now stopped
2023-11-02T09:26:54Z   [Information]   Host lock lease acquired by instance ID '000000000000000000000000F71781ED'.
2023-11-02T09:27:10Z   [Verbose]   Received request to drain the host
2023-11-02T09:27:10Z   [Information]   DrainMode mode enabled
2023-11-02T09:27:10Z   [Information]   Calling StopAsync on the registered listeners
2023-11-02T09:27:10Z   [Information]   Call to StopAsync complete, registered listeners are now stopped
vrdmr commented 8 months ago

@ajstewart - Could you please share the function app name using this mechanism - https://github.com/Azure/azure-functions-host/wiki/Sharing-Your-Function-App-name-privately

ajstewart commented 8 months ago

@ajstewart - Could you please share the function app name using this mechanism - https://github.com/Azure/azure-functions-host/wiki/Sharing-Your-Function-App-name-privately

Hi @vrdmr, sure. The demo one where I'm replicating the error is running at funcdatanrtest and it is in a working state at the moment.

We had two other functions go down because of this problem around midnight UTC on October 25th going into October 26th.

For Function 1, the last successful invocation ID before we had to act was 3cca06a7-22f9-4ccd-b2eb-9c13cd869965 (10:54 pm UTC, 25/10).

For Function 2, the last successful invocation ID before we had to act was cfa79493-c79c-4a12-bec9-bd5d16d0ad92 (10:15 pm UTC, 25/10).

Both these functions are in a running state again now after removing the app setting.

Is this enough information?

ajstewart commented 6 months ago

Hello, I wondered if there would be any further investigation into this? We're evaluating whether Functions can remain in our critical workflows, but if the runtime can break like this through no changes of our own we are inclined to move away. I'd be interested to know what caused this issue, to help us understand. Thanks.

hallvictoria commented 6 months ago

Hey @ajstewart , apologies for the delay. I'm currently looking into this issue right now. Here's what I can see so far:

Thanks for your patience with this issue. In the meantime, please continue to set PIWD=0, and I'll continue investigating why this update caused your functions to break.

ajstewart commented 6 months ago

Hi @hallvictoria, many thanks for your reply!

Yes, the lack of exceptions very much looked like a segfault to me as well.

Thanks for the extra information, I'll be interested to see what you find!

Is there a way to freeze the worker versions on the functions? It seems non-ideal to have the rug pulled out from under you like what happened here.

hallvictoria commented 6 months ago

This is an interesting issue. I found that we updated the protobuf and grpcio version dependencies (from 3.19.3 to 4.22.0 and 1.43.0 to 1.54.2 respectively), and this update seems to be the cause of this issue. Reverting to the previously installed versions worked as expected.

From this, it looks like we're installing dependencies that the newrelic module also depends on, and these dependencies are incompatible. Somewhere newrelic is using these packages as well but expects an older (or at least different) version than what we're installing. The weird thing is that, since these packages aren't defined in the sample requirements.txt file, the same versions should be installed whether or not PIWD is true.
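
For anyone debugging something similar, here is a minimal diagnostic sketch (a hypothetical HTTP-triggered function, not code from this repro) that logs which protobuf and grpcio copies the app actually resolves to at runtime:

```python
import logging

import azure.functions as func
import google.protobuf
import grpc


def main(req: func.HttpRequest) -> func.HttpResponse:
    # Log version and file location to see whether protobuf/grpcio resolve to
    # the worker's bundled copies or the app's own packages.
    logging.info("protobuf %s from %s", google.protobuf.__version__, google.protobuf.__file__)
    logging.info("grpcio %s from %s", grpc.__version__, grpc.__file__)
    return func.HttpResponse("ok")
```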

Just curious, is there a specific reason you're setting PIWD=1?

Freezing worker versions isn't possible on consumption plans, only on dedicated. It's generally not recommended to pin versions, as this can cause unexpected dependency issues as well. I agree this was a miss on our end. For future releases involving dependency updates, we do plan on giving more notice and thoroughly testing it to ensure issues like this don't occur again.

ajstewart commented 6 months ago

This is an interesting issue. I found that we updated the protobuf and grpcio version dependencies (from 3.19.3 to 4.22.0 and 1.43.0 to 1.54.2 respectively), and this update seems to be the cause of this issue. Reverting to the previously installed versions worked as expected.

From this, it looks like we're installing dependencies that the newrelic module also depends on, and these dependencies are incompatible. Somewhere newrelic is using these packages as well but expects an older (or at least different) version than what we're installing. The weird thing is that, since these packages aren't defined in the sample requirements.txt file, the same versions should be installed whether or not PIWD is true.

Right, thanks for the explanation!

Just curious, is there a specific reason you're setting PIWD=1?

I naively thought that this setting would avoid this exact situation; the note in the docs says:

If your package contains certain libraries that might collide with worker's dependencies (for example, protobuf, tensorflow, or grpcio), configure PYTHON_ISOLATE_WORKER_DEPENDENCIES to 1 in app settings to prevent your application from referring to worker's dependencies.

So am I misunderstanding something? The above seemed like a good idea, and I thought it would avoid conflicts.

I was also having trouble with this issue: https://github.com/Azure/azure-functions-python-worker/issues/1245, where I was getting thread locking with the same function that had this problem. Now and then I would get random function timeouts because the process just hung, but it wasn't consistent. Once I realised how the function runs with the pool I figured something must be conflicting in that environment, and sure enough changing the number of threads to 1 fixed the issue. So I had also tried the PIWD setting here to combat that and had ended up leaving it.
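
(For context, the single-thread workaround presumably corresponds to an app setting along these lines -- assuming PYTHON_THREADPOOL_THREAD_COUNT is the setting that was changed; placeholder resource group and app names:)

```
az functionapp config appsettings set -g <resource-group> -n <function-app-name> \
  --settings PYTHON_THREADPOOL_THREAD_COUNT=1
```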

Freezing worker versions isn't possible on consumption plans, only on dedicated. It's generally not recommended to pin versions, as this can cause unexpected dependency issues as well. I agree this was a miss on our end. For future releases involving dependency updates, we do plan on giving more notice and thoroughly testing it to ensure issues like this don't occur again.

Ok, I wondered if that would be the case with consumption. Are there preview releases of the runtime to test? If the runtime is updated alongside a release there might not be time to test a function. It would seem that the only way to get a definite environment that will keep working is to upgrade the plan (which I don't really need to do at the moment).

hallvictoria commented 6 months ago

We recommend setting PIWD=1 mainly if your app uses any of these libraries: Python Worker Dependencies. These are our worker dependencies.

I agree with your reasoning, and I'll see if we can make our docs more specific. We don't set PIWD to true by default because of a case where a package (dotenv, for example) uses a main.py file -- PIWD=1 plus this file overrides our worker's main.py file, which caused issues in importing modules and starting the worker successfully.

We don't have any way to preview upcoming releases. The one thing I could recommend for this is to have a duplicate function running in one of our early release regions: Central US EUAP, East US 2 EUAP, France Central, West Central US, North Europe, or West US 2.

Again, I'm really sorry for this issue and I'll keep investigating. I've narrowed it down to our internal protos module. It seems that the reference to this module is being changed, and when our code tries to access the module it segfaults.

ajstewart commented 6 months ago

Ok, though I'm still a bit confused - should PIWD=1 be set here if I know my dependencies are using one of those, e.g. protobuf?

I guess I'm not clear on the installation and package management method used in the runtime. With PIWD=1 it is prioritising loading the app's requirements first - are these installed independently of the worker's modules? The docs say that with PIWD=0 some modules may be shadowed - what does this mean in practice? Just attempting to understand the inner workings a bit better.

Thanks for your updates and responses, it's very much appreciated!

hallvictoria commented 6 months ago

There are two main cases where we recommend setting the dependency isolation flag (using protobuf as an example here, but this applies to any of the Python worker dependencies mentioned before):

  1. If your app is using protobuf
  2. If you know that one of the packages you depend on has a dependency on protobuf

These are the cases where your app depends on a specific version of protobuf that might be different from what we use in the worker. If your app doesn't fit into those scenarios, enabling the flag isn't needed.

newrelic here is an odd case. newrelic on its own doesn't depend on protobuf or grpcio, but newrelic[infinite-tracing] does (as seen here). I'd say in this case, just using newrelic, PIWD doesn't need to be set. If you were using newrelic[infinite-tracing], it would need to be set.
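
As a requirements.txt illustration (the package lines and comments below are mine, not from the original repro):

```
# requirements.txt
azure-functions
newrelic                      # no protobuf/grpcio dependency -> PIWD not strictly needed
# newrelic[infinite-tracing]  # pulls in grpcio/protobuf -> a case for PIWD=1
```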

The main thing setting PIWD does is change which dependencies are prioritized. With PIWD=1, packages from the app will be prioritized. These packages are independent from the worker's -- the worker will run using its own dependencies. I'm not quite sure what you mean by shadowed modules?
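
Conceptually it's just a question of search order -- a simplified sketch of the idea (paths are illustrative, not the worker's real layout or implementation):

```python
import sys

# Illustrative locations only -- not the worker's actual directory layout.
APP_PACKAGES = "/home/site/wwwroot/.python_packages/lib/site-packages"  # app's own dependencies
WORKER_PACKAGES = "/azure-functions-worker/deps"                        # worker's bundled dependencies


def build_search_path(isolate_worker_dependencies: bool) -> list:
    # PIWD=1: the app's packages are searched first, so a protobuf pinned in
    # requirements.txt wins. PIWD=0: the worker's copies are found first.
    if isolate_worker_dependencies:
        return [APP_PACKAGES, WORKER_PACKAGES] + sys.path
    return [WORKER_PACKAGES, APP_PACKAGES] + sys.path
```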

For example, the worker depends on protobuf==4.22.0. If PIWD=0, 4.22.0 will be used by the worker and the app (if it's needed in the app). If PIWD=1 and protobuf==3.19.3 is in requirements.txt, the app will run using 3.19.3 and the worker will use 4.22.0. If protobuf==3.19.3 isn't in requirements.txt, the app will also run using 4.22.0.

Does that help? Please let me know if you have any other questions! :)

ajstewart commented 6 months ago

Ah ok that's clearer - thanks!

On the shadowed comment I'm not sure; it's the wording from the docs page here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-app-settings#python_isolate_worker_dependencies. I guess it just means you'll potentially be using an unexpected version with PIWD=0 if it happens to be a worker module.

It seems like setting it to 1 would always be safer when you have pinned requirements (though I suppose the set of modules the worker uses is limited). That said, I'm probably just about to add dotenv as a requirement to a function, so from what you said earlier it seems like I should avoid it where possible.

hallvictoria commented 6 months ago

Ah yes, I see, that's true.

There are a few packages for which setting PIWD=1 does cause unexpected issues, as in the dotenv case, so in general we recommend only setting it if the app uses protobuf or grpcio. That isn't an ideal scenario, of course (PIWD=1 not working for certain packages), and it is something we're trying to improve upon.

I'm still investigating what exactly caused a segfault in this case, but I did find that adding protobuf and grpcio to the requirements.txt with PIWD=1 worked as expected.
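
For example, something along these lines (the pinned versions below just mirror the worker versions mentioned earlier and are only illustrative -- pin whatever your app actually needs), together with PYTHON_ISOLATE_WORKER_DEPENDENCIES=1 in app settings:

```
# requirements.txt
azure-functions
newrelic
protobuf==4.22.0
grpcio==1.54.2
```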

hallvictoria commented 5 months ago

Hey, sorry for the delay. I figured out what happened here.

While newrelic doesn't install protobuf, protobuf is part of its infinite-tracing module and during setup newrelic checks if protobuf has been installed. Depending on the version it finds, it imports the appropriate pb2 file. If protobuf isn't found, newrelic assumes the version to be <4 and imports the v3 pb2 file.
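
In rough pseudocode, the detection pattern being described looks something like this (an illustrative sketch only, not newrelic's actual source):

```python
# Illustrative sketch of the version-detection pattern described above.
try:
    from google.protobuf import __version__ as protobuf_version
except ImportError:
    protobuf_version = None  # protobuf not importable from the prioritized packages

if protobuf_version and int(protobuf_version.split(".")[0]) >= 4:
    pb2_module = "infinite_tracing_v4_pb2"  # stubs built for the protobuf 4.x runtime
else:
    # protobuf not found (or a 3.x version): assume < 4 and fall back to the v3 stubs.
    # With PIWD=1 and protobuf absent from requirements.txt this branch is taken even
    # though the worker has protobuf 4.22 loaded -- hence the incompatibility.
    pb2_module = "infinite_tracing_v3_pb2"

print("would import:", pb2_module)
```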

TLDR; PIWD=0 means that newrelic's import also searches for modules from the worker dependencies. When trying to import protobuf, newrelic searched and found it from the worker dependencies and imported the appropriate pb2 file.

PIWD=1 means that newrelic searched for protobuf only from the app dependencies. Because the app dependencies didn't have protobuf (protobuf not in requirements.txt), newrelic assumed a specific protobuf version (<4). However, protobuf was brought through the worker, and when the version the worker brought didn't fit the assumption (the worker brought >= 4), newrelic's import failed and caused the worker to crash.


Before the segfault issue occurred: PIWD=1, protobuf not in req.txt, and the worker brings 3.19. The import statement checks user dependencies (per PIWD=1), doesn't find protobuf installed, and imports the v3 pb2 file (assuming protobuf version < 4). Protobuf is brought through the worker side, but newrelic only checks for it on the user side. We then find protobuf v3.19 (brought as a worker dependency), which is compatible with the v3 pb2 file newrelic imported.

After the segfault: PIWD=1, protobuf not in req.txt, and the worker brings 4.22. The same thing happens on the newrelic side when trying to import protobuf -- it's not found as part of the user dependencies and the v3 pb2 file is imported. However, on the worker side, we updated to protobuf v4.22. Again, newrelic was checking if protobuf was installed only from the user dependencies and making an import based on that. So here newrelic assumed protobuf v3 and imported a file based on that, but protobuf v4 was actually brought by the worker. This is where the incompatibility happened, and the worker crashed.

Working case 1: PIWD=0 (worker dependencies are prioritized). Having PIWD=0 worked when the worker was bringing either protobuf v3.19 or v4.22. What happens here is that, since worker dependencies are prioritized, newrelic's import statement checks if protobuf has been installed and finds it from the worker (vs. PIWD=1, where it was only checking for it in the user dependencies). When the worker brought 3.19, newrelic imported the v3 pb2 file. When it brought 4.22, newrelic imported the v4 pb2 file. It didn't matter which version was brought, because PIWD=0 meant that newrelic's import found the protobuf module from the worker dependencies and made a compatible import.

Working case 2: PIWD=1 and protobuf in req.txt. The import statement checks user dependencies, finds protobuf, and imports the appropriate pb2 file. The worker and app each bring their own protobuf version, and each side uses the version it brings. There's no conflict because even if the app brings protobuf v<4 and the worker uses v>=4, newrelic will use the version from the app side.

The main takeaway here, imo, is that PIWD changes where newrelic looks for protobuf in its import statement. I think newrelic's version assumption is an odd case -- protobuf isn't a direct dependency of newrelic, and this scenario where protobuf isn't found by the import check but is still brought in is probably rare. Regardless, this app setting does need more documentation and we will work on fixing that.

Hopefully this clears things up, but please let me know if you have any questions :)