Azure / azure-functions-python-worker

Python worker for Azure Functions.
http://aka.ms/azurefunctions
MIT License

Nondeterministic deploy bug with 5+ different nondescript failures on two different OSes, worked before, support stumped #1306

Open ZirconCode opened 1 year ago

ZirconCode commented 1 year ago

Unsure how to title this. It's been a week of debugging and isolating with no results. Is this issue in the correct repo? Also unsure. Help me out here. I'll walk you through my journey.

I have a very large code base that runs flawlessly locally. I deployed it without problems for a long time, until recently. I made a change: adding google-cloud-texttospeech to requirements.txt. It stopped deploying after this (maybe relevant, maybe not). Reverting the change, back to the exact same code base as before, still fails to deploy.

I get the errors below at random, and I've tried incredibly hard to isolate them. Both development environments can deploy other things successfully, and can push to a new app fine as well. I had not made any local environment changes at all when it began to fail.

Details: Azure Functions runtime version 4.24.4.4, Linux premium plan with Elastic Premium EP3. I deploy directly from the Azure Functions extension for Visual Studio Code. My Python code follows the folder structure of the v1 coding model (no decorators, lots of folders), however I am on v2 (host.json etc.); this has always worked, and runs locally of course. Deploying to Python 3.9.7.

First development environment: Manjaro linux, vscode 1.81.1, azure extension 1.12.3

Errors I've encountered seemingly at random, when trying to deploy:

2:50:41 PM debugApp123: **Deployment successful**. deployer = ms-azuretools-vscode deploymentPath = Functions App ZipDeploy. Extract zip. Remote build.
2:51:07 PM debugApp123: Syncing triggers...
2:51:12 PM debugApp123: Querying triggers...
2:51:18 PM debugApp123: **No HTTP triggers found.**

3:34:11 PM debugApp123: Deployment Failed. deployer = ms-azuretools-vscode deploymentPath = Functions App ZipDeploy. Extract zip. Remote build.
3:34:25 PM debugApp123: Deployment failed.

4:00:49 PM: Error: The operation was aborted.

10:52:10 AM ae-API-compute: Deployment successful. deployer = ms-azuretools-vscode deploymentPath = Functions App ZipDeploy. Extract zip. Remote build.
10:52:42 AM ae-API-compute: Syncing triggers...
10:52:49 AM ae-API-compute: Querying triggers...
10:52:52 AM ae-API-compute: WARNING: Some http trigger urls cannot be displayed in the output window because they require an authentication token. Instead, you may copy them from the Azure Functions explorer.

Except it was not successful and functions are empty / don't run (the zip file in webjobstorage contains them, though).

11:55:04 AM debugApp123: Syncing triggers...
11:55:45 AM debugApp123: Syncing triggers (Attempt 2/6)...
11:55:56 AM debugApp123: Syncing triggers (Attempt 3/6)...
11:56:17 AM debugApp123: Syncing triggers (Attempt 4/6)...
11:56:59 AM debugApp123: Syncing triggers (Attempt 5/6)...
11:58:20 AM debugApp123: Syncing triggers (Attempt 6/6)...
11:59:12 AM: Error: Encountered an error (ServiceUnavailable) from host runtime.

And a very exciting shiny rare one:

Offset to Central Directory cannot be held in an Int64.

Second development environment: Windows, visual studio code, azure extension v1.12.4. I've also used this environment before, and it has worked, same as the above one.

Errors:

One successful deploy, and randomly most of the above (with no changes), as well as a new one:

10:35:34 AM ae-API-compute: Starting deployment...
10:35:34 AM ae-API-compute: Creating zip package...
10:45:11 AM: Error: socket hang up

Some other random things I've tried:

What now? The errors are not helpful. Logs are missing: I can drill down into events, insights, and many different logs, but they are all over the place and all useless or empty. I've mentioned that I've had multiple 2hr+ calls with the technical support team, with ever-increasing escalations, and they are just as stumped as I am (and I am grateful for their efforts).

Any thoughts?

What do I try next? Any information I can provide?

See also (maybe relevant, I don't know at this point):
https://github.com/microsoft/vscode-azurefunctions/issues/3805
https://github.com/microsoft/vscode-azurefunctions/issues/2529
https://github.com/microsoft/Oryx/issues/1774
https://stackoverflow.com/questions/76478668/adding-python-module-google-cloud-storage-is-causing-a-working-azure-function-ap
https://stackoverflow.com/questions/72441758/typeerror-descriptors-cannot-not-be-created-directly
https://github.com/projectkudu/kudu/issues/3348
https://github.com/microsoft/azure-pipelines-tasks/issues/14201

bhagyshricompany commented 1 year ago

Please share the function name, app name, instance ID, timestamp, region, etc.

ZirconCode commented 1 year ago

function name: all, since the deploy doesn't work
app name: as in the logs above, ae-API-compute, but also debugApp123
instance id: azure function instance id? (i.e. ExecutionContext.InvocationId?), not relevant since it is a deploy error
timestamp: for some examples see logs above
region: west europe

etc.: I also wish I could provide the relevant information to isolate the error, however the error messages have not allowed me to do so.

ZirconCode commented 1 year ago

Setting PYTHON_ISOLATE_WORKER_DEPENDENCIES:1 also does not resolve the issue.

bhagyshricompany commented 1 year ago

Please create a support request in the Azure portal: https://learn.microsoft.com/en-us/azure/azure-portal/supportability/how-to-create-azure-support-request

ZirconCode commented 1 year ago

As mentioned previously, I am already in contact with azure support.

ZirconCode commented 1 year ago

Tried:

Ran into two new bugs while isolating:

I've been using .funcignore and disabling functions to try to isolate the problem within the scope of my larger project, since it was not possible to do so from a clean project upwards. Both these things seem to invite a host of new issues.

The combination of all these issues makes it impossible to work. It is very disappointing.

ZirconCode commented 1 year ago

I have isolated and reproduced reliably one of the vague errors listed above:

4:00:49 PM: Error: The operation was aborted.

This specific reproducible isolated case was fixable with:

Also a note that all my logfiles are still non-existent on failed deployment. This shouldn't be the case.

ZirconCode commented 1 year ago

So, the above error

2:39:54 PM ae-api-compute-secondary: Writing the artifacts to a Zip file
2:40:18 PM: Error: The operation was aborted.

came back when including further pieces of my project.

I isolated it to the line import tempfile. For some reason this causes the error. It worked previously. This also causes the abort when I have it in a default httptrigger template function by itself.

ZirconCode commented 1 year ago

I have discovered a new non-reproducible randomly appearing bug:

2:57:46 PM ae-api-compute-secondary: Starting deployment...
2:57:46 PM ae-api-compute-secondary: Creating zip package...
2:58:02 PM ae-api-compute-secondary: Zip package size: 180 MB
2:58:04 PM ae-api-compute-secondary: Fetching changes.
2:58:06 PM ae-api-compute-secondary: Cleaning up temp folders from previous zip deployments and extracting pushed zip file /tmp/zipdeploy/1455b335-1904-468e-8ed3-3384bf99dbe4.zip (0.00 MB) to /tmp/zipdeploy/extracted
2:58:06 PM ae-api-compute-secondary: Central Directory corrupt.
2:58:13 PM ae-api-compute-secondary: Deployment failed.

I'm not even going to try to figure that one out.
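For anyone hitting the same "Central Directory" errors: the log above reports the pushed zip as 0.00 MB, which could mean the archive arrived truncated, although the thread does not confirm this. A minimal sketch of a local sanity check, assuming you have a copy of the deployment package on disk (the function name `check_zip` is hypothetical, not part of any Azure tooling), using only Python's standard `zipfile` module:

```python
import zipfile
from pathlib import Path


def check_zip(path):
    """Return a list of problems found in the archive at `path` (empty if none)."""
    p = Path(path)
    if not p.is_file():
        return [f"{p} does not exist"]
    if p.stat().st_size == 0:
        return [f"{p} is empty (0 bytes)"]

    problems = []
    try:
        with zipfile.ZipFile(p) as zf:
            # testzip() reads every member and returns the first bad name, or None.
            bad = zf.testzip()
            if bad is not None:
                problems.append(f"corrupt member: {bad}")
            if not zf.namelist():
                problems.append("archive contains no files")
    except zipfile.BadZipFile as exc:
        # Raised when the end-of-archive record or central directory cannot be parsed.
        problems.append(f"not a valid zip: {exc}")
    return problems
```

An empty result only says the local file is a readable zip; it cannot tell you whether the upload itself was cut short on the way to the server.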

I isolated the next reason for abort to including openai in requirements.txt (no importing). During the pip install there are no errors and the deploy seems to be satisfied, however it aborts at the end.

ZirconCode commented 1 year ago

So, I have solved the deployment issue as a final step, at least for me, by using a specific AUR package and deploying from the terminal instead of the azure extension:

  1. Arch repo for working azure cli (currently): https://aur.archlinux.org/packages/azure-cli
  2. az login
  3. func azure functionapp publish appName --slot slotName

Interestingly, the deployment zip is around 200mb smaller, though both do a remote build.
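One way to investigate a size difference like this is to see which parts of the project folder are heaviest before anything is zipped. The following is a rough sketch, not something from the thread: `size_by_top_level` is a hypothetical helper that sums uncompressed file sizes per top-level entry of a project directory. It does not read .funcignore and does not reproduce what either deployment tool actually packages.

```python
import os
from pathlib import Path


def size_by_top_level(project_dir):
    """Return {top-level entry name: total bytes}, ordered largest first."""
    root = Path(project_dir)
    totals = {}
    for entry in root.iterdir():
        if entry.is_file():
            totals[entry.name] = entry.stat().st_size
        elif entry.is_dir():
            total = 0
            # os.walk does not follow directory symlinks by default.
            for dirpath, _dirnames, filenames in os.walk(entry):
                for name in filenames:
                    fp = Path(dirpath) / name
                    if not fp.is_symlink():
                        total += fp.stat().st_size
            totals[entry.name] = total
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

Printing the first few items of the returned dict shows whether something like a virtual environment or a data folder dominates the project. The numbers are uncompressed sizes, so they will not match the zip size reported in the deployment log.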

I will keep this issue open because I think it highlights the need for better/existent logging and error feedback in many cases. The above combination of steps got my project deployable again, however I will likely never know what was broken, and why it happened without my agency, and good luck to anyone with similar issues.

jannikmi commented 11 months ago

In case it helps others: Check the environment variable names you are using. They might conflict with Azure-specific variable names and thereby cause errors. In my case no HTTP triggers were found (failing silently) because of the environment variable CONTAINER_NAME.
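A check like this can be scripted. The sketch below assumes your settings live under the "Values" key of a local.settings.json file; `find_suspect_settings` and `SUSPECT_NAMES` are hypothetical names. The only entry in the set is the one reported in this comment, since there is no confirmed list of conflicting names in this thread.

```python
import json
from pathlib import Path

# Names believed to clash with platform-defined variables. Only CONTAINER_NAME
# comes from this thread; add others only after verifying them yourself.
SUSPECT_NAMES = {"CONTAINER_NAME"}


def find_suspect_settings(settings_path, suspects=SUSPECT_NAMES):
    """Return setting names from the "Values" section that appear in `suspects`."""
    data = json.loads(Path(settings_path).read_text(encoding="utf-8"))
    values = data.get("Values", {})
    # Compare case-insensitively so "container_name" is flagged as well.
    wanted = {name.upper() for name in suspects}
    return sorted(name for name in values if name.upper() in wanted)
```

Calling `find_suspect_settings("local.settings.json")` returns the offending names, or an empty list if none match. App settings configured only in the portal are not covered by this check.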

Very annoying and impossible to debug. Please add verbose error messages to the deployment output!

lucazav commented 9 months ago

For almost a year now, there has been a bug in Azure Functions for Python that prevents them from being used productively: the message "No HTTP triggers found" is displayed at the end of deployment from VS Code, even though the function code works correctly.

At this link you will find the desperate attempt of developers to report the anomaly in an issue in the GitHub repository of Oryx: https://github.com/microsoft/Oryx/issues/1774

After several reports on that issue, Paul Dorsh rightly responds that the problem is not with Oryx (used to build the code), but with the part that deploys to the Azure function: https://github.com/microsoft/Oryx/issues/1774#issuecomment-1509093908

Paul points to the vscode-azurefunctions repo, where someone takes the report and fixes the bug, but only for Node.js, not Python: https://github.com/microsoft/vscode-azurefunctions/issues/3805

So much so that Simon from Zirconcode was prompted to open this issue in the azure-functions-python-worker repo: https://github.com/Azure/azure-functions-python-worker/issues/1306. It was opened in mid-August, and as of today (mid-December) it has still not been fixed, despite the severity of the bug.

One of the latest reports, again in the first issue thread on Oryx, says the following:

This bug forced me to remake the entire app for AWS Lambda. Nothing in this thread worked sadly. I prefer Azure but what can we do when something is just completely broken with no concrete solution.

(from https://github.com/microsoft/Oryx/issues/1774#issuecomment-1835920134)

I myself had to implement a feature for a major customer using an Azure function with Python, and I sweated my nuts off trying to figure out what was wrong (what should have been 1 day of implementation took me 5!).

Is it possible that this serious bug cannot be addressed in a reasonable time?