Azure / azure-functions-python-worker

Python worker for Azure Functions.
http://aka.ms/azurefunctions
MIT License

[BUG] Function occasionally hangs with no error reported and times out #1245

Closed ajstewart closed 1 year ago

ajstewart commented 1 year ago

Hello, I have a queue-triggered Azure Function that performs some light processing on a single image file, which should take no more than about 10-30 seconds.

I am using the v1 Python programming model (some bindings I'd like to use aren't ready yet), and the function is entirely synchronous.

It runs on a Consumption plan on Linux.

Nearly every invocation completes without any issue. But several times a day, the function suddenly hangs and times out. There is no other error; it just stops, and the next message is the function timeout (this sounds very much like #1213).

When the failed queue message is retried, the invocation then goes through without a problem.

I'm unable to post code here, but for context, the function's process looks like:

The trigger rate is not high, only around 10-20 messages an hour.

It consistently hangs on one of the steps above. I thought it was something I was doing, but since it works most of the time, maybe there is something about how the code is being executed that I'm not aware of or not accounting for. Really, I have no idea, and it's driving me a bit nuts 😅
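Because the invocation dies with no error, one way to see which step it is stuck on is Python's standard-library `faulthandler` module, which can dump the stack of every thread after a delay. This is a minimal sketch, assuming stderr (or whatever file you pass in) ends up somewhere you can read it on the Consumption plan; `hang_watchdog` and `process_image_step` are hypothetical names, not part of the worker or the author's code.

```python
import faulthandler
import sys
import time
from contextlib import contextmanager


@contextmanager
def hang_watchdog(seconds: float, file=None):
    """Dump every thread's stack to `file` if the block runs longer than `seconds`.

    Nothing is written if the block finishes in time, and the watchdog is
    always disarmed on exit, even if the block raises.
    """
    # faulthandler writes straight to the file descriptor, so `file` must
    # have a real fileno() (sys.stderr in a normal process does).
    faulthandler.dump_traceback_later(seconds, repeat=False, file=file or sys.stderr)
    try:
        yield
    finally:
        faulthandler.cancel_dump_traceback_later()


def process_image_step():
    # Hypothetical stand-in for one of the image processing steps.
    time.sleep(0.01)


if __name__ == "__main__":
    # Arm the watchdog well below the function timeout so the stacks are
    # written before the host abandons the invocation.
    with hang_watchdog(seconds=20):
        process_image_step()
```

The watchdog only records where threads are blocked when the delay expires; it does not prevent or fix the hang.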

Investigative information

Failure Example Details:

Repro steps

I am unable to reproduce this locally; all my tests pass.

Expected behavior

Function runs without hanging and timing out (as it does most of the time).

Actual behavior

It hangs occasionally, approximately 3-10 times a day.

Related information

I realise I haven't provided much information here. I'm happy to provide more, but I will have to do so outside of this public issue.

bhagyshricompany commented 1 year ago

Thanks for reporting this issue. Please provide the function name, region, etc. so we can check; you can share the repro steps privately here: https://github.com/Azure/azure-functions-host/wiki/Sharing-Your-Function-App-name-privately

ajstewart commented 1 year ago

It turns out I can close this issue myself. After much hunting around and experimenting, I reread this page on how synchronous functions are run: https://learn.microsoft.com/en-us/azure/azure-functions/python-scale-performance-reference

It looked suspiciously like thread locking, so I added

"PYTHON_THREADPOOL_THREAD_COUNT" = 1

to the app settings, and sure enough, that fixed the problem. I experimented by removing it, and the timeouts returned. I've now had multiple days with no timeouts while forcing a single thread.
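The performance page linked above describes synchronous functions as running on a thread pool whose size `PYTHON_THREADPOOL_THREAD_COUNT` controls, so several invocations can overlap in one worker process. The sketch below models that with a plain `concurrent.futures.ThreadPoolExecutor` (an assumption for illustration, not the worker's actual code) and measures how many fake invocations ever run at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def max_concurrency(max_workers: int, invocations: int = 8) -> int:
    """Run `invocations` fake sync calls on a pool of `max_workers` threads
    and return the largest number that were running at the same time."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def fake_invocation() -> None:
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.05)  # stand-in for the image processing work
        with lock:
            running -= 1

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fake_invocation) for _ in range(invocations)]
        for future in futures:
            future.result()  # re-raise any exception from a worker thread
    return peak


if __name__ == "__main__":
    print(max_concurrency(4))  # usually 4: invocations overlap
    print(max_concurrency(1))  # always 1: invocations run one at a time
```

With one thread, any code that misbehaves when called concurrently can no longer be called concurrently, which is consistent with the setting making the timeouts disappear. The trade-off is lower per-instance throughput, which matters little at 10-20 messages an hour.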

I'm not sure why the thread locking occurs; maybe it has something to do with a scikit-learn method I'm using, as that seems to be where it consistently broke.

@bhagyshricompany would you still like the details to investigate? I included an invocation ID above; can you use that? The region is UK West.