Azure / azure-functions-durable-python

Python library for using the Durable Functions bindings.

Long running issue with Durable Python function #256

Closed: tamathew closed this issue 3 years ago

tamathew commented 3 years ago

I'm running Snowflake SQL statements from ADF through an Azure Function activity, which calls a Python durable Azure Function. When I tested with a long-running SQL statement, "call system$wait(9, 'MINUTES')", it ran beyond 9 minutes, and I aborted the job at the 35th minute. The status from the statusQueryGetUri endpoint is below.

Output -

{
  "name": "ExecSnowSQLDurableOrchestrator",
  "instanceId": "452918153ab741ab94e69ce37581408c",
  "runtimeStatus": "Running",
  "input": "{\"sql\": \"call system$wait(9, 'MINUTES')\", \"activity_name\": \"Runner_2\", \"factory_name\": \"gdapsandboxadf\", \"pipeline_name\": \"PL_PYTHON_DURABLE_FUNCTION_POC\"}",
  "customStatus": null,
  "output": null,
  "createdTime": "2021-01-26T19:43:26Z",
  "lastUpdatedTime": "2021-01-26T20:06:30Z"
}

The output log from Azure Monitor (DurableActivity), however, shows that the SQL did complete with an output. But for some reason the result was not reflected in the webhook statusQueryGetUri.

This issue is also intermittent. Please let me know if you need more info.

2021-01-26 19:57:30.261 query: [call system$wait(9, 'MINUTES')] Information
2021-01-26 20:06:30.355 query execution done Information
2021-01-26 20:06:30.357 Output of the SQL execution : waited 9 minutes Information

HTTP starter code (launches the durable orchestrator) -

import json
import logging

import azure.durable_functions as df
import azure.functions as func


async def main(req: func.HttpRequest, starter: str) -> func.HttpResponse:
    try:
        req_body = ""
        client = df.DurableOrchestrationClient(starter)
        req_body = req.get_body().decode()
        payload = json.loads(req_body)
        # Start the durable orchestrator function
        instance_id = await client.start_new(req.route_params["functionName"], client_input=payload)
        logging.info(f"Started orchestration with ID = '{instance_id}'.")
        response = client.create_check_status_response(req, instance_id)
        logging.info("Starter response is below:")
        logging.info(response)
        return response
    except Exception as e:
        logging.error(f"Exception occurred in Starter: {str(e)}")
        json_output = json.dumps(str(e))
        return func.HttpResponse(json_output, mimetype='application/json', status_code=400)
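For context, a minimal sketch of the orchestrator/activity pair a starter like this would launch. The function names, the Snowflake connection parameters, and the result handling are assumptions for illustration, not the code from this issue:

```python
import azure.durable_functions as df
import snowflake.connector  # assumed dependency: snowflake-connector-python


# Orchestrator (its own function folder): forwards the payload to one activity.
def orchestrator_function(context: df.DurableOrchestrationContext):
    payload = context.get_input()
    result = yield context.call_activity("ExecSnowSQLActivity", payload)
    return result


main = df.Orchestrator.create(orchestrator_function)


# Activity (separate function folder): runs the SQL against Snowflake.
# Connection parameters are placeholders.
def run_sql(payload: dict) -> str:
    conn = snowflake.connector.connect(
        account="<account>", user="<user>", password="<password>"
    )
    try:
        cur = conn.cursor()
        cur.execute(payload["sql"])
        row = cur.fetchone()
        return str(row[0]) if row else ""
    finally:
        conn.close()
```

Note that the Snowflake call is synchronous: it blocks the worker thread for the full duration of the SQL, which is relevant to the concurrency discussion below.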
tamathew commented 3 years ago

Hi @cgillum - Here is the log for af450c5387364f1d8aee5325e31a4522 orchestratorLog_af450c5387364f1d8aee5325e31a4522.xlsx

Can you check this?

cgillum commented 3 years ago

Those orchestrator logs are interesting because it looks like the same activity function was scheduled to run twice.

2/15/2021, 6:37:37.393 PM af450c5387364f1d8aee5325e31a4522: Function 'ADFMetaExtractorActivity (Activity)' started. IsReplay: False. Input: (636 bytes). State: Started. SlotName: Production. ExtensionVersion: 2.3.1. SequenceNumber: 5. TaskEventId: 0
2/15/2021, 6:32:36.702 PM af450c5387364f1d8aee5325e31a4522: Function 'ADFMetaExtractorActivity (Activity)' started. IsReplay: False. Input: (636 bytes). State: Started. SlotName: Production. ExtensionVersion: 2.3.1. SequenceNumber: 12. TaskEventId: 0

The difference between these two log statements is 5 minutes, which is also the visibility timeout on the queues. I also see that these two log statements were generated by two different VMs, so I wonder if the first VM picked it up, got stuck or killed, and then a second VM picked up the message 5 minutes later and executed the activity function immediately. I'll need to spend more time digging into this to see what happened exactly, but it so far appears to be a different issue from what we've been investigating up until now.

tamathew commented 3 years ago

Hi @cgillum - What is the fundamental difference in the way a .NET/C# durable function is executed vs. a Python durable function? Can you explain in plain English? ..lol :)

tamathew commented 3 years ago

@cgillum - Do you have any update on this issue ?

cgillum commented 3 years ago

What is the fundamental difference in way .NET/C# Durable Function is being executed vs Python Durable function being executed ?

Hey @tmathewlulu the fundamental difference is with the underlying runtime itself. In .NET/C#, an app can use multiple threads to process concurrent requests. .NET will also automatically add or remove threads as needed. Python, however, was designed to just use one thread at a time. If you need anything to run concurrently in Python, then you need to manually configure it to have more threads.

The way most triggers work in Azure Functions is that they try to execute multiple requests concurrently. This works great for C#/.NET because .NET will happily create multiple threads if needed to handle all the concurrent requests. Sadly, Python does not do this, so Function invocations get blocked waiting for a free thread to start executing your code. The workaround for this is to try and configure the Azure Functions trigger so that it doesn't try to take on more work than the Python worker can handle at any given time.
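For illustration, a minimal sketch of the difference this makes, assuming a long-running activity (the function names are illustrative). A blocking call such as time.sleep holds the single Python worker thread for its whole duration, while an async activity yields the thread back to the event loop while it waits:

```python
import asyncio
import time


def blocking_activity(seconds: int) -> str:
    # Holds the worker thread for the entire wait; with a single-threaded
    # Python worker, other invocations queue up behind this one.
    time.sleep(seconds)
    return f"waited {seconds} seconds"


async def async_activity(seconds: int) -> str:
    # Yields control to the event loop while waiting, so other invocations
    # on the same worker can make progress concurrently.
    await asyncio.sleep(seconds)
    return f"waited {seconds} seconds"
```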

I hope that makes sense. Let me know if I can help clarify further.

cgillum commented 3 years ago

Do you have any update on this issue?

So I think we're now talking about two different things. One is the concurrency behavior of Python functions. I think we've covered this topic sufficiently, and as far as I can tell there aren't any known issues requiring further investigation.

The other, regarding instance af450c5387364f1d8aee5325e31a4522 where I saw a duplicate execution, it looks like the container your app was running on was terminated mid-execution at 2021-02-15 18:32:57.0817379. This can happen if, for example, the platform is going through an upgrade or if a scale-in operation was scheduled. I'll need to follow up with the Azure Functions Consumption Linux team to understand what the exact cause was. It's my understanding that you've opened a support request already so I'll pass this information along so that they can do a root cause analysis.

tamathew commented 3 years ago

chrisFunction_d7743cf04f854dd3a65d6ae2a95493a7.xlsx

@cgillum - Python concurrency issue - this issue persists. The code you shared is not working. In the attached log, the orchestrator started at 2/22/2021, 9:49:27.026 PM and the activity started at 2/22/2021, 9:56:35.935 PM. There is a 7-minute delay. Why?

cgillum commented 3 years ago

@tmathewlulu what were your values for PYTHON_THREADPOOL_THREAD_COUNT, FUNCTIONS_WORKER_PROCESS_COUNT, maxConcurrentActivityFunctions, and maxConcurrentOrchestratorFunctions?
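For reference, these four settings live in two different places: the first two are application settings (local.settings.json when running locally, or the Function App's configuration in Azure), while the last two go under extensions.durableTask in host.json. The values below are a sketch for illustration, not a recommendation. A host.json along these lines:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 5,
      "maxConcurrentOrchestratorFunctions": 5
    }
  }
}
```

And the corresponding application settings in local.settings.json:

```json
{
  "Values": {
    "FUNCTIONS_WORKER_PROCESS_COUNT": "1",
    "PYTHON_THREADPOOL_THREAD_COUNT": "10"
  }
}
```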

tamathew commented 3 years ago

@cgillum - These are the settings I used. I also tried without them, and also made the activity async with asyncio.sleep:

"FUNCTIONS_WORKER_PROCESS_COUNT": 1,
"PYTHON_THREADPOOL_THREAD_COUNT": 10

cgillum commented 3 years ago

What about the other two concurrency settings in host.json?

tamathew commented 3 years ago

Hi @cgillum - The other settings were not set. However, I reran with the settings below...

"FUNCTIONS_WORKER_PROCESS_COUNT": 1, "PYTHON_THREADPOOL_THREAD_COUNT": 10, "extensions": { "durableTask": { "maxConcurrentActivityFunctions": 5, "maxConcurrentOrchestratorFunctions": 5 } Also async qualifier for activity function and asyncio.sleep for sleep and ran 15 parallel iterations with various sleep intervals - To my surprise all jobs completed as expected.

I then got greedy and executed my original business requirement of running Snowflake SQL sleep commands, and it is not behaving as expected: it still has the long-running issue.

I will have to do more investigation by tweaking settings.

davidmrdavid commented 3 years ago

Hi @tamathew!

It appears this issue hasn't gotten activity in a bit, so I was wondering if it has been resolved for now. Just asking as part of our regular GitHub maintenance and clean-up work. Thanks!

tamathew commented 3 years ago

Hi @davidmrdavid - I have not had time to do my final round of testing. Will update this thread once I'm done.

tamathew commented 3 years ago

Hi @cgillum @davidmrdavid - I'm intermittently seeing issues with the Azure Python durable function. However, I rewrote my code in C# to execute the Snowflake SQL queries and it worked well, so we decided to go with the C# runtime for now.

davidmrdavid commented 3 years ago

Hi @tamathew - Sounds good. Glad to know you got unblocked, at least. As Durable Python continues to mature, I'm sure we'll see more use cases with ADF and Snowflake SQL, so do check back in the future. I'll be closing this issue for now. Thanks!