Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License

Ad-hoc launching failure of Durable function instance when running a few parallel ones #1383

Closed sunnyhay closed 3 years ago

sunnyhay commented 4 years ago

Description

My team uses Durable Functions to interact with Adobe for marketing data ingestion. The ingestion service is embedded in Azure Data Factory (ADF), which kicks off several Durable function instances to run in parallel. In the typical ingestion workflow, given 5 data batches, the ingestion service triggers 5 parallel Durable function instances, one per data batch. Please email suhai@microsoft.com for more details. There are no logs at all for the Durable function in either ADF or Application Insights.

Expected behavior

All 5 parallel Durable function instances should run reliably, each handling one data batch.

Actual behavior

Recently I found that one Durable function instance sometimes fails to launch. Within a few seconds, the ADF web activity that wraps the Durable function run fails without any log. The problem is intermittent: I have seen dozens of consecutive runs with 5 parallel Durable function instances complete without any problem, but occasionally one of the parallel instances fails.

Relevant source code snippets

// host.json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "AEPIngestorHub"
    }
  }
}

The internal code repo can be found here.

Known workarounds

This is the ADF web activity input to invoke the failed Durable function instance:

{
  "url": "https://mps-ingestorservice-ppe.azurewebsites.net/api/OrchestrationClientStart?code=xYAM3Llg2wHNcSG2Y/7br9cyyl9ZLzjaLTEcvUosbAY7mWVTo/mqsA==",
  "method": "POST",
  "headers": {
    "Content-Type": "application/json"
  },
  "body": "{\"folder\":\"processing\",\"pathToFile\":\"aepprocessing/AEPSource/Azure/AdvisorRecommendations/5ea89a127ca9d318a828902b/processing/20200610_091852.parquet\",\"XCV\":\"3c09b346-6dd6-4a0e-8851-bacfbff300cc\"}",
  "linkedServices": [
    {
      "referenceName": "AADService_Ingestion",
      "type": "LinkedServiceReference"
    }
  ],
  "authentication": {
    "type": "MSI",
    "resource": "https://mps-ingestorservice-ppe.azurewebsites.net"
  }
}

And this is the output of the web activity:

{
  "effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)",
  "executionDuration": 2,
  "durationInQueue": {
    "integrationRuntimeQueue": 0
  },
  "billingReference": {
    "activityType": "ExternalActivity",
    "billableDuration": [
      {
        "meterType": "AzureIR",
        "duration": 0.016666666666666666,
        "unit": "Hours"
      }
    ]
  }
}

App Details

Screenshots

See the screenshot of a pipeline run: the Durable function instance fails within 5 seconds. [image]

If deployed to Azure

We have deployed function in PPE.

cgillum commented 4 years ago

Hi @sunnyhay - I took a look at your app and found some issues which are probably impacting your solution. You should be able to see the same analysis I'm looking at if you check out the Diagnose and Solve section of the Azure Functions portal.

In particular, it looks like you're reusing the same storage account and task hub name across multiple apps:

| App Name | Task Hub | Storage Account | Example Instance ID |
| --- | --- | --- | --- |
| mps-webhookservice-ppe | AEPIngestorHub | mpsblobaccountppe | b109c87d8d3c460eaaa83536f0ac8923 |
| mps-ingestorservice-ppe | AEPIngestorHub | mpsblobaccountppe | b109c87d8d3c460eaaa83536f0ac8923 |

When you do this, both of your apps compete for messages in the same queues, which often causes things to not run correctly. For that reason, each app you deploy must use either a unique task hub name or a different storage account. More information here: https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-task-hubs?tabs=csharp
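As a rough sketch of the first option, one of the two apps could declare its own task hub name in its host.json. The name `WebhookServiceHub` below is a hypothetical placeholder, not something taken from either app, and this assumes that app does not need to process messages from `AEPIngestorHub` itself.

```json
// host.json for one of the two apps (hub name is a hypothetical example)
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "WebhookServiceHub"
    }
  }
}
```

The other app would keep `AEPIngestorHub`, so the two apps would no longer read from the same queues in the shared storage account.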

Also, you may want to upgrade the version of the NuGet package you're using from 1.8.3 to 1.8.5 (even better if you can move to 2.2.2, but there are some breaking changes between 1.x and 2.x).

sunnyhay commented 4 years ago

Thanks for the prompt response, Chris. Good catch! We've resolved this kind of conflict before. Yes, both services use the same task hub intentionally, since the webhook service is used to send messages to the ingestor service. The webhook service receives messages pushed by Adobe IO and sends external events to the waiting ingestor service. The webhook service is also a plain Azure function, not a Durable function. My team has a big release next week, so we'll consider the breaking changes of the dependency upgrade after that. Sorry, I forgot to mention the service plan details: we're using a Production S1 instance for the App Service plan. Let me know if there is any scale concern for this issue. This is an intermittent problem, and so far we have no clue how to track it down.

sunnyhay commented 4 years ago

Since our function is wrapped in a Data Factory web activity, it's also possible that ADF itself encounters a problem that causes the function launch to fail. I'd like to know how to troubleshoot this scenario with Functions-specific debugging techniques, if possible. In an IaaS case, I would simply log on to the machine and check the logs. For a PaaS case like this, do you have any suggestions? Thanks.

cgillum commented 3 years ago

Here are the best resources for debugging in Azure Functions. The information here applies both to PaaS and self-hosted: