Azure / azure-sdk-for-net

This repository is for active development of the Azure SDK for .NET. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/dotnet/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-net.
MIT License
5.47k stars 4.8k forks source link

[QUERY] Azure Function Event Grid Handler scaling #30508

Closed RDavis3000 closed 2 months ago

RDavis3000 commented 2 years ago

Library name and version

Microsoft.Azure.Webjobs.Extensions.EventGrid 3.1.0

Query/Question

I have a data ingestion project which uses Event Grid with Azure Function subscribers/handlers/publishers, basically a function does its part of the data processing and then publishes an event which is picked up by the next function(s) in the chain. When doing a large ingestion (couple of million records) starting at 2022-08-15T06:11:33.73Z, the performance seems good initially, but after an initial spike of activity the functions seem to receive fewer and fewer event grid trigger hits and it just crawls along. After 24 hours I got a sudden load of 800K dead lettered events because their TTL had expired. The function app CPU rarely went above 40% and it has auto scale for CPU above 70%, running on a Premium Plan. I'm not sure why the event grid didn't push the functions harder considering there were hundreds of thousands of events waiting to be delivered. I've attached the relevant metrics, happy to share any other information required.

Function App: /subscriptions/814f37ac-6bb0-4108-9c29-72909f9eef1e/resourceGroups/rg-u-lxr-aue-ingestion/providers/Microsoft.Web/sites/fa-u-lxr-aue-ingestion-doors

Event Grid Topic: /subscriptions/814f37ac-6bb0-4108-9c29-72909f9eef1e/resourceGroups/rg-u-lxr-aue-ingestion/providers/Microsoft.EventGrid/topics/egt-u-lxr-aue-ingestion-doors

My query is basically how can I get EventGrid to deliver more events, faster, to the function triggers? And are you able to take a look behind the scenes and tell me if something was going wrong which I don't have visibility of?

EventGridMetrics.xlsx

Environment

Project Assemblies: Azure.Messaging.EventGrid 4.8.2 Microsoft.Azure.Webjobs.Extensions.EventGrid 3.1.0 Microsoft.NET.Sdk.Functions 4.1.9 Microsoft.Azure.Functions.Extensions 1.1.0

Running Azure Functions runtime ~4

dotnet --info .NET SDK (reflecting any global.json): Version: 6.0.303 Commit: 7494fee64a

Runtime Environment: OS Name: Windows OS Version: 10.0.22000 OS Platform: Windows RID: win10-x64 Base Path: C:\Program Files\dotnet\sdk\6.0.303\

global.json file: Not found

Host: Version: 6.0.8 Architecture: x64 Commit: 55fb7ef977

.NET SDKs installed: 3.1.422 [C:\Program Files\dotnet\sdk] 5.0.409 [C:\Program Files\dotnet\sdk] 6.0.301 [C:\Program Files\dotnet\sdk] 6.0.303 [C:\Program Files\dotnet\sdk]

.NET runtimes installed: Microsoft.AspNetCore.App 3.1.26 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 3.1.28 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 5.0.17 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 6.0.6 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 6.0.8 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.NETCore.App 2.1.30 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 3.1.26 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 3.1.28 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 5.0.17 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 6.0.6 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 6.0.8 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.WindowsDesktop.App 3.1.26 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App] Microsoft.WindowsDesktop.App 3.1.28 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App] Microsoft.WindowsDesktop.App 5.0.17 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App] Microsoft.WindowsDesktop.App 6.0.6 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App] Microsoft.WindowsDesktop.App 6.0.8 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

Download .NET: https://aka.ms/dotnet-download

Learn about .NET Runtimes and SDKs: https://aka.ms/dotnet/runtimes-sdk-info

m-redding commented 2 years ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.

JoshLove-msft commented 2 years ago

The issue appears to be that the Event Grid service is not delivering events to the function in a timely manner. @RDavis3000 can you confirm that the deadlettered events were never delivered to the function? If that is the case, then we would need to ask for assistance from the service team in troubleshooting this.

ghost commented 2 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jfggdl.

Issue Details
### Library name and version Microsoft.Azure.Webjobs.Extensions.EventGrid 3.1.0 ### Query/Question I have a data ingestion project which uses Event Grid with Azure Function subscribers/handlers/publishers, basically a function does its part of the data processing and then publishes an event which is picked up by the next function(s) in the chain. When doing a large ingestion (couple of million records) starting at 2022-08-15T06:11:33.73Z, the performance seems good initially, but after an initial spike of activity the functions seem to receive fewer and fewer event grid trigger hits and it just crawls along. After 24 hours I got a sudden load of 800K dead lettered events because their TTL had expired. The function app CPU rarely went above 40% and it has auto scale for CPU above 70%, running on a Premium Plan. I'm not sure why the event grid didn't push the functions harder considering there were hundreds of thousands of events waiting to be delivered. I've attached the relevant metrics, happy to share any other information required. Function App: /subscriptions/814f37ac-6bb0-4108-9c29-72909f9eef1e/resourceGroups/rg-u-lxr-aue-ingestion/providers/Microsoft.Web/sites/fa-u-lxr-aue-ingestion-doors Event Grid Topic: /subscriptions/814f37ac-6bb0-4108-9c29-72909f9eef1e/resourceGroups/rg-u-lxr-aue-ingestion/providers/Microsoft.EventGrid/topics/egt-u-lxr-aue-ingestion-doors My query is basically how can I get EventGrid to deliver more events, faster, to the function triggers? And are you able to take a look behind the scenes and tell me if something was going wrong which I don't have visibility of? [EventGridMetrics.xlsx](https://github.com/Azure/azure-sdk-for-net/files/9335484/EventGridMetrics.xlsx) ### Environment Project Assemblies: Azure.Messaging.EventGrid 4.8.2 Microsoft.Azure.Webjobs.Extensions.EventGrid 3.1.0 Microsoft.NET.Sdk.Functions 4.1.9 Microsoft.Azure.Functions.Extensions 1.1.0 Running Azure Functions runtime ~4 dotnet --info .NET SDK (reflecting any global.json): Version: 6.0.303 Commit: 7494fee64a Runtime Environment: OS Name: Windows OS Version: 10.0.22000 OS Platform: Windows RID: win10-x64 Base Path: C:\Program Files\dotnet\sdk\6.0.303\ global.json file: Not found Host: Version: 6.0.8 Architecture: x64 Commit: 55fb7ef977 .NET SDKs installed: 3.1.422 [C:\Program Files\dotnet\sdk] 5.0.409 [C:\Program Files\dotnet\sdk] 6.0.301 [C:\Program Files\dotnet\sdk] 6.0.303 [C:\Program Files\dotnet\sdk] .NET runtimes installed: Microsoft.AspNetCore.App 3.1.26 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 3.1.28 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 5.0.17 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 6.0.6 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.AspNetCore.App 6.0.8 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App] Microsoft.NETCore.App 2.1.30 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 3.1.26 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 3.1.28 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 5.0.17 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 6.0.6 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.NETCore.App 6.0.8 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App] Microsoft.WindowsDesktop.App 3.1.26 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App] Microsoft.WindowsDesktop.App 3.1.28 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App] Microsoft.WindowsDesktop.App 5.0.17 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App] Microsoft.WindowsDesktop.App 6.0.6 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App] Microsoft.WindowsDesktop.App 6.0.8 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App] Download .NET: https://aka.ms/dotnet-download Learn about .NET Runtimes and SDKs: https://aka.ms/dotnet/runtimes-sdk-info
Author: RDavis3000
Assignees: JoshLove-msft
Labels: `Event Grid`, `Service Attention`, `Client`, `customer-reported`, `question`, `needs-team-attention`
Milestone: -
RDavis3000 commented 2 years ago

Hi @JoshLove-msft , thanks for picking this up.

I can find that some dead-lettered events were in fact delivered and processed correctly by my function:

DeadLetter json: [ { "id": "b7a981e2-91e4-44bb-8b6a-0746fff9b20e", "eventTime": "2022-08-11T07:39:18.3285837Z", "eventType": "InstanceManager.StepExecutionComplete", "dataVersion": "1.0", "metadataVersion": "1", "topic": "/subscriptions/814f37ac-6bb0-4108-9c29-72909f9eef1e/resourceGroups/rg-u-lxr-aue-ingestion/providers/Microsoft.EventGrid/topics/egt-u-lxr-aue-ingestion-doors", "subject": "InstanceManager.StepExecutionComplete", "deadLetterReason": "TimeToLiveExceeded", "deliveryAttempts": 2, "lastDeliveryOutcome": "TimedOut", "publishTime": "2022-08-11T07:39:18.2810770Z", "lastDeliveryAttemptTime": "2022-08-12T18:00:08.0839967Z", "data": { "StepId": "3f3a1188-828f-4007-a2f0-6ef9e2be8050", "InstanceId": "1fa3209d-a3df-4664-b427-06ff5b31e2a7", "Details": "Called by 3f3a1188-828f-4007-a2f0-6ef9e2be8050 because SELF CODE COMPLETE" } } ]

Processed at 2022-08-12T18:00:08.1721073Z Operation ID: ab1d67293751c72bcf45c4984ee2d52f

So that's weird. Given the somewhat messy nature of event duplication is it possible that two events were emitted and one was dead lettered and the other processed? Can that happen?

Here is another dead-letter record for an event which was actually delivered and processed:

[ { "id": "11168dd3-ba8c-4058-9f5b-c52535eca590", "eventTime": "2022-08-11T14:20:29.9733682Z", "eventType": "DOORS.ArtifactHeadersRemain", "dataVersion": "1.0", "metadataVersion": "1", "topic": "/subscriptions/814f37ac-6bb0-4108-9c29-72909f9eef1e/resourceGroups/rg-u-lxr-aue-ingestion/providers/Microsoft.EventGrid/topics/egt-u-lxr-aue-ingestion-doors", "subject": "DOORS.ArtifactHeadersRemain", "deadLetterReason": "MaxDeliveryAttemptsExceeded", "deliveryAttempts": 8, "lastDeliveryOutcome": "TimedOut", "publishTime": "2022-08-11T14:20:29.9683774Z", "lastDeliveryAttemptTime": "2022-08-11T15:24:49.0072095Z", "data": { "IngestionRequestInstanceId": "1fa3209d-a3df-4664-b427-06ff5b31e2a7", "SourceIngestionRequestInstanceStepId": "96124739-4179-4986-bed6-e7416ee534b2", "EventModel": { "ProjectId": "_pzbw0FU3Eeu0u82irxw9ng", "Id": 0, "SourceUri": "https://lxra.clm.ibmcloud.com/rm/publish/resources/*?modifiedSince=1901-01-01T00%3A00%3A00.000Z&scoped=true&size=100&projectURI=_pzbw0FU3Eeu0u82irxw9ng&token=_AieMkBk5Ee2ItfaVDUgCYA&page=75" } } } ] Kusto smoking gun:

union 
    traces
| where timestamp > datetime("2022-08-10T02:56:23.169Z") and timestamp < datetime("2022-08-17T02:56:23.169Z")
| where * has "Child"
    and * has "instance"
    and * has "step"
    and * has "created"
    and * has "IngestionRequestInstanceId:1fa3209d-a3df-4664-b427-06ff5b31e2a7"
    and * has "URL:"
    and * has "https://lxra.clm.ibmcloud.com/rm/publish/resources/*?modifiedSince=1901-01-01T00%3A00%3A00.000Z&scoped=true&size=100&projectURI=_pzbw0FU3Eeu0u82irxw9ng&token=_AieMkBk5Ee2ItfaVDUgCYA&page=75"
| order by timestamp desc
| take 100

For the whole operation:

union
*
| where operation_Id == '51e924a16055a7e45f50c6f53bf326ac'

A bit more digging shows this record looks like it was delivered 7 times. operationId 51e924a16055a7e45f50c6f53bf326ac processed the event, for the other 6 the my code detected the duplicates and ignored them.

let opids = traces
| where timestamp > datetime("2022-08-10T02:00:41.207Z") and timestamp < datetime("2022-08-17T02:00:41.207Z")
| where * has "projectURI=_pzbw0FU3Eeu0u82irxw9ng&token=_AieMkBk5Ee2ItfaVDUgCYA&page=75"
| distinct operation_Id
| project operation_Id;
union isfuzzy=true
    requests,
    exceptions,
    traces
| where operation_Id in (opids)
| order by operation_Id, timestamp asc

But the majority have not been delivered, like this one with only one attempt in 24 hours:

{
    "id": "2e669d53-2b92-4af4-8808-63df0a1451f9",
    "eventTime": "2022-08-11T06:14:10.8365432Z",
    "eventType": "InstanceManager.StepExecutionComplete",
    "dataVersion": "1.0",
    "metadataVersion": "1",
    "topic": "/subscriptions/814f37ac-6bb0-4108-9c29-72909f9eef1e/resourceGroups/rg-u-lxr-aue-ingestion/providers/Microsoft.EventGrid/topics/egt-u-lxr-aue-ingestion-doors",
    "subject": "InstanceManager.StepExecutionComplete",
    "deadLetterReason": "TimeToLiveExceeded",
    "deliveryAttempts": 1,
    "lastDeliveryOutcome": "Probation",
    "publishTime": "2022-08-11T06:14:10.8206787Z",
    "lastDeliveryAttemptTime": "2022-08-11T06:14:10.8362971Z",
    "data": {
        "StepId": "c305c2fb-5782-484d-aca7-3fb2089744cd",
        "InstanceId": "1fa3209d-a3df-4664-b427-06ff5b31e2a7",
        "Details": "Called by 59eaf11b-94b7-42ea-85e0-670fad25c103 because SELF CODE COMPLETE *** Called by 59eaf11b-94b7-42ea-85e0-670fad25c103 because all finished ChildrenCount:0 HasChildren:False State:Finished *** Called by 8045ab1a-ec40-4bb5-bdc7-d802539edd60 because all finished ChildrenCount:2 HasChildren:True State:Finished"
    }
}

Let me know if you want more info

RDavis3000 commented 2 years ago

@m-redding @JoshLove-msft

This seems to have dropped off the queue, can it be assigned to someone else?

JoshLove-msft commented 2 years ago

Hi @RDavis3000, Since the root cause appears to be that the Event Grid service is not delivering events quickly enough, we've marked the issue as needing service attention. Another option at your disposal would be to file a support ticket from the Azure Portal.

github-actions[bot] commented 2 months ago

Hi @RDavis3000, we deeply appreciate your input into this project. Regrettably, this issue has remained unresolved for over 2 years and inactive for 30 days, leading us to the decision to close it. We've implemented this policy to maintain the relevance of our issue queue and facilitate easier navigation for new contributors. If you still believe this topic requires attention, please feel free to create a new issue, referencing this one. Thank you for your understanding and ongoing support.