Azure / azure-functions-eventhubs-extension

Event Hubs extension for Azure Functions
MIT License

Event Hub Extension starves two partitions regularly #63

Closed icecog closed 2 years ago

icecog commented 4 years ago

Hi,

I've been having this issue for a while now and I cannot figure it out - I'd like to say I've tried everything, but I hope not.

The problem is that the Azure function consistently starves 2 partitions: a lease is taken on each and then they are just left there until the lease breaks (an hour or so later). Then two other partitions are starved instead. Whenever I redeploy the function it grabs all of them, but it soon starts to ignore two of them. It takes about 2 minutes for the problem to start.

I'm running an Azure function (v3, project config below) that gets data from an event hub with 16 partitions. It runs on an App Service Plan with around 11 nodes (plenty of CPU to spare).
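
For reference, here is a quick sanity check of what an even spread of partition leases would look like with these numbers. This is a minimal sketch in Python, used only for the arithmetic; it assumes an idealized balancer that spreads leases as evenly as possible, which is an assumption and not a description of how the extension actually assigns leases. The function name `even_lease_spread` is hypothetical.

```python
def even_lease_spread(partitions: int, hosts: int) -> list:
    """Partitions per host if leases were spread as evenly as possible.

    Idealized model only: it does not reproduce the extension's real
    lease-balancing behavior.
    """
    base, extra = divmod(partitions, hosts)
    # The first `extra` hosts take one more partition than the rest.
    return [base + 1 if i < extra else base for i in range(hosts)]


if __name__ == "__main__":
    # 16 partitions on 11 nodes: some hosts own 2 partitions, the rest own 1.
    print(even_lease_spread(16, 11))
    # 16 partitions on 16 nodes: one partition per host.
    print(even_lease_spread(16, 16))
```

Under that assumption no partition should be left without an active owner, so two partitions with leases held but no progress points at hosts holding leases without processing, not at a shortage of nodes.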

Operating System : Windows
Runtime version : 3.0.13353.0

And I have a deployment slot. This is probably the culprit, as it sometimes doesn't stop consuming from the event hub after swapping... but I'm not sure, since the problem keeps happening even if I stop it (90% sure, I'll double check). (I deployed the slot using an ARM template, so maybe something in there is messing things up? I'm deploying it through Terraform, which is why it looks weird; it's posted at the bottom.)

I've redeployed the function in question, I've changed all three settings in host.json, and I've even cycled the machines on the App Service Plan (set scale to 1 and then to 11 again). I've also tested creating a new consumer group, but the same problem remains.

And this is beyond just upgrading to the latest NuGet packages and checking whether there is anything in my code. Luckily I have a second function that doesn't show this issue to compare with, but that hasn't helped. So my guess is that something in the extension is causing this.

Like I said, this is only true for one of two functions that read the exact same data. The problem only occurs to a significant degree in one of them; the other may have the same problem, but to a much smaller degree for some reason. They both have almost the same setup, with output clients for Event Grid and Redis Cache (not provided by the Azure Functions runtime). The only real difference is that the problematic function has an integration with another event hub, which I manage myself through an EventHubClient.

Below are pictures of the event hub checkpoints, as well as a graph of the delay between a message being enqueued on the event hub and being processed.

[Images: partition_checkpoints_startup, partition_checkpoints_starvation_starting, partition_checkpoints_starvation, event_processing_delays (seconds)]

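// Requires `using System.Linq;` for Last(). Takes the enqueue time of the newest event in the batch.
var lastTimestamp = (DateTime)eventStream.Last().SystemProperties["x-opt-enqueued-time"];
// Delay between that event being enqueued and it being processed now.
var lastDelay = DateTime.UtcNow - lastTimestamp;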

I'm running the following (with some internal packages removed):

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>netcoreapp3.0</TargetFramework>
    <AzureFunctionsVersion>v3</AzureFunctionsVersion>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="GeoCoordinate.NetCore" Version="1.0.0.1" />
    <PackageReference Include="GeoLibrary" Version="1.1.0" />
    <PackageReference Include="Microsoft.ApplicationInsights" Version="2.14.0" />
    <PackageReference Include="Microsoft.Azure.Functions.Extensions" Version="1.0.0" />
    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.EventHubs" Version="4.1.1" />
    <PackageReference Include="Microsoft.NET.Sdk.Functions" Version="3.0.7" />
    <PackageReference Include="TaskTupleAwaiter" Version="1.2.0" />
  </ItemGroup>
  <ItemGroup>
    <None Update="host.json">
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
    </None>
    <None Update="local.settings.json">
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
      <CopyToPublishDirectory>Never</CopyToPublishDirectory>
    </None>
  </ItemGroup>
  <ItemGroup>
    <Folder Include="Properties\PublishProfiles\" />
  </ItemGroup>
  <ItemGroup>
    <ProjectReference Include="..\..\Service\Service.csproj" />
  </ItemGroup>
</Project>

Terraform (wrapping a raw ARM template) for the deployment slot:

# Run a raw ARM template to create the slot
resource "azurerm_template_deployment" "function_slot" {
  name = "create_function_slot-${azurerm_function_app.function.name}-${var.slotName}-${random_string.random.result}"
  parameters = {
    "functionSlotName"               = "${azurerm_function_app.function.name}/${var.slotName}"
    "functionName"                   = azurerm_function_app.function.name
    "slotName"                       = var.slotName
    "appServicePlan_Id"              = var.appServicePlanId
    "AppSettingsAsJsonAsBase64"      = base64encode(jsonencode(merge(local.app_settings,
     {
#     "FUNCTIONS_WORKER_RUNTIME"       = var.function_app_runtime
     "FUNCTIONS_EXTENSION_VERSION"    = var.function_app_extension_version
#     "APPINSIGHTS_INSTRUMENTATIONKEY" = var.applicationInsightsInstrumentationKey
     "AzureWebJobsStorage" = var.storageConnectionString
#     "WEBSITE_CONTENTAZUREFILECONNECTIONSTRING" = var.storageConnectionString
#     "AzureWebJobsDashboard" = var.storageConnectionString
#     "WEBSITE_CONTENTSHARE" = "${local.slotName}-content"
     }
   )))
  }
  resource_group_name    = var.resourceGroupName
  deployment_mode = "Incremental"

  template_body = <<BODY
  {
      "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
      "contentVersion": "1.0.0.0",
      "parameters": {
          "functionSlotName": {"type": "string", "defaultValue": ""},
          "functionName": {"type": "string", "defaultValue": ""},
          "slotName": {"type": "string", "defaultValue": ""},
          "appServicePlan_Id": {"type": "string", "defaultValue": ""},
          "AppSettingsAsJsonAsBase64": {"type": "string", "defaultValue": ""}
      },
      "variables": {
      },
      "resources": [
          {
              "type": "Microsoft.Web/sites/slots",
              "apiVersion": "2018-11-01",
              "name": "[concat(parameters('functionName'),'/',parameters('slotName'))]",
              "location": "[resourceGroup().location]",
              "dependsOn": [
              ],
              "kind": "functionapp",
              "properties": {
                  "enabled": true,
                  "hostNameSslStates": [
                  ],
                  "serverFarmId": "[parameters('appServicePlan_Id')]",
                  "reserved": false,
                  "isXenon": false,
                  "hyperV": false,
                  "scmSiteAlsoStopped": false,
                  "clientAffinityEnabled": true,
                  "clientCertEnabled": false,
                  "hostNamesDisabled": false,
                  "containerSize": 1536,
                  "dailyMemoryTimeQuota": 0,
                  "httpsOnly": false,
                  "redundancyMode": "None",
                  "siteConfig": {
                      "alwaysOn": false
                  }
              },
              "resources": [
                  {
                      "apiVersion": "2018-11-01",
                      "name": "appsettings",
                      "type": "config",
                      "dependsOn": [
                          "[resourceId('Microsoft.Web/sites/slots', parameters('functionName'),parameters('slotName'))]"
                      ],
                      "properties": "[base64ToJson(parameters('AppSettingsAsJsonAsBase64'))]"
                  }
              ]
          }
      ],
      "outputs": {
          "functionSlotName": {
              "type": "string",
              "value": "[parameters('functionSlotName')]"
          },
          "functionName": {
              "type": "string",
              "value": "[parameters('functionName')]"
          },
          "slotName": {
              "type": "string",
              "value": "[parameters('slotName')]"
          }
      }
  }
  BODY
}

icecog commented 4 years ago

I increased the App Service Plan to 16 nodes and removed the deployment slot, but to no avail.

[Image: partition leases]

Let me also say that, though I doubt it has any impact, we're receiving between 1000 - 3000 messages a second across all partitions.

And here is the host.json config:

{
  "extensions": {
    "eventHubs": {
      "batchCheckpointFrequency": 100,
      "eventProcessorOptions": {
        "maxBatchSize": 100,
        "prefetchCount": 200
      }
    }
  },
  "version": "2.0"
}
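
A rough sketch of what these settings imply for checkpoint cadence, in Python purely for the arithmetic. It assumes that `batchCheckpointFrequency` counts batches processed per partition between checkpoints, that every batch is full (`maxBatchSize` events), and that traffic is spread evenly over the 16 partitions. Both of the last two are simplifying assumptions, so the results are upper bounds and not measurements. The function names are hypothetical.

```python
def max_events_between_checkpoints(batch_checkpoint_frequency: int,
                                   max_batch_size: int) -> int:
    """Upper bound on events processed per partition between checkpoints."""
    return batch_checkpoint_frequency * max_batch_size


def seconds_between_checkpoints(total_rate_per_sec: float, partitions: int,
                                batch_checkpoint_frequency: int,
                                max_batch_size: int) -> float:
    """Approximate checkpoint interval per partition, assuming full batches
    and an even spread of events across partitions."""
    per_partition_rate = total_rate_per_sec / partitions
    events = max_events_between_checkpoints(batch_checkpoint_frequency,
                                            max_batch_size)
    return events / per_partition_rate


if __name__ == "__main__":
    # Values from the host.json above and the stated 1000 to 3000 msg/s.
    print(max_events_between_checkpoints(100, 100))
    print(seconds_between_checkpoints(1000, 16, 100, 100))
    print(seconds_between_checkpoints(3000, 16, 100, 100))
```

With these settings a healthy partition may checkpoint only once every minute or more, so a checkpoint that looks stale for a couple of minutes is not necessarily a starved partition; one that stays stale for much longer is.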

But I've tried everything short of setting prefetchCount to 0.

icecog commented 4 years ago

It turns out it was the damned staging slot that was taking up the partition leases. I'm not sure how I'll solve it, but the first step is to avoid using an ARM template from 2015 to provision it... Even after removing the slot in its entirety, the problem persisted for a little while, or it may have been my imagination. But eventually the production slot started firing on all cylinders.

icecog commented 4 years ago

Nope, I was wrong, it's not the staging slot. :/ Could it be that a function can shut down without releasing the lease it holds? And can we work around this somehow by specifying a shorter lease time?

alrod commented 2 years ago

@icecog, are you still experiencing the issue?

icecog commented 2 years ago

I no longer work there and so have no idea if this issue persists.

Feel free to take whatever action you consider best with this issue.

alrod commented 2 years ago

Closing as no longer relevant.