chriscartled opened this issue 1 year ago
Thank you for your feedback. This has been routed to the support team for assistance.
@chriscartled Apologies for the late reply. Thanks for reaching out to us and reporting this issue. We are looking into this issue and we will provide an update.
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @ahmedelnably @fabiocav.
Adding Service team to look into this.
@ahmedelnably @fabiocav Could you please provide some pointers on this? Thanks in advance.
+1 to this problem; it also applies to Node.js-based functions.
This also happens with the Java Blob Trigger. The message shows every 10 seconds.
My code is very simple:
```java
package com.function;

import com.microsoft.azure.functions.annotation.*;
import com.microsoft.azure.functions.*;

public class BlobExample {
    @FunctionName("BlobExample")
    @StorageAccount("AzureWebJobsStorage")
    public void run(
        @BlobTrigger(name = "content", path = "container/{name}", dataType = "binary") byte[] content,
        @BindingName("name") String name,
        final ExecutionContext context
    ) {
        context.getLogger().info("Java Blob trigger function processed a blob. Name: " + name + "\n Size: " + content.length + " Bytes");
        //context.getLogger().info("Java Blob trigger function processed a blob. Content: " + content);
    }
}
```
I encountered the same issue with my .NET function. I eventually tracked it down to the `ScanBlobScanLogHybridPollingStrategy.PollNewBlobsAsync` method. This was causing millions of transactions per day.

The paged blob API returns a continuation token that indicates whether the current page is the last page. That token is an empty string for the last page, but the code was checking it against `null`. As a result, the latest-modified timestamp was never assigned and stayed at `DateTime.Min`, so every blob was checked again and again against that date.
I noticed someone else fixed the issue here: https://github.com/Azure/azure-sdk-for-net/commit/a69839f4a33ab39a7d1441e36415f08e00b76ff0
I think Microsoft.Azure.WebJobs.Extensions.Storage.Blobs 5.1.2 introduced the fix
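To make the null-vs-empty-string bug concrete, here is a minimal sketch. This is JavaScript for illustration, not the actual C# SDK code; the function names and the `{ blobs, continuationToken }` page shape are assumptions, and only the shape of the comparison mirrors the bug described above:

```javascript
// Buggy sweep: the paged listing signals "last page" with an empty-string
// continuation token, but the commit of the watermark only happens when the
// token is strictly null -- so it never runs and the watermark stays at epoch
// (the DateTime.Min equivalent).
function sweep(pages) {
  let latestModified = new Date(0);
  let current = new Date(0);
  for (const page of pages) {
    for (const blob of page.blobs) {
      if (blob.lastModified > current) current = blob.lastModified;
    }
    if (page.continuationToken === null) { // BUG: never true, token is ""
      latestModified = current;
    }
  }
  return latestModified;
}

// Fixed sweep: treat a null OR empty token as end-of-listing, so the
// watermark is committed after the last page.
function sweepFixed(pages) {
  let latestModified = new Date(0);
  let current = new Date(0);
  for (const page of pages) {
    for (const blob of page.blobs) {
      if (blob.lastModified > current) current = blob.lastModified;
    }
    if (!page.continuationToken) { // "" and null both end the sweep
      latestModified = current;
    }
  }
  return latestModified;
}
```

With the buggy version the watermark never advances past epoch, which is exactly why every scan polls "for blobs newer than '0001-01-01T00:00:00.000'".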
Edit: after switching to the latest version (5.1.3), I found that this polling mechanism still continuously re-checks the most recent batch of files, due to these lines:

Above, `CurrentSweepCycleLatestModified` is set to the most recent last-modified date. This then becomes `LastSweepCycleLatestModified` afterwards.

If no new files are coming in, `LastSweepCycleLatestModified` stays assigned to the last-modified date of the most recent file, still unchanged. As a result, those most recent files are checked over and over again, continuously.

According to the earlier comment about rounding, it seems like you can't really change that comparison logic.

In my case I had copied over a batch of 60 files at the same time (out of ~700 in the container), and they were continuously being re-checked every 10 seconds (the `PollingInterval`).
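A minimal sketch of that steady-state behaviour (JavaScript for illustration; the function name and blob shape are assumptions, and the real extension's filtering is more involved):

```javascript
// If the watermark comparison is inclusive (>=, which the rounding of
// last-modified times to whole seconds forces it to be), every blob whose
// lastModified equals the stored watermark is selected again on each cycle.
function blobsToRecheck(blobs, lastSweepCycleLatestModified) {
  return blobs.filter(b => b.lastModified >= lastSweepCycleLatestModified);
}

// Example: 60 files copied at the same instant share one lastModified.
// Once that instant becomes the watermark, all 60 match on every poll.
```

With no new uploads the watermark never advances, so the same batch is listed again on every polling interval (every 10 seconds here) until newer blobs arrive.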
+1, Node.js 20 on Windows.
Same behaviour. It's even stranger, since new blobs are also skipped, with the same message:
```javascript
const { app, output } = require('@azure/functions');
const csv = require('csvtojson');
const { v4: uuidv4 } = require('uuid');

const eventHubOutput = output.eventHub({
    eventHubName: 'mut-admin-detalle-tabla-dividendo',
    connection: 'EventHubConnection',
});

app.storageBlob('mut-admin-detalle-tabla-dividendo', {
    path: 'landing/mut-admin-detalle-tabla-dividendo/{name}',
    connection: 'StorageConnection',
    return: eventHubOutput,
    handler: async (blob, context) => {
        context.log(`Storage blob function processed blob "${context.triggerMetadata.name}" with size ${blob.length} bytes`);
        const stringBlob = blob.toString('utf8');
        const batchId = uuidv4().toString();
        const landedTimestamp = new Date();
        const blobUri = context.triggerMetadata.uri;
        const jsonBlob = await csv({ delimiter: '|' })
            .fromString(stringBlob)
            .subscribe((json) => {
                return new Promise((resolve, reject) => {
                    try {
                        json['batch_id'] = batchId;
                        json['message_id'] = uuidv4().toString();
                        json['landed_out_dateTime'] = landedTimestamp;
                        json['blob_uri'] = blobUri;
                        resolve();
                    } catch (e) {
                        reject(e);
                    }
                });
            });
        context.log(`Sending ${jsonBlob.length} messages to Event Hubs`);
        return jsonBlob;
    },
});
```
### Library name and version

Azure.Storage.Blobs 12.15.1

### Describe the bug

I have a blob trigger set up to monitor one of my containers in Azure Storage. The associated scan over the blobs fires non-stop, looping over the same blobs again and again and causing excessive operations on the storage account.

### Expected behavior

Blobs that have already been processed should be skipped.

### Actual behavior

A scan fires immediately after the last one has finished. I'm not sure whether that is expected behaviour, but what is definitely not expected is that the same blobs are detected by the scan over and over again.

I have noticed that at the start of each scan, this log message is printed to the function logs:

```
[Verbose] Poll for blobs newer than '0001-01-01T00:00:00.000' in container '<container>' with ClientRequestId 'xxx' found 1839 blobs in 302 ms. ContinuationToken: False
```

Comparing with other blob triggers, I have discovered that the relevant scaninfo file does not seem to have been created/updated. If I manually create the scaninfo file with a date/time, the next scan will poll for blobs newer than that time, but the scaninfo file is then not updated again automatically.

Blob receipts are being created, and each blob is processed with a message similar to the following:

```
[Verbose] Blob '<blob name>' will be skipped for function 'FillBlobMetaData' because this blob with ETag '"0x8D8798F7B192688"' has already been processed. PollId: '566d148e-11d0-4a80-86a0-c7d4cfab3b58'. Source: 'ContainerScan'.
```

### Reproduction Steps

Enable my function within the function app on the Azure Portal.

### Environment

- .NET 6.0 Azure Functions app, runtime version 4.16.5.20396
- Microsoft.Azure.WebJobs.Extensions.Storage.Blobs v5.1.1