Azure / azure-webjobs-sdk

Azure WebJobs SDK

Allow a "newer than" timestamp to be specified for blob trigger #1327

Open · mathewc opened this issue 7 years ago

mathewc commented 7 years ago

Currently our blob scan algorithm will process ALL blobs in the target container that don't have blob receipts. We should investigate whether we can allow the start date for the scan to be specified.

Scenario: assume all blobs in a container have been processed by a blob trigger function in a particular app (WebJob host). Now, if that function is moved to a different app (different host/host ID) all the blobs will be reprocessed, because there are no receipts for those blobs for that host ID.
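For illustration only, here is roughly how such a setting could surface on the trigger attribute. The `NewerThan` property shown in the comment is hypothetical and not part of the SDK, and the container/function names are made up:

```csharp
using System.IO;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class ProcessWorkItems
{
    // The commented-out NewerThan value is the proposed (hypothetical) setting;
    // only blobs written after that timestamp would be picked up by the scan.
    public static void Run(
        [BlobTrigger("samples-workitems/{name}" /*, NewerThan = "2020-01-01T00:00:00Z" */)] Stream blob,
        string name,
        ILogger log)
    {
        log.LogInformation("Processing blob {Name} ({Length} bytes)", name, blob.Length);
    }
}
```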

brettsam commented 7 years ago

I agree we should expose something to control this. Since we already maintain the blob scan pointer for tracking our last processed blob, we'd need to make sure that the behavior makes sense when these two interact.

For sake of argument -- let's call this new property newerThan.

For example: if newerThan is later than the stored blob scan pointer, the scan starts at newerThan; if the pointer is later, the scan starts at the pointer.

In other words -- we'd start our scan from whichever was newest between newerThan and the stored blob scan pointer.

As a side note -- I think writing out informational logs (like we do for Timer) would be very helpful here. Something like Found blob scan pointer of {date} and NewerThan value of {date}. Starting scan at {date} because it is the most recent. To change this, .... It'd only write out once at Listener start and could go a long way towards explaining the logic without needing to look up docs.
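A minimal sketch of how the listener might reconcile the two values, assuming a hypothetical `newerThan` setting (the helper and its names are illustrative, not existing SDK code):

```csharp
using System;
using Microsoft.Extensions.Logging;

public static class BlobScanStart
{
    // Hypothetical helper: "newerThan" is the proposed setting discussed above and
    // does not exist in the SDK today; the blob scan pointer is the position the
    // listener already persists between runs.
    public static DateTimeOffset GetEffectiveScanStart(
        DateTimeOffset? newerThan,
        DateTimeOffset blobScanPointer,
        ILogger logger)
    {
        // Start from whichever of the two values is more recent.
        var start = newerThan.HasValue && newerThan.Value > blobScanPointer
            ? newerThan.Value
            : blobScanPointer;

        // One-time informational log at listener start, as suggested above.
        logger.LogInformation(
            "Found blob scan pointer of {Pointer} and NewerThan value of {NewerThan}. " +
            "Starting scan at {Start} because it is the most recent.",
            blobScanPointer, newerThan, start);

        return start;
    }
}
```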

ransagy commented 7 years ago

This would be very helpful in a few scenarios I came across. My current case: scanning over SQL audit blobs generated by Azure's SQL Blob Auditing feature. We have a pretty high retention rate for those but only need to process the logs going forward, which sounds perfect for an Azure Function with a blob trigger, until you realize you have to let it run in a no-op style over all of them, for each host, before it's usable.

This would really help similar scenarios.

paulbatum commented 7 years ago

One possibility that could help here is using Event Grid's support for routing storage events to Azure Functions. This approach does not involve any blob scanning, which is the root cause of the issue here.

https://docs.microsoft.com/en-us/azure/event-grid/resize-images-on-storage-blob-upload-event
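A minimal sketch of that approach, assuming the Microsoft.Azure.WebJobs.Extensions.EventGrid package (newer versions use the Azure.Messaging.EventGrid types instead); the function and event names here are illustrative:

```csharp
using Microsoft.Azure.EventGrid.Models;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.EventGrid;
using Microsoft.Extensions.Logging;

public static class BlobCreatedHandler
{
    // Fires only for events delivered by the Event Grid subscription (for example
    // Microsoft.Storage.BlobCreated), so pre-existing blobs are never rescanned.
    [FunctionName("BlobCreatedHandler")]
    public static void Run(
        [EventGridTrigger] EventGridEvent eventGridEvent,
        ILogger log)
    {
        log.LogInformation("Event {Type} for subject {Subject}",
            eventGridEvent.EventType, eventGridEvent.Subject);
    }
}
```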

jaltin commented 6 years ago

Resurfacing this as this is something I would love to be able to do. Any idea on if/when this might be looked at?

Thx!

paulbatum commented 6 years ago

No idea at this time (that's what the "unknown" milestone means).

jtlz2 commented 5 years ago

Another year has passed - any update?

rollsch commented 5 years ago

Any update? This is kind of annoying, as I have to sit there waiting for 10 minutes for the trigger to reprocess each blob. I'm not sure why, but the receipts get reset sometimes, which means it will reprocess everything.

pablosguajardo commented 4 years ago

Hello, I have the same problem. It fires 3 times; presumably these are the events I created for testing. But how do I remove them all so that I can create a new one and have only that one run?

In the console there are 3 events that are triggered at the same time:

2020-09-18T15:28:13.541 [Information] Executing 'D2KEventGrid' (Reason='EventGrid trigger fired at 2020-09-18T15:28:13.2184099+00:00', Id=595fc416-0280-43e1-8dc5-f285640e986c)
2020-09-18T15:28:13.569 [Information] Executing 'D2KEventGrid' (Reason='EventGrid trigger fired at 2020-09-18T15:28:13.0399176+00:00', Id=f1796f9a-f06c-47e8-8e67-be34950629a3)
2020-09-18T15:28:13.570 [Information] Executing 'D2KEventGrid' (Reason='EventGrid trigger fired at 2020-09-18T15:28:12.7600042+00:00', Id=c34bb611-2c99-4c1d-ba4b-5675cc87236c)

paulbatum commented 4 years ago

@pablosguajardo I think you're talking about something different from what is being discussed here, because it looks like you are using Event Grid, while this issue is discussing the behavior of the built-in blob trigger.

Floriszz commented 3 years ago

What about this additional parameter 'Start time'? Does this have anything to do with this? I can't find documentation about this parameter. [screenshot attached]

santi-paz commented 3 years ago

Any update on this? This feature would be very useful.

tbasallo commented 2 years ago

Is this related to the same blob being triggered for multiple hosts? For example, a blob already processed by a production Function is also triggered when a dev machine runs the function/project locally. We've seen files from YEARS ago start to trigger for processing.

bdlb77 commented 2 years ago

Any Updates on this functionality?

nicm-CC commented 1 year ago

Any update on the above discussion?

v-bafa commented 1 year ago

Really need this feature!! Please help add it : )

abouroubi commented 10 months ago

More than 6 years later, can we at least have some news about it?

ToniPR commented 8 months ago

I am hoping for this feature as well. When you first publish the trigger in a function app and the container already holds existing blobs, uploading one file for testing doesn't trigger just that file; it triggers the test file plus every other file already in the container. After that it's fine: the next file you upload will only trigger for that one.

https://stackoverflow.com/questions/51675455/stop-azure-blob-trigger-function-from-being-triggered-on-existing-blobs-when-fun
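One possible workaround (not an SDK feature) is to filter inside the function itself and skip blobs that predate a cutoff. A rough sketch, assuming the Azure.Storage.Blobs-based extension where `BlobClient` is a valid trigger binding type; the container name and cutoff date are placeholders:

```csharp
using System;
using Azure.Storage.Blobs;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class NewBlobsOnly
{
    // Cutoff chosen for illustration, e.g. the date the function was first deployed.
    private static readonly DateTimeOffset Cutoff = DateTimeOffset.Parse("2024-01-01T00:00:00Z");

    public static void Run(
        [BlobTrigger("samples-workitems/{name}")] BlobClient blob,
        string name,
        ILogger log)
    {
        // Pre-existing blobs are still scanned by the trigger, but we bail out
        // before doing any real work on them.
        if (blob.GetProperties().Value.LastModified < Cutoff)
        {
            log.LogInformation("Skipping pre-existing blob {Name}", name);
            return;
        }

        log.LogInformation("Processing new blob {Name}", name);
        // ... actual processing ...
    }
}
```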

mario-dnet commented 6 months ago

This would be really helpful!

widget31too commented 3 months ago

I too vote for some way to prevent old files from triggering new functions. Just deployed a change to how I process files (new function, rather than "remodeling" of old function), and have over 1500 files worth of useless processing going on.

At least we were smart and have a check for whether the file has already been added to our system!