Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License

[Feature Request] Orchestration Auto-Purge #892

Open mpaul31 opened 5 years ago

mpaul31 commented 5 years ago

Let me know if this is not the right place to request a feature and I'll be happy to move it.

It would be really nice to have a configuration setting in host.json so that orchestrations automatically clean up after themselves, similar to how the PurgeInstanceHistoryAsync method works today.

I'm sure there are many production applications in-flight as we speak that overlooked this and would rather have this happen in the background.

Thanks and keep up the great work on this awesome product!

cgillum commented 5 years ago

Thanks for opening this issue! We'll look into this.

ghost commented 4 years ago

Is there any idea when this will be worked on? We currently have issues where manually purging on a timer continuously hits the function execution time limit. Having successful runs auto-purged would be great.

Edit: As I'll be removing my work account in favor of my main account, I can be contacted at @Archomeda from now on for further updates.

cgillum commented 4 years ago

I've added this issue to be scheduled for our next release. I think it makes a lot of sense for us to have this feature.

olitomlinson commented 4 years ago

In my use case, I use a timer trigger on a 15-minute schedule to purge orchestrations once they have been in a completed runtime state for longer than 2 weeks. This gives me 2 weeks of diagnostic time to investigate any of my orchestrations in production.

However, I never delete orchestrations that are Running or Failed.

So I would suggest having a configurable retention period for each runtime state.

---

Please also emit a 'customMetric' containing a count of how many orchestrations were purged, broken down by runtime state.
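For reference, a minimal sketch of this scheduled-purge pattern (Durable Functions 2.x, C# in-process; the schedule, retention cutoff, and metric name are illustrative, not part of the extension):

```csharp
using System;
using System.Threading.Tasks;
using DurableTask.Core;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class ScheduledPurge
{
    // Runs every 15 minutes and purges Completed orchestrations older
    // than 2 weeks; Running and Failed instances are never touched.
    [FunctionName("ScheduledPurge")]
    public static async Task Run(
        [TimerTrigger("0 */15 * * * *")] TimerInfo timer,
        [DurableClient] IDurableOrchestrationClient client,
        ILogger log)
    {
        PurgeHistoryResult result = await client.PurgeInstanceHistoryAsync(
            DateTime.MinValue,            // no lower bound
            DateTime.UtcNow.AddDays(-14), // completed more than 2 weeks ago
            new[] { OrchestrationStatus.Completed });

        // Surface the purge count so it can be charted as a custom metric.
        log.LogMetric("PurgedOrchestrations", result.InstancesDeleted);
    }
}
```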

sebastianburckhardt commented 4 years ago

If we establish some process for periodic maintenance operations, it may make sense to also support automatic cleaning of entity storage (which currently requires an explicit API call; see #1442).

amdeel commented 4 years ago

Feature Proposal

Implement an Automatic Storage Cleanup feature that runs on a set schedule to perform cleanup maintenance operations for Durable Functions storage. It would call PurgeInstanceHistoryAsync and CleanEntityStorageAsync to delete orchestration and entity history, but could be expanded in the future to cover other cleanup operations.

We could make the settings configurable in the user's host.json. I included some potential default settings:

- `bool UseAutomaticStorageCleanup` -- enables the feature (default: false)
- `TimeSpan CleanupTimespan` -- frequency to run cleanup operations (default: 1 day)
- `TimeSpan PurgeOrchestrationsOfAge` -- purge orchestration instances of at least this age (default: 2 weeks)
- `IEnumerable<OrchestrationStatus> PurgeOrchestrationsWithStatuses` -- runtime statuses eligible for purging (default: Completed)
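Sketched as an options type, the proposal might look like this (a hypothetical shape only; the names and defaults come from the list above, not from any shipped API):

```csharp
using System;
using System.Collections.Generic;
using DurableTask.Core;

// Hypothetical options mirroring the proposed host.json settings.
public class StorageCleanupOptions
{
    public bool UseAutomaticStorageCleanup { get; set; } = false;

    // How often the cleanup maintenance runs.
    public TimeSpan CleanupTimespan { get; set; } = TimeSpan.FromDays(1);

    // Only instances at least this old are purged.
    public TimeSpan PurgeOrchestrationsOfAge { get; set; } = TimeSpan.FromDays(14);

    // Which runtime statuses are eligible for purging.
    public IEnumerable<OrchestrationStatus> PurgeOrchestrationsWithStatuses { get; set; }
        = new[] { OrchestrationStatus.Completed };
}
```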

Considerations

Where do we implement this feature?

Durable extension

DurableTask.AzureStorage

Edit: Some offline discussions led us to decide to implement this feature in DT.AzureStorage.

anel-al commented 3 years ago

@cgillum :)

It would be great to have this feature as part of the platform.

While we are waiting, what is the current official recommendation for applications with a large amount of history data? A timer-based function that calls the delete-history API?

Thank you

ahawes-clarity commented 3 years ago

I have been trying the timer-based approach for a while, using the PurgeInstanceHistoryAsync method, but I keep running into Functions timeout issues. That's why this needs to be built into the platform so it just works.

https://github.com/Azure/azure-functions-durable-extension/issues/1145

I am still looking for a solution. All the while, the histories just keep adding up.

mpaul31 commented 3 years ago

@ahawes-clarity Depending on how much data you store on average per hour or day, your timer trigger could push one or more messages to a storage queue, each with the start and end of a time period to purge. So if the timer trigger runs once per day, add one message per hour to the queue to distribute the workload. You can tweak the retry limit in host.json if need be.
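A minimal sketch of that fan-out (the queue name, schedule, and message shape are assumptions; requires the Storage queue bindings):

```csharp
using System;
using System.Threading.Tasks;
using DurableTask.Core;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

// Hypothetical message type describing one purge window.
public class PurgeWindow
{
    public DateTime From { get; set; }
    public DateTime To { get; set; }
}

public static class DistributedPurge
{
    // Once a day, enqueue one message per hour of the day that fell out
    // of the retention window (here: 14 days ago).
    [FunctionName("EnqueuePurgeWindows")]
    public static void Enqueue(
        [TimerTrigger("0 0 0 * * *")] TimerInfo timer,
        [Queue("purge-windows")] ICollector<PurgeWindow> queue)
    {
        DateTime dayStart = DateTime.UtcNow.Date.AddDays(-14);
        for (int hour = 0; hour < 24; hour++)
        {
            queue.Add(new PurgeWindow
            {
                From = dayStart.AddHours(hour),
                To = dayStart.AddHours(hour + 1),
            });
        }
    }

    // Each message covers only one hour of history, so every execution
    // stays well under the function timeout; failed messages are retried
    // per the queue retry policy in host.json.
    [FunctionName("PurgeWindowWorker")]
    public static Task Purge(
        [QueueTrigger("purge-windows")] PurgeWindow window,
        [DurableClient] IDurableOrchestrationClient client)
    {
        return client.PurgeInstanceHistoryAsync(
            window.From,
            window.To,
            new[] { OrchestrationStatus.Completed });
    }
}
```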

ahawes-clarity commented 3 years ago

@mpaul31 Thanks for the inspiration to put a workaround in place. I did as you said and had my timer trigger output to a queue. I made the queue function call the PurgeInstanceHistoryAsync method with only a 1-hour range. But I also made the queue function recursive, so that it calls itself if the hour it was processing is older than a predetermined date (like 7 days ago). In the normal situation my timer runs once an hour and only 1 hour is processed. But should something happen and I need to clean up more, I just manually inject a message into the queue for a date further back in time, and the queue function keeps re-calling itself hour by hour until it catches back up to the predetermined date. I used this process to clean up the backlog I had.

But in the end, I really think something like this should be built into the platform.
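A hedged reconstruction of that recursive variant (queue name, message shape, and the 7-day boundary are assumptions based on the description above):

```csharp
using System;
using System.Threading.Tasks;
using DurableTask.Core;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

// Hypothetical message: the start of the hour to purge.
public class PurgeHourMessage
{
    public DateTime Hour { get; set; }
}

public static class RecursivePurge
{
    // Purges one hour of Completed history, then re-enqueues the next
    // hour until it catches up to the retention boundary (7 days ago).
    [FunctionName("RecursivePurgeWorker")]
    public static async Task Run(
        [QueueTrigger("purge-hours")] PurgeHourMessage message,
        [Queue("purge-hours")] IAsyncCollector<PurgeHourMessage> queue,
        [DurableClient] IDurableOrchestrationClient client)
    {
        await client.PurgeInstanceHistoryAsync(
            message.Hour,
            message.Hour.AddHours(1),
            new[] { OrchestrationStatus.Completed });

        DateTime next = message.Hour.AddHours(1);
        if (next < DateTime.UtcNow.AddDays(-7))
        {
            // Not caught up yet: process the following hour.
            await queue.AddAsync(new PurgeHourMessage { Hour = next });
        }
    }
}
```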

mpaul31 commented 3 years ago

Great, I'm glad it worked for you! Curious: how long did it take for your entire clean-up process to complete?

ahawes-clarity commented 3 years ago

A couple of hours, I think.

anel-al commented 3 years ago

Hi,

Would this approach work as a quick-and-dirty cleanup of a large backlog of items, so we have a clean start for regular timer-based cleanup? Is there anything I am missing?

  1. stop all Service Bus triggers and let the messages accumulate
  2. wait for all orchestrations to complete
  3. stop the functions application
  4. rename the task hub to something like nameXXX (see the host.json sketch below)
  5. start the application
  6. let the application recreate the new artifacts
  7. delete the old artifacts (such as the old tables)
  8. continue with regular purge API calls
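For step 4, renaming the task hub is a one-line host.json change (shown here in the Functions runtime 2.x settings layout); the old task hub's tables and queues remain until you delete them in step 7:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "nameXXX"
    }
  }
}
```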

Thank you

adeliab commented 3 years ago

Hi @bachuv, will the auto-purge functionality be part of the 28 May release? We were going to create a custom function for the purge, but if this will be released at the end of this month then we will wait :)

thanks

bachuv commented 3 years ago

Hi @adeliab, the Auto-Purge feature actually won't be part of the May 28th release (v2.5.0), but we are prioritizing this feature and it should be part of one of the next few releases.

extremilio commented 3 years ago

This might help someone who finds this thread while googling for a solution to IDurableOrchestrationClient.PurgeInstanceHistoryAsync timeouts when running it in Azure Functions, especially when trying to get rid of a large history backlog.

I guess this is due to Azure Table Storage not having proper transactions, but the method removes a lot of entries on every run before a FunctionTimeoutException is thrown. So in this case that's a feature.

So we simply let it time out for the first couple of runs until it had purged the bulk of the history, then ran it daily to remove new history continuously.

chetanmnit commented 3 years ago

When is this feature going to be released? If it's already available, please tell me how to set this configuration.

KelvinTegelaar commented 2 years ago

Just checking in on when this feature is going to be released. We've found that PowerShell durables leave gigabytes of data in the instance history, dramatically increasing costs.

bachuv commented 2 years ago

@KelvinTegelaar thank you for providing that feedback. We are currently prioritizing this feature, and it should be included in one of our upcoming releases, although no concrete date has been set.

AdrianTVB commented 2 years ago

Hey, just wondering if there has been any update for when this feature will be released?

cgillum commented 2 years ago

No updates to share at this time, unfortunately. The work still needs to be prioritized and scheduled.

es-alt commented 2 years ago

Any feasible workarounds? Our storage keeps growing, and our own attempts at deleting finished orchestrations fail because of timeouts.

cgillum commented 2 years ago

The timeouts issue should hopefully be resolved in our next extension release. In the meantime, you can do manual purging using smaller time windows, or implement single-instance purging: for example, query for instances in the completed state and then issue multiple purge operations on them individually, in parallel.
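A minimal sketch of that query-then-purge-individually approach (Durable Functions 2.x; the schedule, page size, and age cutoff are illustrative):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class ParallelPurge
{
    // Pages through Completed instances older than two weeks and purges
    // each one individually, in parallel, so no single call has to scan
    // a large time window.
    [FunctionName("ParallelPurge")]
    public static async Task Run(
        [TimerTrigger("0 0 * * * *")] TimerInfo timer,
        [DurableClient] IDurableOrchestrationClient client)
    {
        var condition = new OrchestrationStatusQueryCondition
        {
            RuntimeStatus = new[] { OrchestrationRuntimeStatus.Completed },
            CreatedTimeTo = DateTime.UtcNow.AddDays(-14),
            PageSize = 100,
        };

        do
        {
            OrchestrationStatusQueryResult page =
                await client.ListInstancesAsync(condition, CancellationToken.None);

            await Task.WhenAll(page.DurableOrchestrationState
                .Select(s => client.PurgeInstanceHistoryAsync(s.InstanceId)));

            condition.ContinuationToken = page.ContinuationToken;
        } while (!string.IsNullOrEmpty(condition.ContinuationToken));
    }
}
```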

LSNilis commented 2 years ago

Whether or not the following is feasible for you is up to you to decide; this is, however, how we work around the limitation. It may help or inspire you.

We noticed the Durable Framework offers two methods to purge orchestrations, the second of which purges just one instance at a time (screenshot of the two PurgeInstanceHistoryAsync overloads omitted).

With that in mind, we created a queue and added a QueueTrigger (screenshot omitted).

And an activity that adds the IDurableOrchestrationContext.InstanceId to said queue (screenshot omitted).

Right before the orchestration ends, we determine how long we want to keep it and pass that along when we call the activity. The default delay of one minute hard-coded in the activity is completely arbitrary.

In our case we have one main orchestration, which can start at most two sub-orchestrations, so it's not too bad to have each of those add itself to the purge queue. We handle around 5 million main orchestrations each month, all of which (including their sub-orchestrations) are purged when we no longer need them.

Before handling it this way, we had accrued a bit of a backlog. We got rid of it by determining which orchestrations needed to be purged, adding each of them to the purge queue, and letting the QueueTrigger plug away at it.
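Since the screenshots didn't survive, here is a hedged reconstruction of the described pattern, not LSNilis's actual code (the queue name and fixed one-minute delay are assumptions; the delay could instead be passed in as part of the activity input):

```csharp
using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class SelfPurge
{
    // Called by the orchestration right before it completes, e.g.:
    //   await context.CallActivityAsync("SchedulePurgeActivity", context.InstanceId);
    // The visibility timeout keeps the message invisible until the
    // retention period has elapsed (one minute here, as in the comment).
    [FunctionName("SchedulePurgeActivity")]
    public static async Task Schedule([ActivityTrigger] string instanceId)
    {
        var queue = new QueueClient(
            Environment.GetEnvironmentVariable("AzureWebJobsStorage"),
            "purge-instances",
            // Base64 so the WebJobs QueueTrigger can decode the message.
            new QueueClientOptions { MessageEncoding = QueueMessageEncoding.Base64 });
        await queue.CreateIfNotExistsAsync();
        await queue.SendMessageAsync(instanceId, visibilityTimeout: TimeSpan.FromMinutes(1));
    }

    // Purges exactly one instance per message, sidestepping the
    // time-window purge entirely.
    [FunctionName("PurgeInstanceWorker")]
    public static Task Purge(
        [QueueTrigger("purge-instances")] string instanceId,
        [DurableClient] IDurableOrchestrationClient client)
    {
        return client.PurgeInstanceHistoryAsync(instanceId);
    }
}
```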

SamVanhoutte commented 2 years ago

When is that extension release planned, @cgillum? Is there an issue I can track for this? Performance is becoming unmanageable in our production environment.

davidmrdavid commented 2 years ago

@SamVanhoutte: I'm actively working on a release right now. If everything goes well, a new release should be out before the end of the week, most likely earlier.

gunzip commented 2 years ago

Any news about this?

davidmrdavid commented 2 years ago

@amdeel: Is this the issue you were looking to assign to @nytiannn?

amdeel commented 2 years ago

@davidmrdavid Yes. If you have questions like this you can just talk to me offline.

Insadem commented 1 year ago

Any news?

sebastianburckhardt commented 1 year ago

BTW, if you are looking for a workaround until this feature is implemented, I posted some code snippets that show how to run a periodic timer function to purge completed orchestrations in this comment: https://github.com/microsoft/durabletask-netherite/issues/229#issuecomment-1452490498

gunzip commented 1 year ago

The problem with using date ranges to find deletable records is that it is extremely slow when you have a large number of entries in Azure Storage. A better (and probably faster) approach would be to record the orchestration instance IDs somewhere and selectively purge them by ID.

stevebus commented 3 months ago

Any news on this enhancement? I'm having trouble with the PurgeInstanceHistory API not acting as I expect it to (will open a separate issue on that), but man it would be nice to not have to do this ourselves...