mpaul31 opened this issue 5 years ago
Thanks for opening this issue! We'll look into this.
Is there any idea when this will be worked on? We currently have issues where manually purging on a timer hits the function execution time limit continuously. Having it auto-purged would be great for successful runs.
Edit: As I'll be removing my work account in favor of my main account, I can be contacted at @Archomeda from now on for further updates.
I've added this issue to be scheduled for our next release. I think it makes a lot of sense for us to have this feature.
In my use-case, I use a trigger on a 15 minute schedule to purge orchestrations once they have been in a completed runtime state for longer than 2 weeks. This allows me 2 weeks of diagnostic time to investigate any of my orchestrations in production.
However, I never delete orchestrations that are Running or Failed.
So I would suggest having configurable retention periods for each runtime state, ranging from a fixed age (like my two weeks for completed instances) to never purging at all.
Please also emit a customMetric that contains a count of how many orchestrations were purged, and their runtime state.
If we establish some process for periodic maintenance operations, it may make sense to also support automatic cleaning of entity storage (which currently requires an explicit API call, see #1442).
Implement an Automatic Storage Cleanup feature that runs on a set schedule to perform cleanup maintenance operations for Durable Functions storage. It will call PurgeInstanceHistoryAsync and CleanEntityStorageAsync to delete orchestration and entity history, but could be expanded in the future for other cleanup operations.
We could make the settings configurable in the users host.json. I included some potential default settings:
bool UseAutomaticStorageCleanup -- (default false)
TimeSpan CleanupTimespan -- frequency to run cleanup operations (default 1 day)
TimeSpan PurgeOrchestrationsOfAge -- setting to purge orchestration instances of at least this age (default 2 weeks)
IEnumerable
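For illustration, here is a rough sketch of how these proposed settings could surface in a user's host.json under the durableTask section. All of these keys are hypothetical (none exist in the extension today), and the last one is just a guess at what the IEnumerable setting above might cover:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "useAutomaticStorageCleanup": true,
      "cleanupTimespan": "1.00:00:00",
      "purgeOrchestrationsOfAge": "14.00:00:00",
      "runtimeStatusesToPurge": [ "Completed", "Terminated" ]
    }
  }
}
```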
Where do we implement this feature?
Durable extension
DurableTask.AzureStorage
Edit: Some offline discussions led us to decide to implement this feature in DT.AzureStorage.
@cgillum :)
It would be great to have this feature as part of the platform.
While we are waiting, what is current official recommendation for application with large amount of history data? Timer based function with call to delete history API?
Thank you
I have been trying to do the timer-based function for a while using the PurgeInstanceHistoryAsync method, but I run into Functions timeout issues. That's why this needs to be built into the platform so it just works.
https://github.com/Azure/azure-functions-durable-extension/issues/1145
I am still looking for a solution. All the while, the histories just keep adding up.
Depending upon how much data you store on average per hour/day, your timer trigger could push one or more messages to a storage queue with the start and end time period to purge. So if this timer trigger runs once per day, add one message per hour to the queue to distribute the workload. You can tweak the retry limit in the host.json if need be.
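A minimal sketch of that fan-out, assuming in-process C# Durable Functions and a storage queue named purge-windows (the queue name, the PurgeWindow type, the schedule, and the retention age are made up for illustration):

```csharp
using System;
using System.Threading.Tasks;
using DurableTask.Core;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public class PurgeWindow
{
    public DateTime From { get; set; }
    public DateTime To { get; set; }
}

public static class PurgeFanOut
{
    // Runs once per day and enqueues one purge window per hour for the day that just aged out.
    [FunctionName("EnqueuePurgeWindows")]
    public static void EnqueuePurgeWindows(
        [TimerTrigger("0 0 0 * * *")] TimerInfo timer,
        [Queue("purge-windows")] ICollector<PurgeWindow> queue)
    {
        DateTime day = DateTime.UtcNow.Date.AddDays(-14);
        for (int hour = 0; hour < 24; hour++)
        {
            queue.Add(new PurgeWindow { From = day.AddHours(hour), To = day.AddHours(hour + 1) });
        }
    }

    // Each queue message purges a single hour of history, keeping every invocation short.
    [FunctionName("PurgeWindowWorker")]
    public static Task PurgeWindowWorker(
        [QueueTrigger("purge-windows")] PurgeWindow window,
        [DurableClient] IDurableOrchestrationClient client)
    {
        return client.PurgeInstanceHistoryAsync(
            window.From,
            window.To,
            new[] { OrchestrationStatus.Completed });
    }
}
```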
@mpaul31 Thanks for the inspiration for putting a workaround in place. I did as you said and have my Timer trigger output to a Queue. I made the Queue function only call the PurgeInstanceHistoryAsync method with a 1-hour range. But I also made the Queue function recursive, so that it calls itself if the hour it was processing was earlier than a predetermined date (like 7 days ago). In the normal situation my Timer will be running once an hour and only 1 hour will be processed. But should something happen and I need to clean up more, I just manually inject a message in the queue for a date further back in time, and the Queue function will keep re-calling itself hour by hour until it gets caught back up to the predetermined date. I used this process to clean up the backlog I had.
But in the end, I really think something like this should be built into the platform.
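A sketch of that recursive queue function, assuming the message body is an ISO-8601 hour timestamp and a queue named purge-hours (both names are illustrative); an hourly timer simply enqueues the hour that has just aged past the cutoff:

```csharp
using System;
using System.Globalization;
using System.Threading.Tasks;
using DurableTask.Core;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class RecursivePurge
{
    [FunctionName("PurgeHour")]
    public static async Task PurgeHour(
        [QueueTrigger("purge-hours")] string hourMessage,
        [Queue("purge-hours")] IAsyncCollector<string> queue,
        [DurableClient] IDurableOrchestrationClient client)
    {
        DateTime hour = DateTime.Parse(
            hourMessage, CultureInfo.InvariantCulture, DateTimeStyles.RoundtripKind);

        // Purge just one hour of completed history per message.
        await client.PurgeInstanceHistoryAsync(
            hour, hour.AddHours(1), new[] { OrchestrationStatus.Completed });

        // If this hour is still behind the retention cutoff (7 days here), re-enqueue the
        // next hour so a backlog is worked off one hour at a time until it catches up.
        DateTime nextHour = hour.AddHours(1);
        if (nextHour < DateTime.UtcNow.AddDays(-7))
        {
            await queue.AddAsync(nextHour.ToString("o"));
        }
    }
}
```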
Great, I'm glad it worked for you! Curious, how long did it take for your entire clean-up process to complete?
A couple hours I think.
Hi,
Would this approach work as a quick and dirty cleanup of a large backlog of items, so we have a clean start with regular timer-based cleanup? Is there anything I am missing?
Thank you
Hi @bachuv, will the auto-purge functionality be part of the 28 May release? We were going to create a custom function for the purge, but if this will be released at the end of this month then we will wait :)
thanks
Hi @adeliab, the Auto-Purge feature actually won't be part of the May 28th release (v2.5.0), but we are prioritizing this feature and it should be part of one of the next few releases.
This might help someone who finds this thread while googling for a solution to IDurableOrchestrationClient.PurgeInstanceHistoryAsync timeouts when running it in Azure Functions, especially if trying to get rid of a lot of history backlog.
I guess this is due to Azure Table storage not having proper transactions, but it will remove a lot of entries every run before a FunctionTimeoutException is thrown. So that's a feature in this case.
So we simply let it time out for the first couple of runs until it had purged the bulk of the history, then we run it daily to remove new history continuously.
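For anyone looking for the simplest version of that daily purge, it can be a single timer-triggered function along these lines (the schedule, retention window, and statuses below are arbitrary):

```csharp
using System;
using System.Threading.Tasks;
using DurableTask.Core;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class DailyPurge
{
    // Purges everything that completed before the retention window, once per night.
    [FunctionName("DailyHistoryPurge")]
    public static Task Run(
        [TimerTrigger("0 0 3 * * *")] TimerInfo timer,
        [DurableClient] IDurableOrchestrationClient client)
    {
        return client.PurgeInstanceHistoryAsync(
            DateTime.MinValue,
            DateTime.UtcNow.AddDays(-14),
            new[] { OrchestrationStatus.Completed, OrchestrationStatus.Terminated });
    }
}
```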
When is this feature going to be released? If it's already available, please tell me how we can set this configuration.
Just checking in on when this feature is going to be released. We've found that PowerShell durables leave gigabytes of data in the instance history, dramatically increasing costs.
@KelvinTegelaar thank you for providing that feedback. We are currently prioritizing this feature and it should be included in one of our upcoming releases although no concrete date has been set.
Hey, just wondering if there has been any update on when this feature will be released?
No updates to share at this time, unfortunately. The work still needs to be prioritized and scheduled.
Any feasible workarounds? Our storage keeps on growing, and our own attempts at deleting the finished orchestrations fail because of timeouts.
The timeouts issue should hopefully be resolved in our next extension release. In the meantime, you can do manual purging using smaller time windows, or implement single instance purging. For example, you can query for instances that are in the completed state and then issue multiple purge operations on them individually in parallel.
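A sketch of that pattern using the instance query API (the schedule, page size, and cutoff below are arbitrary):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class SingleInstancePurge
{
    [FunctionName("PurgeCompletedInstances")]
    public static async Task Run(
        [TimerTrigger("0 0 * * * *")] TimerInfo timer,
        [DurableClient] IDurableOrchestrationClient client)
    {
        var condition = new OrchestrationStatusQueryCondition
        {
            RuntimeStatus = new[] { OrchestrationRuntimeStatus.Completed },
            CreatedTimeTo = DateTime.UtcNow.AddDays(-14),
            PageSize = 100,
        };

        do
        {
            // Query one page of completed instances, then purge them individually in parallel.
            OrchestrationStatusQueryResult page =
                await client.ListInstancesAsync(condition, CancellationToken.None);

            await Task.WhenAll(page.DurableOrchestrationState
                .Select(status => client.PurgeInstanceHistoryAsync(status.InstanceId)));

            condition.ContinuationToken = page.ContinuationToken;
        }
        while (condition.ContinuationToken != null);
    }
}
```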
Whether or not the following is feasible for you is totally up to you to decide; this is, however, how we work around the limitation. It may help or inspire you.
We noticed the Durable Framework offers two methods to purge orchestrations; the second one purges just one at a time:
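For reference, those two overloads on IDurableOrchestrationClient look roughly like this:

```csharp
// Purge history for every instance created in a time window with the given runtime statuses.
Task<PurgeHistoryResult> PurgeInstanceHistoryAsync(
    DateTime createdTimeFrom,
    DateTime? createdTimeTo,
    IEnumerable<OrchestrationStatus> runtimeStatus);

// Purge history for a single instance by its ID.
Task<PurgeHistoryResult> PurgeInstanceHistoryAsync(string instanceId);
```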
With that in mind we created a queue and then added a QueueTrigger:
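A minimal version of such a trigger might look like this (the queue name purge-instances is illustrative):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class PurgeQueueTrigger
{
    [FunctionName("PurgeInstance")]
    public static Task Run(
        [QueueTrigger("purge-instances")] string instanceId,
        [DurableClient] IDurableOrchestrationClient client)
    {
        // Each queue message carries a single orchestration instance ID to purge.
        return client.PurgeInstanceHistoryAsync(instanceId);
    }
}
```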
And an Activity to add the IDurableOrchestrationContext.InstanceId to said queue:
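A sketch of such an activity, assuming Azure.Storage.Queues and a made-up PurgeRequest input type:

```csharp
using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public class PurgeRequest
{
    public string InstanceId { get; set; }
    public TimeSpan? KeepFor { get; set; }
}

public static class SchedulePurgeActivity
{
    [FunctionName("SchedulePurge")]
    public static async Task Run([ActivityTrigger] PurgeRequest request)
    {
        // Uses the same storage account as the function app; the queue name is illustrative.
        var queue = new QueueClient(
            Environment.GetEnvironmentVariable("AzureWebJobsStorage"), "purge-instances");
        await queue.CreateIfNotExistsAsync();

        // The message stays invisible for the requested retention delay, so the queue
        // trigger above only purges the instance after that time has passed.
        await queue.SendMessageAsync(
            request.InstanceId,
            visibilityTimeout: request.KeepFor ?? TimeSpan.FromMinutes(1));
    }
}
```

One caveat worth checking: storage queue visibility timeouts are capped at around seven days, so a much longer retention window would need a different delay mechanism.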
Right before the orchestration ends, we determine how long we want to keep the orchestration and pass that along when we call the activity. The default delay of one minute hard coded within the activity is completely arbitrary.
In our case we have one main orchestration which at most can start up to two sub-orchestrations. So it's not too bad to have each of those add themselves to the purge-queue. We do handle around 5 million main orchestrations each month, all of which (including their sub-orchestrations) are purged when we no longer need them.
Before handling it this way, we accrued a bit of a backlog. We got rid of that by determining which orchestrations needed to be purged, then added each of those to the purge-queue and let the QueueTrigger go plugging away at it.
When is that extension release planned, @cgillum? Is there an issue that I can track for this? Performance is becoming unmanageable in our production environment.
@SamVanhoutte: I'm actively working on a release right now. If everything goes well, a new release should be out before the end of the week, most likely earlier.
Any news about this?
@amdeel: Is this the issue you were looking to assign to @nytiannn?
@davidmrdavid Yes. If you have questions like this you can just talk to me offline.
Any news?
BTW, if you are looking for a workaround until this feature is implemented, I posted some code snippets that show how to run a periodic timer function to purge completed orchestrations in this comment: https://github.com/microsoft/durabletask-netherite/issues/229#issuecomment-1452490498
The problem with using date ranges to find deletable records is that it is extremely slow when you have a large number of entries in Azure Storage. A better (and probably faster) method would be to save the orchestration instance identifiers in some way and selectively purge them using the ID.
Any news on this enhancement? I'm having trouble with the PurgeInstanceHistory API not acting as I expect it to (will open a separate issue on that), but man it would be nice to not have to do this ourselves...
** Let me know if this is not the right place to request a feature and I'll be happy to move.
It would be really nice to have a configuration setting in host.json to have orchestrations automatically clean themselves up, similar to how the PurgeInstanceHistoryAsync method works today. I'm sure there are many production applications in-flight as we speak that overlooked this and would rather have this happen in the background.
Thanks and keep up the great work on this awesome product!