fvigo opened this issue 5 years ago
This is not generally expected. Checking the status of a specific instance ID should normally be a sub-second operation because it's a very inexpensive and scalable lookup in Azure Table storage (looking up a single row using the row key and partition key). The number of instances should not affect this.
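To put it concretely, the instance record lives in the <TaskHub>Instances table with the instance ID as its partition key (and, if I recall correctly, an empty row key), so you can see the cost of that point lookup yourself with something like the Azure CLI (account and table names here are placeholders for your setup):
az storage entity show --account-name <storageAccount> --table-name DurableFunctionsHubInstances --partition-key <instanceId> --row-key ""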
I think what you'll need to do is try to narrow down where the latency is coming from. For example, is there any latency on the client side? Could there be a network issue? Could the function app handling the request be exhausting its CPU? Could it be the storage account you're using? Anything you can do to narrow it down via experimentation would be helpful.
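For example, curl can break the total time into DNS, connect, TLS handshake, and time-to-first-byte, which helps separate client/network latency from server-side processing ($SITE, $SYSTEMKEY, and the instance ID are placeholders):
curl -o /dev/null -s -w 'dns: %{time_namelookup}\nconnect: %{time_connect}\ntls: %{time_appconnect}\nfirst byte: %{time_starttransfer}\ntotal: %{time_total}\n' "https://$SITE/runtime/webhooks/durabletask/instances/<instanceId>?code=$SYSTEMKEY"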
Thanks @cgillum: I did some more testing and, with proper configuration, it takes 100~200 ms to get a result back when querying a specific instance, which is good news.
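For reference, the single-instance check is just a GET against the instance status endpoint, roughly along these lines ($SITE, $SYSTEMKEY, and the instance ID are placeholders):
curl -s "https://$SITE/runtime/webhooks/durabletask/instances/$INSTANCE_ID?code=$SYSTEMKEY"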
However, when doing a search, the query takes consistently between 10 and 60 seconds. This is an example of a search over a narrow (5-minute) time window:
$ curl -o /dev/null -w "@curl-format.txt" -s "https://$SITE/runtime/webhooks/durabletask/instances?runTimeStatus=Running,Pending&createdTimeFrom=2018-12-01T08:30:30.000Z&createdTimeTo=2018-12-01T08:35:00.000Z&code=$SYSTEMKEY"
Elapsed time(s):27.875
The DurableFunctionsHubInstances table contained ~6500 entries.
I tried to purge everything and, once the purge operation was complete, the search time went back to about 250ms.
Note that purging the table took almost one hour, and the number of instances reported (483) didn't match the number of rows originally present in the table (~6500; after the purge the table was empty):
PS C:\code> func durable purge-history
Purged orchestration history for all instances created between '1/1/0001 12:00:00 AM' and '12/30/9999 11:59:59 PM'
Instances deleted: 483
Rows deleted: 59435
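(As an aside, if the Core Tools version supports them, it looks like the purge can be scoped with time and status filters rather than deleting everything, e.g. something like:
func durable purge-history --created-after 2018-12-01T00:00:00Z --created-before 2018-12-02T00:00:00Z --runtime-status completed
though I haven't measured whether that changes the duration.)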
There appear to be no problems with the network or the storage account: is it expected for the purge operation to take so long? And for a search/scan across ~6500 entries to take up to a minute?
Thanks!
I do expect some latency for searching, because the query runs against Table storage, which doesn't handle these kinds of filtered queries very well. The same goes for purging, which is expected to be even slower. Exactly how slow, I'm not sure yet. I'll try to run a few tests to see if I get comparable numbers.
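To illustrate: the multi-instance query ends up as a filtered scan over the instances table rather than a key lookup. You can reproduce roughly the same scan with the Azure CLI if you want to isolate the storage side (table and column names below assume the default task hub; treat this as an approximation):
az storage entity query --account-name <storageAccount> --table-name DurableFunctionsHubInstances --filter "CreatedTime ge datetime'2018-12-01T08:30:30Z' and CreatedTime le datetime'2018-12-01T08:35:00Z'"
Since the filter isn't on PartitionKey/RowKey, Table storage has to walk the table, so the time grows with the total number of instances.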
Here are some tests I ran. In each case, the client was a Cloud Shell instance in Azure, and the function app was running on a large VM (4 cores):
Querying 1,000 orchestrations: 0.97 seconds
curl -o /dev/null -s -w 'Total: %{time_total}\n' "https://$SITE/admin/extensions/DurableTaskExtension/instances?code=$EXT_CODE"
Total: 0.968
Purging 1,000 orchestrations (each has 3 activity functions): 36.5 seconds
curl -o /dev/null -s -w 'Total: %{time_total}\n' -X DELETE "https://$SITE/tests/PurgeHistory?code=$PURGE_CODE"
Total: 36.500
Querying 10,000 orchestrations: 3.9 seconds
curl -o /dev/null -s -w 'Total: %{time_total}\n' "https://$SITE/admin/extensions/DurableTaskExtension/instances?taskHub=DFPerfHub&connection=Storage&code=$EXT_CODE"
Total: 3.897
Purging 10,000 instances: 230.5 seconds
curl -o /dev/null -s -w 'Total: %{time_total}\n' -X DELETE "https://$SITE/tests/PurgeHistory?code=$PURGE_CODE"
Total: 230.486
Actually, I think that last result (230 seconds) might not be accurate because the Azure App Service will terminate HTTP requests that take around this amount of time. But it definitely didn't take as long as an hour. I'll try again with the functions CLI when I get a chance.
That's definitely a big difference from my numbers. I also ran the query tests from an Azure VM. Storage latency didn't show any spikes, and it's a Standard storage account with no other activity on it.
I am running on a Consumption plan; could that be the issue? Is it recommended to run Durable Functions on such plans?
Hi,
I'm on v1.6.2 and have been using the HTTP API for two reasons:
So I either poll the status of a specific instanceId (1) or search for instances spawned in the last few seconds (2) that are in either the Pending or Running state.
The nature of this workflow requires me to get an answer quite quickly, so I set the HTTP client timeout to 1s. This worked well initially, but recently I'm experiencing timeouts (responses taking several seconds to come back) in both cases.
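In shell terms, case (1) boils down to something like this (a simplified sketch with placeholder names; the 1s budget is the --max-time):
while true; do
  status=$(curl -s --max-time 1 "https://$SITE/runtime/webhooks/durabletask/instances/$INSTANCE_ID?code=$SYSTEMKEY" | jq -r '.runtimeStatus')
  [ "$status" = "Completed" ] && break
  sleep 1
done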
Is this expected? What is a reasonable expectation for getting a reply from the HTTP API in either case? Does the number of previous runs influence this? If so, will purging the history help?
Thanks!