Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License

Stopping at Running status #324

Closed: yuka1984 closed this issue 6 years ago

yuka1984 commented 6 years ago

When a SubOrchestrator is called from an Orchestrator, the parent Orchestrator sometimes does not resume and stays stuck in the Running status even after the SubOrchestrator has finished. The symptom occurs rarely and at random when roughly 2000 Orchestrators are executed in parallel; I have only been able to reproduce it three times so far. My Durable Functions environment runs on the Consumption plan.
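For context, the parent/sub-orchestrator pattern being described looks roughly like the sketch below. This is a minimal illustration against the Durable Functions 1.x API, not the actual code of this app; the function names, inputs, and the "ProcessItem" activity are placeholders.

```csharp
// Minimal sketch of the parent/sub-orchestrator fan-out pattern described above.
// Function names, inputs, and the "ProcessItem" activity are placeholders.
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class OrchestratorSketch
{
    [FunctionName("ParentOrchestrator")]
    public static async Task<int> RunParent(
        [OrchestrationTrigger] DurableOrchestrationContext context)
    {
        // Fan out: start many sub-orchestrations in parallel.
        IReadOnlyList<string> workItems = context.GetInput<IReadOnlyList<string>>();
        var tasks = workItems
            .Select(item => context.CallSubOrchestratorAsync<bool>("SubOrchestrator", item))
            .ToList();

        // The parent stays in the Running state until every sub-orchestration
        // completes and its result message is delivered back to the parent.
        bool[] results = await Task.WhenAll(tasks);
        return results.Count(r => r);
    }

    [FunctionName("SubOrchestrator")]
    public static async Task<bool> RunSub(
        [OrchestrationTrigger] DurableOrchestrationContext context)
    {
        string item = context.GetInput<string>();
        return await context.CallActivityAsync<bool>("ProcessItem", item);
    }
}
```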

cgillum commented 6 years ago

Thanks for opening this issue - I'm sorry that you're running into problems. Let me try to repeat your explanation just to make sure I understand correctly:

  1. Parent orchestrator starts a SubOrchestrator
  2. SubOrchestrator completes (or is terminated?)
  3. Parent orchestrator never resumes - it stays in the Running state.

Is this correct?

Can you share the following information so I can investigate?

  1. The orchestration instance ID
  2. The Azure region your function app runs in
  3. The approximate time (UTC) when the problem occurred

Thanks!

yuka1984 commented 6 years ago

Thank you for the quick response, Chris.

Yes, it is correct.

Regarding "SubOrchestrator completes (or is terminated?)":

It seems that the SubOrchestrator did not end normally. The information written to the Instances table also differs from that of a SubOrchestrator that completed successfully.

Instance Id: 9b56a35cacd444df88914044c2d7c603 Region: Japan East Time: 2018-05-31T15:28:10.833Z

cgillum commented 6 years ago

Thank you. I have found trace data for your orchestration.

I can see that you have a parent orchestration LockManagementOrderAllocate and it tries to call a sub-orchestration named OrderAllocate. Is that correct?

We have "end-to-end" traces which show us what the orchestrator function is doing. I am seeing some strange things in these traces. I'm wondering whether there may be problems with the orchestrator function code. Would you be able to send me what your code looks like over email? I will send you my email address.

cgillum commented 6 years ago

I spent more time looking into this. It seems the OrderAllocate orchestration is getting into a stuck state and never completing. I have not yet determined why.

However, I also noticed something else odd. It appears that there are two different versions of the DurableTask.AzureStorage nuget package processing messages on the task hub named BatchStockSync578 using the same storage account. One appears to be the GA version and one appears to be a pre-GA version. The OrderAllocate sub-orchestration instance (ID = f042ee82308a44cb9675c93d73ed0ae5:1) started running on the GA version at 2018-05-31 15:28:21, but later started running on a non-GA version at 2018-05-31 15:33:45.9847693.

Do you have multiple versions of this function app running against the same task hub?

yuka1984 commented 6 years ago

Thank you Chris.

I find this log very mysterious, because the number "578" in the task hub name is the VSTS build number. I set up a VSTS pipeline that rewrites the TaskHubName in host.json before the build, so there should be only one deployed version using the task hub name "BatchStockSync578". However, the logs show otherwise. I will check whether any operation on my side could have caused this.

cgillum commented 6 years ago

Hi Yuka. Did you have a chance to investigate your VSTS pipeline?

I recently spoke to another customer who observed a similar deployment problem. Basically, they deployed a new version of the Durable Functions extension, but the update did not have any effect until they restarted the function app. I'm wondering if there is a bug in Azure Functions deployment which can cause this.

yuka1984 commented 6 years ago

Hello, Chris. I don't think there is any problem with the VSTS pipeline. I will set up the pipeline to restart the Function App at the end of each deployment.

While testing today, I again saw an Orchestrator remain in the Running state. All of this Orchestrator's SubOrchestrators have completed, but the Orchestrator itself never changes to Completed.

InstanceId: e44eb936554842588043f92b20bb171e Region: Japan East Time: 2018-06-05T09:43:37.468Z

cgillum commented 6 years ago

Thank you for confirming. I will investigate.

cgillum commented 6 years ago

Here are some notes I've taken for this investigation so far regarding e44eb936554842588043f92b20bb171e:

There are some problems with my tracing code, so I cannot get much more information than this. However, I think these "lost" messages might be stuck in memory. If you restart the function app, does your orchestration complete?

yuka1984 commented 6 years ago

Hi Chris. I tried restarting but there was no change in the state. The History of the new SubOrchestrator was not recorded either.

cgillum commented 6 years ago

Thank you for confirming. I am still actively investigating this.

One thing I noticed is that there was a failure in the Durable Task Framework to schedule some of your sub-orchestrators. It was a TimeoutException when attempting to add messages to the Azure Storage queue. The framework is supposed to handle this correctly, but I wonder if maybe there is a bug where we do not correctly retry. I do not yet know if this is the cause. I will keep investigating.
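To illustrate what "correctly retry" would mean here, the sketch below shows a generic retry-with-backoff around an enqueue call, using the WindowsAzure.Storage queue client that was current at the time. This is not the Durable Task Framework's internal code, just an illustration of the kind of handling the framework is expected to perform.

```csharp
// Illustrative only: a generic retry-with-backoff around adding a queue message,
// roughly the kind of handling the framework is expected to perform internally.
using System;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Queue;

public static class QueueRetrySketch
{
    public static async Task AddMessageWithRetryAsync(
        CloudQueue queue, CloudQueueMessage message, int maxAttempts = 5)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                await queue.AddMessageAsync(message);
                return;
            }
            catch (TimeoutException) when (attempt < maxAttempts)
            {
                // Back off exponentially before retrying the enqueue.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
            }
        }
    }
}
```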

cgillum commented 6 years ago

@yuka1984, do you know how many sub-orchestrations were supposed to have been scheduled? I mentioned 1578, but I wonder if that number is actually too high. What is the expected number?

yuka1984 commented 6 years ago

Thank you, Chris. In the current design it starts at around 1600 SubOrchestrators and will gradually increase. If restricting the degree of concurrency might solve the second problem (the SubOrchestrator never being executed), I will modify the Orchestrator accordingly.

How many SubOrchestrators do you think should be executed in parallel?

cgillum commented 6 years ago

For orchestration instance e44eb936554842588043f92b20bb171e, do you know the exact number of expected sub-orchestrators? Is 1600 the exact number, or is it just an estimate?

yuka1984 commented 6 years ago

The scheduled number of SubOrchestrator runs is 1578.

cgillum commented 6 years ago

Thanks.

Regarding "How many SubOrchestrators do you think should be executed in parallel?": I think 1600 is okay.

Do you still see this problem if you set partitionCount to 1 in host.json? Here is some documentation.
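For reference, the setting lives in the durableTask section of host.json. The sketch below assumes the Functions 1.x host.json schema, and the hub name shown is simply the one mentioned earlier in this thread:

```json
{
  "durableTask": {
    "hubName": "BatchStockSync578",
    "partitionCount": 1
  }
}
```

partitionCount controls how many control-queue partitions the task hub uses; setting it to 1 routes all orchestration messages through a single control queue, which can help isolate partition-related problems at the cost of throughput.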

yuka1984 commented 6 years ago

I tried "partitionCount to 1". As a result, "Non-Deterministic workflow detected: SubOrchestrationInstanceCreatedEvent: 1090 SubOrchestrationInstanceCreated LockManagementOrderAllocate a15c5c6018c24d04ac78b762d154a75f: 1090" We detected an error saying Orchestrator was Failed.

InstanceId: "eb4e53b94f9f4bdda30b4d56140244ce" Region: Japan East Time: 2018-06-08T08: 19: 21.111Z

cgillum commented 6 years ago

Very interesting. Thank you for this information.

Sorry that this investigation is going so slowly. Can you share with me an orchestration instance ID that completed successfully? I would like to compare the success case with the failed case.

yuka1984 commented 6 years ago

My response was delayed because the system that uses Durable Functions was being prepared for release; I'm sorry. Since my earlier comments on this issue, I have made some modifications to the Orchestrator. In the released version, the Orchestrator limits the SubOrchestrator fan-out to 100 at a time, and this approach is running stably. However, this method is not efficient, so I will continue to cooperate in resolving the issue I reported here.
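For readers following along, the throttled fan-out described above can be sketched roughly as follows; the function names and inputs are placeholders, not the actual app code.

```csharp
// Rough sketch of capping sub-orchestrator fan-out at 100 at a time.
// Function names and inputs are placeholders, not the actual app code.
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class ThrottledFanOutSketch
{
    private const int MaxConcurrent = 100;

    [FunctionName("ThrottledParentOrchestrator")]
    public static async Task RunParent(
        [OrchestrationTrigger] DurableOrchestrationContext context)
    {
        IReadOnlyList<string> workItems = context.GetInput<IReadOnlyList<string>>();

        // Process the work in batches so that at most 100 sub-orchestrations
        // are in flight at any one time.
        for (int i = 0; i < workItems.Count; i += MaxConcurrent)
        {
            var batch = workItems
                .Skip(i)
                .Take(MaxConcurrent)
                .Select(item => context.CallSubOrchestratorAsync("SubOrchestrator", item));

            await Task.WhenAll(batch);
        }
    }
}
```

This keeps the number of outstanding sub-orchestration messages small, at the cost of overall parallelism, which matches the trade-off noted above.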

For https://github.com/Azure/azure-functions-durable-extension/issues/324#issuecomment-394670326, here is an InstanceId that ran successfully:

InstanceId: 07356f8e689a4a6cadd3212cdea2e68b Region: Japan East Time: 2018-06-05T09:59:39.143Z

However, the number of SubOrchestrator executions for this InstanceId is very small. I do not have a successful InstanceId that ran under the same conditions as the one that failed.

Regarding https://github.com/Azure/azure-functions-durable-extension/issues/324#issuecomment-395706457:

There is currently no successful InstanceId for that case. However, there are successful runs in which partitionCount is not set (default: 4).

InstanceId: 4b2783f66b7b482c8eca0822c7f5d29c Region: Japan East Time: 2018-06-07T05:06:29.742Z

If you want, I can produce a successful InstanceId using an older version of the Orchestrator.

cgillum commented 6 years ago

I'm glad you were able to improve reliability. Thank you for letting me know. I would still like to continue investigating your original issue. I have added some tracing improvements to the Durable Functions v1.5.x release, which will be available later this week. Please try it when it is available so that I can collect more information. I will let you know when it can be downloaded.

yuka1984 commented 6 years ago

A case where the Orchestrator never reaches the Completed state has been reproduced during production operation.

InstanceId: 1d15631f139e44588b00e541be38961b Region: Japan East Time: 2018-06-22T01:17:58

The following message was written in the History of this instance:

Non-Deterministic workflow detected: SubOrchestrationInstanceCreatedEvent: 1485 SubOrchestrationInstanceCreated LockManagementOrderAllocate 1d7f24810b1340f99acf0a0d3501ddcf:1485

But this sub-orchestrator instance completed correctly.

cgillum commented 6 years ago

Thanks, the "Non-Deterministic workflow detected" message is interesting. I've been thinking about this, and I suspect this can possibly happen if we fail when writing messages to a queue. It might be related to this: https://github.com/Azure/azure-functions-durable-extension/issues/332

Yesterday we released the v1.5 update of Durable Functions. I added many new traces to help me investigate your issue. If you get a chance, can you try upgrading to this version and let me know if you see the problem again? That will make it easier for me to investigate.

cgillum commented 6 years ago

This issue has gone stale, so I will close it for now. However, please do try the v1.6.0 release which is available now. It contains many reliability improvements based on recent performance testing investigations. The internal tracing is also much improved, making it easier for us to investigate issues.