Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

Non-Deterministic workflow detected [...] ID 0 #2617

Open epa095 opened 1 year ago

epa095 commented 1 year ago

Description

My Python durable function runs for a varying amount of time (sometimes a few hours, sometimes a few days), but it always exits before it is done, with the error Non-Deterministic workflow detected: A previous execution of this orchestration scheduled an activity task with sequence ID 0 and name 'fetch_activity' (version ''), but the current replay execution hasn't (yet?) scheduled this task. Was a change made to the orchestrator code after this instance had already started running? I am reasonably sure that there is no nondeterminism in the orchestrator (see below), and it has not been redeployed while running. The failing sequence ID is also always 0.
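
For context, the runtime raises this error when the code executed during replay schedules different work than what was recorded in the orchestration history. A minimal example of an orchestrator that would legitimately trigger it (not the code from this issue, just an illustration of the failure mode the message describes) could look like this:

import random

import azure.durable_functions as df


def bad_orchestrator(context: df.DurableOrchestrationContext):
    # Non-deterministic: a replay may take the other branch and never schedule
    # 'fetch_activity', so the replay no longer matches the recorded history
    # and the framework reports the sequence ID 0 mismatch.
    if random.random() < 0.5:
        yield context.call_activity("fetch_activity", {"id": 1})
    return "done"


main = df.Orchestrator.create(bad_orchestrator)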

Relevant source code snippets

Here is the complete orchestrator:


import logging

import azure.durable_functions as df
import azure.functions as func

LOGGER = logging.getLogger(__name__)

def orchestrator_function(context: df.DurableOrchestrationContext):
    """Orchestrating fetching historical data in the given time period (from-to dates)."""
    retry_options = df.RetryOptions(
        first_retry_interval_in_milliseconds=60000, max_number_of_attempts=3
    )
    orchestration_input: dict = context.get_input()
    if not context.is_replaying:
        logging.info("Orchestration input: \n{orchestration_input}")
    ids = orchestration_input["ids"]
    from_tos = orchestration_input["from_tos"]
    batch_size = orchestration_input.get("batch_size", 50)

    results = []
    tasks = []
    ids_per_fetches = orchestration_input.get("ids_per_fetches", 50)
    for idindex in range(0, len(ids), ids_per_fetches):
        tasks.append(
            context.call_activity_with_retry(
                "fetch_activity",
                retry_options,
                {
                    "id": ids[idindex : idindex + ids_per_fetches],
                    "from_tos": from_tos,
                },
            )
        )
        if len(tasks) >= batch_size:
            res = yield context.task_all(tasks)
            results.extend(res)
            tasks = []
    if tasks:
        res = yield context.task_all(tasks)
        results.extend(res)
    return results

main = df.Orchestrator.create(orchestrator_function)
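
The activity side is not included here; for completeness, a minimal sketch of what fetch_activity might look like (only the name and the payload shape come from the orchestrator above, everything else is assumed) is:

import logging


def main(payload: dict) -> list:
    """Hypothetical fetch activity: pull data for a batch of ids over the given from/to windows."""
    ids = payload["id"]
    from_tos = payload["from_tos"]
    logging.info("Fetching %d ids over %d windows", len(ids), len(from_tos))
    # ... the actual data fetching would happen here ...
    return [{"id": id_, "windows": len(from_tos)} for id_ in ids]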

Known workarounds

No workarounds, only pain.

App Details

Related issues

https://github.com/Azure/azure-functions-durable-extension/issues/2563 maybe?

If deployed to Azure

We have access to a lot of telemetry that can help with investigations. Please provide as much of the following information as you can to help us investigate!

epa095 commented 1 year ago

Extra information: in many, but not all, of the cases the time between successive activity-function starts increases a lot. As an example, I have one running now where the fetchers are executed like this:

Scheduled                        Finished
2023-10-06T02:18:08.4208488Z     2023-10-06T02:18:51.8422224Z
2023-10-06T03:43:52.9777909Z     2023-10-06T03:44:29.6155922Z
2023-10-06T07:46:19.8764234Z     2023-10-06T07:47:13.7343616Z

Notice how each one finishes very quickly, but there are hours between when consecutive ones are started.
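
For reference, the gaps between those Scheduled timestamps work out to roughly 1.5 and 4 hours; a quick stdlib check (timestamps copied from the table above):

from datetime import datetime, timezone

scheduled = [
    "2023-10-06T02:18:08.4208488Z",
    "2023-10-06T03:43:52.9777909Z",
    "2023-10-06T07:46:19.8764234Z",
]
# Truncate to microsecond precision so fromisoformat accepts the values.
times = [datetime.fromisoformat(s[:26]).replace(tzinfo=timezone.utc) for s in scheduled]
for earlier, later in zip(times, times[1:]):
    print(later - earlier)  # ~1h25m and ~4h02m between consecutive starts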

epa095 commented 1 year ago

New info: I changed the extensionBundle version from "[3.15.0, 4.0.0)" to:

  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[4.*, 5.0.0)"
  },

It has now run for some days without crashing, so maybe that's a fix? It still takes hours between each activity function, though, so it has not actually completed that many operations :-p

FelixRauch10 commented 10 months ago

Any news on this?