Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

Non-Deterministic workflow detected [...] ID 0 #2617

Open epa095 opened 1 year ago

epa095 commented 1 year ago

Description

My Python durable function runs for a varying amount of time (sometimes a few hours, sometimes a few days), but it always exits before it is done, with the error Non-Deterministic workflow detected: A previous execution of this orchestration scheduled an activity task with sequence ID 0 and name 'fetch_activity' (version ''), but the current replay execution hasn't (yet?) scheduled this task. Was a change made to the orchestrator code after this instance had already started running? I am reasonably sure that there is no nondeterminism in the orchestrator (see below), and it has not been redeployed while running. The failing sequence ID is also always 0.
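
For context, the runtime raises this error when the code executed during replay schedules different work than what was recorded in the orchestration history. A minimal example of an orchestrator that would legitimately trigger it (not the code from this issue, just an illustration of the failure mode the message describes) could look like this:

import random

import azure.durable_functions as df


def bad_orchestrator(context: df.DurableOrchestrationContext):
    # Non-deterministic: a replay may take the other branch and never schedule
    # 'fetch_activity', so the replay no longer matches the recorded history
    # and the framework reports the sequence ID 0 mismatch.
    if random.random() < 0.5:
        yield context.call_activity("fetch_activity", {"id": 1})
    return "done"


main = df.Orchestrator.create(bad_orchestrator)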

Relevant source code snippets

Here is the complete orchestrator:


import logging

import azure.durable_functions as df
import azure.functions as func

LOGGER = logging.getLogger(__name__)

def orchestrator_function(context: df.DurableOrchestrationContext):
    """Orchestrating fetching historical data in the given time period (from-to dates)."""
    retry_options = df.RetryOptions(
        first_retry_interval_in_milliseconds=60000, max_number_of_attempts=3
    )
    orchestration_input: dict = context.get_input()
    if not context.is_replaying:
        logging.info("Orchestration input: \n{orchestration_input}")
    ids = orchestration_input["ids"]
    from_tos = orchestration_input["from_tos"]
    batch_size = orchestration_input.get("batch_size", 50)

    results = []
    tasks = []
    ids_per_fetches = orchestration_input.get("ids_per_fetches", 50)
    for idindex in range(0, len(ids), ids_per_fetches):
        tasks.append(
            context.call_activity_with_retry(
                "fetch_activity",
                retry_options,
                {
                    "id": ids[idindex : idindex + ids_per_fetches],
                    "from_tos": from_tos,
                },
            )
        )
        if len(tasks) >= batch_size:
            res = yield context.task_all(tasks)
            results.extend(res)
            tasks = []
    if tasks:
        res = yield context.task_all(tasks)
        results.extend(res)
    return results

main = df.Orchestrator.create(orchestrator_function)
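
The activity side is not included here; for completeness, a minimal sketch of what fetch_activity might look like (only the name and the payload shape come from the orchestrator above, everything else is assumed) is:

import logging


def main(payload: dict) -> list:
    """Hypothetical fetch activity: pull data for a batch of ids over the given from/to windows."""
    ids = payload["id"]
    from_tos = payload["from_tos"]
    logging.info("Fetching %d ids over %d windows", len(ids), len(from_tos))
    # ... the actual data fetching would happen here ...
    return [{"id": id_, "windows": len(from_tos)} for id_ in ids]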

Known workarounds

No workarounds, only pain.

App Details

Related issues

https://github.com/Azure/azure-functions-durable-extension/issues/2563 maybe?

If deployed to Azure

We have access to a lot of telemetry that can help with investigations. Please provide as much of the following information as you can to help us investigate!

epa095 commented 1 year ago

Extra information: in many, but not all, of the cases the time between successive activity-function starts increases a lot. As an example, I have one running now where the fetchers are executed like this:

Scheduled                        Finished
2023-10-06T02:18:08.4208488Z     2023-10-06T02:18:51.8422224Z
2023-10-06T03:43:52.9777909Z     2023-10-06T03:44:29.6155922Z
2023-10-06T07:46:19.8764234Z     2023-10-06T07:47:13.7343616Z

Notice how each one finishes very quickly, but there are hours between when consecutive ones are started.
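
For reference, the gaps between those Scheduled timestamps work out to roughly 1.5 and 4 hours; a quick stdlib check (timestamps copied from the table above):

from datetime import datetime, timezone

scheduled = [
    "2023-10-06T02:18:08.4208488Z",
    "2023-10-06T03:43:52.9777909Z",
    "2023-10-06T07:46:19.8764234Z",
]
# Truncate to microsecond precision so fromisoformat accepts the values.
times = [datetime.fromisoformat(s[:26]).replace(tzinfo=timezone.utc) for s in scheduled]
for earlier, later in zip(times, times[1:]):
    print(later - earlier)  # ~1h25m and ~4h02m between consecutive starts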

epa095 commented 1 year ago

New info: I changed the extensionBundle version from "[3.15.0, 4.0.0)" to:

  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[4.*, 5.0.0)"
  },

It has now run for some days without crashing, so maybe that's a fix? It still takes hours between each activity function, though, so it has not actually completed that many operations :-p

FelixRauch10 commented 10 months ago

Any news on this?