Azure / azure-functions-durable-python

Python library for using the Durable Functions bindings.
MIT License
134 stars 54 forks source link

Non-Deterministic workflow and Object reference not set #194

Closed janpiskur closed 3 years ago

janpiskur commented 4 years ago

Hi,

I write to you because I get the errors below and because it is hard to debug when the information is so sparse, so I hope you can help:

(1) Non-Deterministic workflow detected: TaskScheduledEvent: 0 TaskScheduled Hello (2) Object reference not set to an instance of an object.

I can’t tell you when these errors happen because they seem to happen so randomly. Currently, I get the “Object reference not set to an instance of an object” error after my script has run for 12 minutes or so.

Basically, I have a Python script (Python version 3.8) that loads gps data from a datalake into a pandas dataframe. Then an algorithm does some calculations on the gps data (NonMovingGPS class not included here). The script works fine if the load isn’t too high and if it runs locally as well. In other words, these errors usually show up if I increase the load. So far, I have mitigated the error by scaling up my app service plan, but I have reached a limit now. The error I get:

{
name: "DurableFunctionsOrchestrator",
instanceId: "7b066caec5a54a30869830b42348b2d1",
runtimeStatus: "Failed",
input: null,
customStatus: null,
output: "Object reference not set to an instance of an object.",
createdTime: "2020-09-14T13:11:27Z",
lastUpdatedTime: "2020-09-14T13:38:43Z"
}

My http start:

import logging

import azure.functions as func
import azure.durable_functions as df

async def main(req: func.HttpRequest, starter: str) -> func.HttpResponse:
    client = df.DurableOrchestrationClient(starter)
    instance_id = await client.start_new(req.route_params["functionName"], None, None)

    logging.info(f"Started orchestration with ID = '{instance_id}'.")

    return client.create_check_status_response(req, instance_id)

Functions.json that belongs to http start:

{
  "scriptFile": "__init__.py",
  "bindings": [
    {
      "authLevel": "anonymous",
      "name": "req",
      "type": "httpTrigger",
      "direction": "in",
      "route": "orchestrators/{functionName}",
      "methods": [
        "post",
        "get"
      ]
    },
    {
      "name": "$return",
      "type": "http",
      "direction": "out"
    },
    {
      "name": "starter",
      "type": "orchestrationClient",
      "direction": "in"
    }
  ]
}

My orchestrator:

import logging
import json

import azure.functions as func
import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    result1 = yield context.call_activity('Hello', "2019-09-01:2019-09-03")
    return [result1]

main = df.Orchestrator.create(orchestrator_function)

Functions.json that belongs to the orchestrator:

{
  "scriptFile": "__init__.py",
  "bindings": [
    {
      "name": "context",
      "type": "orchestrationTrigger",
      "direction": "in"
    }
  ]
}

My activity:

from azure.storage.filedatalake import DataLakeServiceClient
from datetime import date, timedelta
import pandas as pd
import os
import re
import logging
import datetime
from io import BytesIO

from .shared_code.class_non_moving_gps import NonMovingGPS

def get_dates_between(sdate, edate):

    sdate = datetime.datetime.strptime(sdate, '%Y-%m-%d')
    edate = datetime.datetime.strptime(edate, '%Y-%m-%d')

    delta = edate - sdate

    days_list = []
    for i in range(delta.days + 1):
        day = sdate + timedelta(days=i)
        days_list.append(day.strftime("%Y-%m-%d"))

    return days_list

def main(dates: str) -> str:
    token = dates.split(':')
    positive_list = get_dates_between(token[0], token[1])

    service_client = DataLakeServiceClient.from_connection_string(conn_str="SECRET", file_system_name="rps")
    file_system_client = service_client.get_file_system_client(file_system="rps")
    paths = file_system_client.get_paths(path="vehicle-position/")

    paths_str = ''
    final_df = pd.DataFrame(columns=['SentUtcTime','Latitude','Longitude','VehicleId'])
    counter = 0
    for path in paths:
        p = re.compile(r'^vehicle-position\/(\d{4})\/(\d{6})\/(\d{8})')
        m = p.match(str(os.path.dirname(path.name)))
        if m:
            year = m.group(1)
            month = m.group(2)[4:6]
            date = m.group(3)[6:8]
            date_str = year + '-' + month + '-' + date
            if date_str in positive_list:
                paths_str = paths_str + path.name + '\n'

                file_client = service_client.get_file_client(file_system="rps", file_path=path.name)
                download = file_client.download_file()
                content = download.readall()
                stream = BytesIO(content)
                df_temp = pd.read_csv(stream, compression='gzip')
                logging.info(f"{path.name}: {df_temp.shape}")

                df_temp = df_temp[df_temp['Latitude'].notna() & df_temp['Longitude'].notna() & df_temp['VehicleId'].notna() & df_temp['SentUtcTime'].notna()][['SentUtcTime','Latitude','Longitude','VehicleId']].reset_index(drop=True)
                df_temp['VehicleId'] = df_temp['VehicleId'].astype('int32')
                df_temp['SentUtcTime'] = pd.to_datetime(df_temp['SentUtcTime'])

                final_df = final_df.append(df_temp, ignore_index=True)

                counter = counter + 1
        if counter > 0:
            pass#break

    unique_vehicles = list(set(final_df.VehicleId.unique()))

    df_spots = pd.DataFrame(columns=['ParkDateTime', 'WakeupDateTime', 'DiffMin', 'Latitude', 'Longitude', 'VehicleNumber'])
    for vehicle in unique_vehicles:
        non_moving_gps = NonMovingGPS()
        tmp = non_moving_gps.loop_single_vehicle(final_df, vehicle)
        df_spots.append(tmp, ignore_index=True)
    return f"Hi {dates}! Here are your paths: {df_spots}"

Functions.json that belongs to the activity:

{
  "scriptFile": "__init__.py",
  "bindings": [
    {
      "name": "dates",
      "type": "activityTrigger",
      "direction": "in"
    }
  ]
}

My host.json:

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "MyGarageTaskHub"
    }
  },
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes": "Request"
      }
    }
  },
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[1.*, 2.0.0)"
  },
  "functionTimeout": "-1"
}
davidmrdavid commented 4 years ago

Hi @janpiskur ,

Thank you so much for reaching out! I'm sorry you're having this issue. Unfortunately, there are two errors getting compounded in here. The first one is whatever logic flaw is causing your exception, and the second is that this error is getting swallowed and instead we're seeing this generic exception of "Object reference not set". The second error will make it rather difficult to fix the first one.

I just made a quick PR to fix the second error here and should be releasing a new Python Durable version just with it. If you could give us a few more days to make that release, I will ask you to re-run this reproducer with the newer version and to give us the more informative error that should propagate from there.

I'll get back to you as soon as the new release is out, which I'm hoping will be this Friday. Thank you for your patience!!

⚡ ⚡

davidmrdavid commented 4 years ago

Hi @janpiskur!

We just released Durable Functions for Python v1.0.0b9 and that should improve our error reporting capabilities. Do you mind reproducing this error with it and letting us know what new stack trace you see? Thank you!!

janpiskur commented 4 years ago

Hi @davidmrdavid

Thanks for your fast reply.

I get the error below (it also crashes the function afterwards):

{
name: "DurableFunctionsOrchestrator",
instanceId: "9b79908abde8465e92831f6c4ee8c797",
runtimeStatus: "Failed",
input: null,
customStatus: null,
output: "Non-Deterministic workflow detected: TaskScheduledEvent: 0 TaskScheduled Hello",
createdTime: "2020-09-21T05:49:23Z",
lastUpdatedTime: "2020-09-21T06:14:50Z"
}

Best regards, Jan

davidmrdavid commented 4 years ago

Hi @janpiskur ,

Thank you for the update. A quick follow-up: when you say you're "increasing the load" in order to trigger this error, what exactly does that mean here? How are you increasing it?

janpiskur commented 4 years ago

Hi @davidmrdavid,

When I increase the load, I read more data by widening the date interval given to the activity. So, if I pass the date interval “2019-09-01:2019-09-03” as a parameter to the activity inside the orchestrator, I read data from September 1 to September 3. Then I can widen this interval (increase the load) to for instance “2019-09-01:2019-09-05”, so it reads data from September 1 to September 5. Practically speaking, this means it’s going to read more files from the azure data lake. Does it make sense?

davidmrdavid commented 4 years ago

@janpiskur, that does make sense. I think I'll need to dive into your Azure logs to better understand what could be going on here. For that, do you mind providing me with the following information?

Timeframe when issue is observed in UTC: Function App name: Function name(s): Orchestration function: Region: Orchestration instance ID(s):

davidmrdavid commented 4 years ago

@janpiskur , I have one more favor to ask of you.

A Non-Deterministic workflow, for your orchestrator, most likely means that the input string you're providing (“2019-09-01:2019-09-05") is somehow not the same as the input string that the activity received. Would you mind logging in the activity the input string you received and double-checking that they indeed look the same? I wonder if they're somehow being changed over the wire

janpiskur commented 4 years ago

Hi @davidmrdavid,

I created a new function (bi-func-rps-garage-dev01) from scratch because the previous one crashed for some reason. Here is all the info:

Timeframe issue observed in UTC: From **2020-09-23 09:10:00** to **2020-09-23 09:33:00**
Function App name: bi-func-rps-garage-dev01
Function name(s): Hello (, DurableFunctionsHttpStart, DurableFunctionsOrchestrator)
Orchestration function: DurableFunctionsOrchestrator
Region: West Europe
Orchestration instance ID(s): 3c7cb2f7fb84407cb8d69ec6a5dbcd05

The orchestrator returns the following JSON:

{
id: "3c7cb2f7fb84407cb8d69ec6a5dbcd05",
statusQueryGetUri: "http://bi-func-rps-garage-dev01.azurewebsites.net/runtime/webhooks/durabletask/instances/3c7cb2f7fb84407cb8d69ec6a5dbcd05?taskHub=MyGarageTaskHub&connection=Storage&code=iyg2v6ZTHfw6XZmfAUUMYkSZnasIllvjw21HiafbiIT40RX3tr0N0w==",
sendEventPostUri: "http://bi-func-rps-garage-dev01.azurewebsites.net/runtime/webhooks/durabletask/instances/3c7cb2f7fb84407cb8d69ec6a5dbcd05/raiseEvent/{eventName}?taskHub=MyGarageTaskHub&connection=Storage&code=iyg2v6ZTHfw6XZmfAUUMYkSZnasIllvjw21HiafbiIT40RX3tr0N0w==",
terminatePostUri: "http://bi-func-rps-garage-dev01.azurewebsites.net/runtime/webhooks/durabletask/instances/3c7cb2f7fb84407cb8d69ec6a5dbcd05/terminate?reason={text}&taskHub=MyGarageTaskHub&connection=Storage&code=iyg2v6ZTHfw6XZmfAUUMYkSZnasIllvjw21HiafbiIT40RX3tr0N0w==",
rewindPostUri: "http://bi-func-rps-garage-dev01.azurewebsites.net/runtime/webhooks/durabletask/instances/3c7cb2f7fb84407cb8d69ec6a5dbcd05/rewind?reason={text}&taskHub=MyGarageTaskHub&connection=Storage&code=iyg2v6ZTHfw6XZmfAUUMYkSZnasIllvjw21HiafbiIT40RX3tr0N0w==",
purgeHistoryDeleteUri: "http://bi-func-rps-garage-dev01.azurewebsites.net/runtime/webhooks/durabletask/instances/3c7cb2f7fb84407cb8d69ec6a5dbcd05?taskHub=MyGarageTaskHub&connection=Storage&code=iyg2v6ZTHfw6XZmfAUUMYkSZnasIllvjw21HiafbiIT40RX3tr0N0w=="
}

The statusQueryGetUri gives the following JSON with a much longer stack trace this time:

{
name: "DurableFunctionsOrchestrator",
instanceId: "3c7cb2f7fb84407cb8d69ec6a5dbcd05",
runtimeStatus: "Failed",
input: null,
customStatus: null,
output: "Orchestrator function 'DurableFunctionsOrchestrator' failed: Activity function 'Hello' failed: OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /home/site/wwwroot/.python_packages/lib/site-packages/certifi/cacert.pem {"$type":"System.Exception, System.Private.CoreLib","ClassName":"System.Exception","Message":" OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /home/site/wwwroot/.python_packages/lib/site-packages/certifi/cacert.pem","Data":null,"InnerException":{"$type":"Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcException, Microsoft.Azure.WebJobs.Script","StackTrace":" at Microsoft.Azure.WebJobs.Script.Description.WorkerFunctionInvoker.InvokeCore(Object[] parameters, FunctionInvocationContext context) in /src/azure-functions-host/src/WebJobs.Script/Description/Workers/WorkerFunctionInvoker.cs:line 93\n at Microsoft.Azure.WebJobs.Script.Description.FunctionInvokerBase.Invoke(Object[] parameters) in /src/azure-functions-host/src/WebJobs.Script/Description/FunctionInvokerBase.cs:line 86\n at Microsoft.Azure.WebJobs.Script.Description.FunctionGenerator.Coerce[T](Task`1 src) in /src/azure-functions-host/src/WebJobs.Script/Description/FunctionGenerator.cs:line 225\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.InvokeAsync(Object instance, Object[] arguments) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionInvoker.cs:line 52\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.InvokeAsync(IFunctionInvoker invoker, ParameterHelper parameterHelper, CancellationTokenSource timeoutTokenSource, CancellationTokenSource functionCancellationTokenSource, Boolean throwOnTimeout, TimeSpan timerInterval, IFunctionInstance instance) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 591\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 537\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, IFunctionOutputDefinition outputDefinition, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 483\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 290","Message":"Result: Failure\nException: OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /home/site/wwwroot/.python_packages/lib/site-packages/certifi/cacert.pem\nStack: File \"/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/dispatcher.py\", line 343, in _handle__invocation_request\n call_result = await self._loop.run_in_executor(\n File \"/usr/local/lib/python3.8/concurrent/futures/thread.py\", line 57, in run\n result = self.fn(*self.args, **self.kwargs)\n File \"/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/dispatcher.py\", line 480, in __run_sync_func\n return func(**params)\n File \"/home/site/wwwroot/Hello/__init__.py\", line 58, in main\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/filedatalake/_data_lake_file_client.py\", line 569, in download_file\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/tracing/decorator.py\", line 83, in wrapper_use_tracer\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/blob/_blob_client.py\", line 662, in download_blob\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/blob/_download.py\", line 316, in __init__\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/blob/_download.py\", line 360, in _initial_request\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/blob/_generated/operations/_blob_operations.py\", line 174, in download\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 211, in run\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 71, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 71, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 71, in send\n [Previous line repeated 5 more times]\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/policies/_redirect.py\", line 157, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 71, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/filedatalake/_shared/policies.py\", line 489, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 71, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/filedatalake/_shared/policies.py\", line 290, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 71, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 71, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 103, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/transport/_requests_basic.py\", line 259, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/requests/sessions.py\", line 530, in request\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/requests/sessions.py\", line 643, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/requests/adapters.py\", line 416, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/requests/adapters.py\", line 227, in cert_verify\n","Data":{"$type":"System.Collections.ListDictionaryInternal, System.Private.CoreLib"},"InnerException":null,"HelpLink":null,"Source":"System.Private.CoreLib","HResult":-2146233088},"HelpURL":null,"StackTraceString":null,"RemoteStackTraceString":null,"RemoteStackIndex":0,"ExceptionMethod":null,"HResult":-2146233088,"Source":null,"WatsonBuckets":null}",
createdTime: "2020-09-23T09:12:19Z",
lastUpdatedTime: "2020-09-23T09:31:15Z"
}

I noticed the error:

Could not find a suitable TLS CA certificate bundle

Best regards, Jan

davidmrdavid commented 4 years ago

Hi @janpiskur ,

Just wanted to check-in again. We're still investigating this issue. Two quick follow-ups: (1) Did you manage to get around this error? (2) When you were getting this error, where you using the WEBSITE_RUN_FROM_PACKAGE setting?

janpiskur commented 4 years ago

Hi @davidmrdavid ,

Thanks for investigating this issue.

(1) No. (2) I am not quite sure what this setting does and I haven't actively pursued it, so my answer would most probably be no.

janpiskur commented 3 years ago

Hi @davidmrdavid ,

Just wanted to hear if there is any news or status update?

Best regards, Jan

davidmrdavid commented 3 years ago

Hi @janpiskur ,

I'm sorry this has taken a bit. I've been following-up internally and we may have identified other customers with the same problem. I can't share all the details right now, but now I'm hoping that the fact that others are seeing this will speed-up the investigation. Unfortunately this seems to be a problem at the platform-level so it's a bit outside my domain. Regardless, I'll get back to you as soon as I have updates to share, but I'm looking at this on the daily!

davidmrdavid commented 3 years ago

Hi @janpiskur ,

I've learned a bit more about this issue and it appears that a few of our linux customers have been hitting a filesystem mounting problem, one that's preventing your application from finding this certificate even though it is in the expected location.

We're currently working on a patch but, in the meantime, we appear to have a mitigation we can try. Before deploying to azure, is there any chance you could copy that cacert.pem to the root of your application and attempt the solution described in this SO question: https://stackoverflow.com/questions/46119901/python-requests-cant-find-a-folder-with-a-certificate-when-converted-to-exe/47699138#47699138 ?

The way I'd try it is by writing

import sys, os

def override_where():
    """ overrides certifi.core.where to return actual location of cacert.pem"""
    # change this to match the location of cacert.pem
    return os.path.abspath("cacert.pem") # or to whatever location you know contains the copy of cacert.pem

 import certifi.core

os.environ["REQUESTS_CA_BUNDLE"] = override_where()
certifi.core.where = override_where

on top of your activity and orchestrator code files. You should be able to obtain a copy of cacert.pem from the same directory as where certifi is installted in your system. If you do a pip install certifi then your terminal should give you that directory.

Please let me know if this mitigates your issue. I'm sorry this has been a blocker on our end.

janpiskur commented 3 years ago

Hi @davidmrdavid ,

I've tried what you suggested. Now I get the following error:

{
name: "DurableFunctionsOrchestrator",
instanceId: "2a70f26b625141fab23e41ddd3fd9955",
runtimeStatus: "Failed",
input: null,
customStatus: null,
output: "Orchestrator function 'DurableFunctionsOrchestrator' failed: Activity function 'Hello' failed: python exited with code 135 {"$type":"Microsoft.Azure.WebJobs.Script.Workers.WorkerProcessExitException, Microsoft.Azure.WebJobs.Script","StackTrace":" at Microsoft.Azure.WebJobs.Script.Description.WorkerFunctionInvoker.InvokeCore(Object[] parameters, FunctionInvocationContext context) in /src/azure-functions-host/src/WebJobs.Script/Description/Workers/WorkerFunctionInvoker.cs:line 93\n at Microsoft.Azure.WebJobs.Script.Description.FunctionInvokerBase.Invoke(Object[] parameters) in /src/azure-functions-host/src/WebJobs.Script/Description/FunctionInvokerBase.cs:line 83\n at Microsoft.Azure.WebJobs.Script.Description.FunctionGenerator.Coerce[T](Task`1 src) in /src/azure-functions-host/src/WebJobs.Script/Description/FunctionGenerator.cs:line 225\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.InvokeAsync(Object instance, Object[] arguments) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionInvoker.cs:line 52\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.InvokeAsync(IFunctionInvoker invoker, ParameterHelper parameterHelper, CancellationTokenSource timeoutTokenSource, CancellationTokenSource functionCancellationTokenSource, Boolean throwOnTimeout, TimeSpan timerInterval, IFunctionInstance instance) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 541\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 486\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, IFunctionOutputDefinition outputDefinition, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 432\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 239","Message":"python exited with code 135\n ","Data":{"$type":"System.Collections.ListDictionaryInternal, System.Private.CoreLib"},"InnerException":null,"HelpLink":null,"Source":"System.Private.CoreLib","HResult":-2146233088}",
createdTime: "2020-10-20T12:29:04Z",
lastUpdatedTime: "2020-10-20T13:14:39Z"
}
davidmrdavid commented 3 years ago

Hmm this appears like a python-worker error. @Hazhzeng , @vrdmr : Just FYI that this is the thread I'm following. The suggested mitigation of providing a specific path to the cacert.pem does not appear to work in this case. Folks, does the trace in the comment above sound familiar to y'all? This user has been stuck without being able to run orchestrations for weeks now.

In the meantime @janpiskur , and for the sake of getting you unstuck, would deploying to a different app name / different resource mitigate this issue? That would be good to try, as I understand that this issue is only affecting a subset of apps

davidmrdavid commented 3 years ago

@janpiskur , another update: we believe this issue may be affecting VNET-enabled applications. Does your app have VNET enabled?

janpiskur commented 3 years ago

@davidmrdavid With regards to VNET: No, not as far as I know.

davidmrdavid commented 3 years ago

@janpiskur ,

Thanks for letting me know. I was trying to verify that internally but I couldn't access the latest dashboards for you application: bi-func-rps-garage-dev01. Did you delete the app by any chance? If not, could you try triggering this error again, without the cacert.pem fix, just so we get a new set of refreshed logs? It would also help us if you could provide the timerange for that new execution.

davidmrdavid commented 3 years ago

Hi @janpiskur , just a friendly ping on my message above and it's questions. We're still actively investigating this!

janpiskur commented 3 years ago

Hi @davidmrdavid

Sorry, I haven't replied. I am quite busy with other project right now.

Yes, I deleted the project because it was not really beeing used and it was just costing money.

I have created a new project now (bi-func-rps-garage.azurewebsites.net) by starting from scratch, but now nothing seems to work. For instance, I tried to follow the guide below with a simple hello world example (https://bi-func-rps-garage.azurewebsites.net/api/orchestrators/GarageOrchestrator), but the function just seems to run forever without returning a result:

Example: http://bi-func-rps-garage.azurewebsites.net/runtime/webhooks/durabletask/instances/9d30ad421fb9493aa62f036a1ee91833?taskHub=DurableFunctionsHub&connection=Storage&code=Beti28rdDClIEnDQCN4t65mFBmMZ8JrT/qAP7HYg3cv7naRXsryW8Q==

Tutorial: https://docs.microsoft.com/da-dk/azure/azure-functions/durable/quickstart-python-vscode

Best regards, Jan

davidmrdavid commented 3 years ago

Hi @janpiskur,

Thank you for your repro. Using it, we did find a bug in Azure App Service where a race-condition was causing your app to fail to mount. The fix has been merged and should be going out in a few weeks.

I did find another exception in your app that I haven't been able to fully explain. It reads "NameError: name 'pathlib' is not defined". The platform engineers want to know if your app is running locally still, as this looks like a missing dependency problem from our logs. If it is running locally without problem, then they'll investigate it further.

davidmrdavid commented 3 years ago

Hi @janpiskur ,

Just checking-in again. As I mentioned in my comment above, we believe you were hitting a bug in Azure App Service that has been merged and should already be available in some regions. The deployment will finalize in a few more days.

Since I know you left us a live app to help debug this, I wanted to give you to option to get a refund for those executions. Please consider opening a ticket here (https://azure.microsoft.com/en-us/support/create-ticket/) and, if you do, please let me know (feel free to send me any ticket identifying info to dajusto@microsoft.com) and I'll make sure it gets routed to the right people. In the meantime, I'll close this issue since the patch is currently being deployed.

I'm sorry this took so long to diagnose, please reach out if we can be of help again.