Closed janpiskur closed 3 years ago
Hi @janpiskur ,
Thank you so much for reaching out! I'm sorry you're having this issue. Unfortunately, there are two errors being compounded here. The first is whatever logic flaw is causing your exception, and the second is that this error is getting swallowed and replaced by the generic "Object reference not set" exception. The second error makes it rather difficult to diagnose the first.
I just made a quick PR to fix the second error, and we should be releasing a new Python Durable Functions version containing just that fix. If you could give us a few more days to make that release, I'll ask you to re-run your reproducer with the newer version and share the more informative error that should propagate from there.
I'll get back to you as soon as the new release is out, which I'm hoping will be this Friday. Thank you for your patience!!
Hi @janpiskur!
We just released Durable Functions for Python v1.0.0b9, which should improve our error-reporting capabilities. Do you mind reproducing this error with it and letting us know what new stack trace you see? Thank you!!
Hi @davidmrdavid
Thanks for your fast reply.
I get the error below (it also crashes the function afterwards):
{
name: "DurableFunctionsOrchestrator",
instanceId: "9b79908abde8465e92831f6c4ee8c797",
runtimeStatus: "Failed",
input: null,
customStatus: null,
output: "Non-Deterministic workflow detected: TaskScheduledEvent: 0 TaskScheduled Hello",
createdTime: "2020-09-21T05:49:23Z",
lastUpdatedTime: "2020-09-21T06:14:50Z"
}
Best regards, Jan
Hi @janpiskur ,
Thank you for the update. A quick follow-up: when you say you're "increasing the load" in order to trigger this error, what exactly does that mean here? How are you increasing it?
Hi @davidmrdavid,
When I increase the load, I read more data by widening the date interval given to the activity. So, if I pass the date interval “2019-09-01:2019-09-03” as a parameter to the activity inside the orchestrator, I read data from September 1 to September 3. I can then widen this interval (increase the load) to, for instance, “2019-09-01:2019-09-05”, so it reads data from September 1 to September 5. Practically speaking, this means it reads more files from the Azure Data Lake. Does that make sense?
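For illustration, here is roughly how such an interval string expands into the days it covers (`parse_interval` is just a hypothetical helper for this explanation, not my actual code):

```python
from datetime import date, timedelta

def parse_interval(interval: str) -> list:
    """Expand a 'YYYY-MM-DD:YYYY-MM-DD' string into the dates it covers."""
    start_s, end_s = interval.split(":")
    start = date.fromisoformat(start_s)
    end = date.fromisoformat(end_s)
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Widening the interval means more days, hence more files read from the lake:
print(len(parse_interval("2019-09-01:2019-09-03")))  # 3 days
print(len(parse_interval("2019-09-01:2019-09-05")))  # 5 days
```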
@janpiskur, that does make sense. I think I'll need to dive into your Azure logs to better understand what could be going on here. For that, do you mind providing me with the following information?
- Timeframe when issue is observed (UTC):
- Function App name:
- Function name(s):
- Orchestration function:
- Region:
- Orchestration instance ID(s):
@janpiskur , I have one more favor to ask of you.
A Non-Deterministic workflow error, for your orchestrator, most likely means that the input string you're providing (“2019-09-01:2019-09-05”) is somehow not the same as the input string the activity received. Would you mind logging the input string inside the activity and double-checking that the two values indeed look the same? I wonder if they're somehow being changed over the wire.
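A minimal sketch of what that logging could look like, assuming a simple activity signature (`main` is the usual Python activity entry point; the digest is only there to make byte-level comparison easy):

```python
import hashlib
import logging

def main(interval: str) -> str:
    # Log the raw repr plus a short digest so the orchestrator-side and
    # activity-side values can be compared exactly, not just by eye.
    digest = hashlib.sha256(interval.encode("utf-8")).hexdigest()[:12]
    logging.info("Activity received %r (sha256 %s)", interval, digest)
    return interval
```

Logging the same digest right before the `call_activity` call in the orchestrator would let you diff the two values directly.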
Hi @davidmrdavid,
I created a new function (bi-func-rps-garage-dev01) from scratch because the previous one crashed for some reason. Here is all the info:
Timeframe issue observed in UTC: From **2020-09-23 09:10:00** to **2020-09-23 09:33:00**
Function App name: bi-func-rps-garage-dev01
Function name(s): Hello (, DurableFunctionsHttpStart, DurableFunctionsOrchestrator)
Orchestration function: DurableFunctionsOrchestrator
Region: West Europe
Orchestration instance ID(s): 3c7cb2f7fb84407cb8d69ec6a5dbcd05
The orchestrator returns the following JSON:
{
id: "3c7cb2f7fb84407cb8d69ec6a5dbcd05",
statusQueryGetUri: "http://bi-func-rps-garage-dev01.azurewebsites.net/runtime/webhooks/durabletask/instances/3c7cb2f7fb84407cb8d69ec6a5dbcd05?taskHub=MyGarageTaskHub&connection=Storage&code=iyg2v6ZTHfw6XZmfAUUMYkSZnasIllvjw21HiafbiIT40RX3tr0N0w==",
sendEventPostUri: "http://bi-func-rps-garage-dev01.azurewebsites.net/runtime/webhooks/durabletask/instances/3c7cb2f7fb84407cb8d69ec6a5dbcd05/raiseEvent/{eventName}?taskHub=MyGarageTaskHub&connection=Storage&code=iyg2v6ZTHfw6XZmfAUUMYkSZnasIllvjw21HiafbiIT40RX3tr0N0w==",
terminatePostUri: "http://bi-func-rps-garage-dev01.azurewebsites.net/runtime/webhooks/durabletask/instances/3c7cb2f7fb84407cb8d69ec6a5dbcd05/terminate?reason={text}&taskHub=MyGarageTaskHub&connection=Storage&code=iyg2v6ZTHfw6XZmfAUUMYkSZnasIllvjw21HiafbiIT40RX3tr0N0w==",
rewindPostUri: "http://bi-func-rps-garage-dev01.azurewebsites.net/runtime/webhooks/durabletask/instances/3c7cb2f7fb84407cb8d69ec6a5dbcd05/rewind?reason={text}&taskHub=MyGarageTaskHub&connection=Storage&code=iyg2v6ZTHfw6XZmfAUUMYkSZnasIllvjw21HiafbiIT40RX3tr0N0w==",
purgeHistoryDeleteUri: "http://bi-func-rps-garage-dev01.azurewebsites.net/runtime/webhooks/durabletask/instances/3c7cb2f7fb84407cb8d69ec6a5dbcd05?taskHub=MyGarageTaskHub&connection=Storage&code=iyg2v6ZTHfw6XZmfAUUMYkSZnasIllvjw21HiafbiIT40RX3tr0N0w=="
}
The statusQueryGetUri gives the following JSON with a much longer stack trace this time:
{
name: "DurableFunctionsOrchestrator",
instanceId: "3c7cb2f7fb84407cb8d69ec6a5dbcd05",
runtimeStatus: "Failed",
input: null,
customStatus: null,
output: "Orchestrator function 'DurableFunctionsOrchestrator' failed: Activity function 'Hello' failed: OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /home/site/wwwroot/.python_packages/lib/site-packages/certifi/cacert.pem {"$type":"System.Exception, System.Private.CoreLib","ClassName":"System.Exception","Message":" OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /home/site/wwwroot/.python_packages/lib/site-packages/certifi/cacert.pem","Data":null,"InnerException":{"$type":"Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcException, Microsoft.Azure.WebJobs.Script","StackTrace":" at Microsoft.Azure.WebJobs.Script.Description.WorkerFunctionInvoker.InvokeCore(Object[] parameters, FunctionInvocationContext context) in /src/azure-functions-host/src/WebJobs.Script/Description/Workers/WorkerFunctionInvoker.cs:line 93\n at Microsoft.Azure.WebJobs.Script.Description.FunctionInvokerBase.Invoke(Object[] parameters) in /src/azure-functions-host/src/WebJobs.Script/Description/FunctionInvokerBase.cs:line 86\n at Microsoft.Azure.WebJobs.Script.Description.FunctionGenerator.Coerce[T](Task`1 src) in /src/azure-functions-host/src/WebJobs.Script/Description/FunctionGenerator.cs:line 225\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.InvokeAsync(Object instance, Object[] arguments) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionInvoker.cs:line 52\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.InvokeAsync(IFunctionInvoker invoker, ParameterHelper parameterHelper, CancellationTokenSource timeoutTokenSource, CancellationTokenSource functionCancellationTokenSource, Boolean throwOnTimeout, TimeSpan timerInterval, IFunctionInstance instance) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 591\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx 
instance, ParameterHelper parameterHelper, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 537\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, IFunctionOutputDefinition outputDefinition, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 483\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 290","Message":"Result: Failure\nException: OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /home/site/wwwroot/.python_packages/lib/site-packages/certifi/cacert.pem\nStack: File \"/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/dispatcher.py\", line 343, in _handle__invocation_request\n call_result = await self._loop.run_in_executor(\n File \"/usr/local/lib/python3.8/concurrent/futures/thread.py\", line 57, in run\n result = self.fn(*self.args, **self.kwargs)\n File \"/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/dispatcher.py\", line 480, in __run_sync_func\n return func(**params)\n File \"/home/site/wwwroot/Hello/__init__.py\", line 58, in main\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/filedatalake/_data_lake_file_client.py\", line 569, in download_file\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/tracing/decorator.py\", line 83, in 
wrapper_use_tracer\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/blob/_blob_client.py\", line 662, in download_blob\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/blob/_download.py\", line 316, in __init__\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/blob/_download.py\", line 360, in _initial_request\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/blob/_generated/operations/_blob_operations.py\", line 174, in download\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 211, in run\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 71, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 71, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 71, in send\n [Previous line repeated 5 more times]\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/policies/_redirect.py\", line 157, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 71, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/filedatalake/_shared/policies.py\", line 489, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 71, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/storage/filedatalake/_shared/policies.py\", line 290, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 71, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 71, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/_base.py\", line 103, in send\n File 
\"/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/pipeline/transport/_requests_basic.py\", line 259, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/requests/sessions.py\", line 530, in request\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/requests/sessions.py\", line 643, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/requests/adapters.py\", line 416, in send\n File \"/home/site/wwwroot/.python_packages/lib/site-packages/requests/adapters.py\", line 227, in cert_verify\n","Data":{"$type":"System.Collections.ListDictionaryInternal, System.Private.CoreLib"},"InnerException":null,"HelpLink":null,"Source":"System.Private.CoreLib","HResult":-2146233088},"HelpURL":null,"StackTraceString":null,"RemoteStackTraceString":null,"RemoteStackIndex":0,"ExceptionMethod":null,"HResult":-2146233088,"Source":null,"WatsonBuckets":null}",
createdTime: "2020-09-23T09:12:19Z",
lastUpdatedTime: "2020-09-23T09:31:15Z"
}
I noticed the error:
Could not find a suitable TLS CA certificate bundle
Best regards, Jan
Hi @janpiskur ,
Just wanted to check in again. We're still investigating this issue. Two quick follow-ups:
(1) Did you manage to get around this error?
(2) When you were getting this error, were you using the WEBSITE_RUN_FROM_PACKAGE setting?
Hi @davidmrdavid ,
Thanks for investigating this issue.
(1) No. (2) I am not quite sure what this setting does and I haven't actively pursued it, so my answer would most probably be no.
Hi @davidmrdavid ,
Just wanted to hear if there is any news or status update?
Best regards, Jan
Hi @janpiskur ,
I'm sorry this has taken a while. I've been following up internally, and we may have identified other customers with the same problem. I can't share all the details right now, but I'm hoping that the fact that others are seeing this will speed up the investigation. Unfortunately, this seems to be a problem at the platform level, so it's a bit outside my domain. Regardless, I'll get back to you as soon as I have updates to share; I'm looking at this daily!
Hi @janpiskur ,
I've learned a bit more about this issue, and it appears that a few of our Linux customers have been hitting a filesystem-mounting problem, one that prevents your application from finding this certificate even though it is in the expected location.
We're currently working on a patch but, in the meantime, there is a mitigation we can try. Before deploying to Azure, is there any chance you could copy that cacert.pem to the root of your application and attempt the solution described in this SO answer: https://stackoverflow.com/questions/46119901/python-requests-cant-find-a-folder-with-a-certificate-when-converted-to-exe/47699138#47699138 ?
The way I'd try it is by adding the following at the top of your activity and orchestrator code files:

```python
import os
import certifi.core

def override_where():
    """Overrides certifi.core.where to return the actual location of cacert.pem."""
    # Change this to match the location of your copy of cacert.pem.
    return os.path.abspath("cacert.pem")

os.environ["REQUESTS_CA_BUNDLE"] = override_where()
certifi.core.where = override_where
```

You should be able to obtain a copy of cacert.pem from the directory where certifi is installed on your system; if you do a `pip install certifi`, your terminal should give you that directory.
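For reference, this is how I'd locate that file on a local machine (the exact path will vary with your Python environment):

```shell
# Install certifi (it is usually already present as a requests dependency)
pip install certifi
# Print the absolute path of the bundled cacert.pem
python -c "import certifi; print(certifi.where())"
```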
Please let me know if this mitigates your issue. I'm sorry this has been a blocker on our end.
Hi @davidmrdavid ,
I've tried what you suggested. Now I get the following error:
{
name: "DurableFunctionsOrchestrator",
instanceId: "2a70f26b625141fab23e41ddd3fd9955",
runtimeStatus: "Failed",
input: null,
customStatus: null,
output: "Orchestrator function 'DurableFunctionsOrchestrator' failed: Activity function 'Hello' failed: python exited with code 135 {"$type":"Microsoft.Azure.WebJobs.Script.Workers.WorkerProcessExitException, Microsoft.Azure.WebJobs.Script","StackTrace":" at Microsoft.Azure.WebJobs.Script.Description.WorkerFunctionInvoker.InvokeCore(Object[] parameters, FunctionInvocationContext context) in /src/azure-functions-host/src/WebJobs.Script/Description/Workers/WorkerFunctionInvoker.cs:line 93\n at Microsoft.Azure.WebJobs.Script.Description.FunctionInvokerBase.Invoke(Object[] parameters) in /src/azure-functions-host/src/WebJobs.Script/Description/FunctionInvokerBase.cs:line 83\n at Microsoft.Azure.WebJobs.Script.Description.FunctionGenerator.Coerce[T](Task`1 src) in /src/azure-functions-host/src/WebJobs.Script/Description/FunctionGenerator.cs:line 225\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.InvokeAsync(Object instance, Object[] arguments) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionInvoker.cs:line 52\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.InvokeAsync(IFunctionInvoker invoker, ParameterHelper parameterHelper, CancellationTokenSource timeoutTokenSource, CancellationTokenSource functionCancellationTokenSource, Boolean throwOnTimeout, TimeSpan timerInterval, IFunctionInstance instance) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 541\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 486\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, 
IFunctionOutputDefinition outputDefinition, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 432\n at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 239","Message":"python exited with code 135\n ","Data":{"$type":"System.Collections.ListDictionaryInternal, System.Private.CoreLib"},"InnerException":null,"HelpLink":null,"Source":"System.Private.CoreLib","HResult":-2146233088}",
createdTime: "2020-10-20T12:29:04Z",
lastUpdatedTime: "2020-10-20T13:14:39Z"
}
Hmm, this looks like a Python-worker error.
@Hazhzeng , @vrdmr : Just FYI that this is the thread I'm following. The suggested mitigation of providing an explicit path to the cacert.pem does not appear to work in this case. Folks, does the trace in the comment above sound familiar to y'all? This user has been stuck, unable to run orchestrations, for weeks now.
In the meantime @janpiskur , and for the sake of getting you unstuck, would deploying to a different app name / different resource mitigate this issue? That would be good to try, as I understand this issue only affects a subset of apps.
@janpiskur , another update: we believe this issue may be affecting VNET-enabled applications. Does your app have VNET enabled?
@davidmrdavid With regards to VNET: No, not as far as I know.
@janpiskur ,
Thanks for letting me know. I was trying to verify that internally, but I couldn't access the latest dashboards for your application: bi-func-rps-garage-dev01. Did you delete the app by any chance? If not, could you try triggering this error again, without the cacert.pem fix, just so we get a fresh set of logs? It would also help us if you could provide the time range for that new execution.
Hi @janpiskur , just a friendly ping on my message above and its questions. We're still actively investigating this!
Hi @davidmrdavid
Sorry I haven't replied; I am quite busy with another project right now.
Yes, I deleted the project because it was not really being used and was just costing money.
I have created a new project now (bi-func-rps-garage.azurewebsites.net) by starting from scratch, but now nothing seems to work. For instance, I tried to follow the guide below with a simple hello-world example (https://bi-func-rps-garage.azurewebsites.net/api/orchestrators/GarageOrchestrator), but the function just seems to run forever without returning a result:
Tutorial: https://docs.microsoft.com/da-dk/azure/azure-functions/durable/quickstart-python-vscode
Best regards, Jan
Hi @janpiskur,
Thank you for your repro. Using it, we did find a bug in Azure App Service where a race condition was causing your app to fail to mount. The fix has been merged and should roll out in a few weeks.
I did find another exception in your app that I haven't been able to fully explain. It reads "NameError: name 'pathlib' is not defined". The platform engineers want to know if your app still runs locally, as from our logs this looks like a missing-dependency problem. If it runs locally without issue, they'll investigate it further.
Hi @janpiskur ,
Just checking in again. As I mentioned in my comment above, we believe you were hitting a bug in Azure App Service; the fix has been merged and should already be available in some regions. The deployment will finalize in a few more days.
Since I know you left us a live app to help debug this, I wanted to give you the option to get a refund for those executions. Please consider opening a ticket here (https://azure.microsoft.com/en-us/support/create-ticket/) and, if you do, please let me know (feel free to send me any ticket-identifying info at dajusto@microsoft.com) and I'll make sure it gets routed to the right people. In the meantime, I'll close this issue since the patch is currently being deployed.
I'm sorry this took so long to diagnose, please reach out if we can be of help again.
Hi,
I'm writing because I'm getting the errors below, and it's hard to debug when the information is this sparse, so I hope you can help:
(1) Non-Deterministic workflow detected: TaskScheduledEvent: 0 TaskScheduled Hello
(2) Object reference not set to an instance of an object.
I can't tell you exactly when these errors happen because they seem to occur randomly. Currently, I get the “Object reference not set to an instance of an object” error after my script has run for about 12 minutes.
Basically, I have a Python script (Python 3.8) that loads GPS data from a data lake into a pandas DataFrame. Then an algorithm does some calculations on the GPS data (NonMovingGPS class not included here). The script works fine if the load isn't too high, and it also runs fine locally. In other words, these errors usually show up when I increase the load. So far, I have mitigated the error by scaling up my App Service plan, but I have now reached a limit. The error I get:
My http start:
Functions.json that belongs to http start:
My orchestrator:
Functions.json that belongs to the orchestrator:
My activity:
Functions.json that belongs to the activity:
My host.json: