Charles-Gagnon closed this issue 9 months ago
@RohitRanjanMS Since you were assigned, checking to see if you are picking up this issue and/or actively working on it.
Discussed offline, @RohitRanjanMS is handling this issue.
I believe we may also be encountering this issue, or something similar at least.
We have a function that, in addition to the default logs produced by the worker, makes the following logs:
We have a dashboard using these logs and have found that we were consistently showing fewer "function completed" logs than expected (up to 1% of them are missing).
I've tried to reproduce this locally and haven't managed to make the logs go missing entirely, but I have found the following. Where I might have expected to see this:
[2023-06-14T15:57:31.722Z] Executing 'Functions.myFunction' (Reason='This function was programmatically called via the host APIs.', Id=eaf11d4a-1722-475f-9818-8965d90b37c1)
[2023-06-14T15:57:31.725Z] {"name":"handler.received","invocationId":"eaf11d4a-1722-475f-9818-8965d90b37c1"}
[2023-06-14T15:57:31.728Z] {"name":"http.request.end.success","invocationId":"eaf11d4a-1722-475f-9818-8965d90b37c1"}
[2023-06-14T15:57:31.728Z] {"name":"handler.completed","invocationId":"eaf11d4a-1722-475f-9818-8965d90b37c1"}
[2023-06-14T15:57:31.728Z] Executed 'Functions.myFunction' (Succeeded, Id=eaf11d4a-1722-475f-9818-8965d90b37c1, Duration=6ms)
Occasionally they appear as follows instead:
[2023-06-14T15:57:31.722Z] Executing 'Functions.myFunction' (Reason='This function was programmatically called via the host APIs.', Id=eaf11d4a-1722-475f-9818-8965d90b37c1)
[2023-06-14T15:57:31.725Z] {"name":"handler.received","invocationId":"eaf11d4a-1722-475f-9818-8965d90b37c1"}
[2023-06-14T15:57:31.728Z] {"name":"http.request.end.success","invocationId":"eaf11d4a-1722-475f-9818-8965d90b37c1"}[2023-06-14T15:57:31.728Z] {"name":"handler.completed","invocationId":"eaf11d4a-1722-475f-9818-8965d90b37c1"}
[2023-06-14T15:57:31.728Z] Executed 'Functions.myFunction' (Succeeded, Id=eaf11d4a-1722-475f-9818-8965d90b37c1, Duration=6ms)
I'm not sure whether this is just something that happens when running locally or whether it's representative of what would happen in the actual function. I'm also not sure whether this would cause the log that was appended onto the single line to be lost altogether - I'd expect to still be able to find it all appended together as one log statement, but I can't.
Does this seem like it could be related? Could you suggest any further debugging steps I could take?
Version info: locally I'm using Azure Functions Core Tools 4.0.5198, which seems to depend on host 4.21.1. On Azure, the runtime version is 4.21.3.20740.
Working on this with @RohitRanjanMS, we believe we have identified a race condition in the host that would explain this issue, escalating to host team to confirm.
The race condition is as follows: the Log() method in GrpcWorkerChannel.cs depends on the ScriptInvocationContext being present in _executingInvocations, here: https://github.com/Azure/azure-functions-host/blob/8caea0f53dba65ea6651d2e0a4fd766cbd9786ba/src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs#L1063C1-L1064C1
However, InvokeResponse can remove the context from this collection concurrently, here: https://github.com/Azure/azure-functions-host/blob/8caea0f53dba65ea6651d2e0a4fd766cbd9786ba/src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs#L962
This leads to situations where the if condition in Log() is not met, and the logs from the worker are not forwarded, either to the console or to Application Insights.
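To illustrate the shape of the race (the real code is C# in GrpcWorkerChannel.cs; this is a simplified, single-threaded simulation with illustrative names, not the host's implementation):

```javascript
// Tracks in-flight invocations, standing in for _executingInvocations.
const executingInvocations = new Map();

// Mirrors the host's Log(): a worker log is only forwarded while the
// invocation is still tracked; otherwise it is silently dropped.
function handleWorkerLog(invocationId, message, forwarded) {
  if (executingInvocations.has(invocationId)) {
    forwarded.push(message); // would go to console / App Insights
  }
}

// Mirrors InvokeResponse(): removes the context when the invocation completes.
function handleInvokeResponse(invocationId) {
  executingInvocations.delete(invocationId);
}

const forwarded = [];
executingInvocations.set('inv-1', {});
handleWorkerLog('inv-1', 'handler.received', forwarded); // forwarded
handleInvokeResponse('inv-1');                           // response "wins" the race
handleWorkerLog('inv-1', 'handler.completed', forwarded); // dropped
```

In the real host the log message and the invocation response arrive on separate gRPC streams, so when the response is processed first, any trailing log for that invocation fails the lookup and is lost.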
@fabiocav
@brettsam @fabiocav Has there been additional work on this? We still see this occasionally - lately with the Java worker.
@Charles-Gagnon -- yes, we know what the underlying issue is. I'm working on a fix that will likely roll out very slowly (not on by default, although I'll ping here when something is available to test).
It's in the hot path for function invocations, so we want to make sure we're taking everything into consideration as we improve this.
Just wondering if there is any update on this. It is hard to troubleshoot errors and 204 No Content responses from the Microsoft Graph API when I cannot rely on context.log(some_value) to print out relevant values in the correct order, or at all - sometimes, for example, the function shows as 'executed' in the terminal before the logs are printed. This makes me question whether there is something wrong with the async responses I am awaiting before the function completes.
We have a Function App triggered by Service Bus messages and developed in C#. Despite applying the following settings in the host.json file, we are still encountering issues with missing log entries in Azure Application Insights, especially during high message volumes in the Service Bus.
{
  "version": "2.0",
  "logging": {
    "applicationInsights": {
      "samplingSettings": { "isEnabled": false },
      "enableLiveMetricsFilters": true
    }
  }
}
To address this concern, we have opened a support ticket (#2311140050002741). Any updates or insights on resolving this issue would be greatly appreciated.
Thank you.
I apologize for how slow this PR has gone -- but it's a pretty critical change in a very hot path. Coupled with slow releases and hotfixes over the last couple of months, we've been waiting for some other features to stabilize to prevent any bigger issues. The plan is to get this into our first release in the new year.
For those using dotnet-isolated, I'd highly recommend moving to use our Functions Worker Application Insights package. This shifts logging to go directly to App Insights rather than through the host, removing this issue of ordering completely.
See the readme above and also some other comments I've made for more info there:
Hello Team,
Although the mentioned issue has been marked as closed, it appears that the problem with missing log entries in function app persists. Kindly review my comment on this issue, which I posted last December 2023, and provide an update on the status.
Thank you.
Hi @samir080890, your issue may be unrelated to this one. Please try modifying the maxTelemetryItemsPerSecond sampling setting and see if it helps. The code for this fix is merged, but the roll-out takes about two weeks.
This change has now been (mostly) rolled out to Windows production environments with version 4.30. However, it is not enabled by default just yet. We will be enabling this across different regions and ensuring there are no regressions over the next couple of weeks.
However, if you are experiencing issues with missing logs from workers (and/or out-of-order logs) and want to try it out sooner, you can add an App Setting of AzureWebJobsFeatureFlags with a value of EnableOrderedInvocationMessages. This will restart your app, and it should begin using the newer logic to keep log messages in order.
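For example, the setting could be applied with the Azure CLI (the app and resource-group names here are placeholders):

```shell
# Add the feature flag to a deployed Function App; this triggers a restart.
az functionapp config appsettings set \
  --name <app-name> \
  --resource-group <resource-group> \
  --settings AzureWebJobsFeatureFlags=EnableOrderedInvocationMessages
```
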
@brettsam I'm still seeing the same ordering issue, albeit far less frequently.
I posted a question about the ordering scenario on Stack Overflow a while back (https://stackoverflow.com/questions/75716633/azure-function-context-log-not-logging-properly-not-synchronous), along with a repository where you can easily reproduce the ordering issue.
Currently I only have repros for this using Functions Core Tools - I'm working on verifying whether this also happens for deployed functions with logs sent to App Insights.
We have a simple function with a SQL trigger that we use in our tests, e.g. https://github.com/Azure/azure-functions-sql-extension/blob/release/trigger/samples/samples-js/ProductsTrigger/index.js
This function just outputs a log message, which the test then watches for to verify that the data sent to the function was correct. We do this by hooking into the Process's OutputDataReceived event and looking at all the incoming messages.
https://github.com/Azure/azure-functions-sql-extension/blob/release/trigger/test/Integration/SqlTriggerBindingIntegrationTestBase.cs#L127
But occasionally we just never get that log message. We'll get other messages - such as "Function executed" - but the one our function logs just doesn't appear. We wait up to 10 seconds for it, after which we give up.
Here's an example of the output logs we see with a missing message. It's a bit verbose, but you can see that there are 5 instances of the Executed 'Functions.ProductsTriggerWithValidation' message, but only 4 of the SQL Changes: message.
We also see 5 pairs of these messages, which seems to imply that the host did get the log message, but something happened to it after that.