Azure / Industrial-IoT

Azure Industrial IoT Platform
MIT License
521 stars 214 forks source link

OPC Publisher 2.8.2: Could not send worker heartbeat - eventually crashing and not restarting #1701

Closed gregeva closed 2 years ago

gregeva commented 2 years ago

Describe the bug Repeatedly receiving the error message "Could not send worker heartbeat. System.NullReferenceException: Object reference not set to an instance of an object." from the OPC Publisher (version 2.8.2).

[16:31:47 ERR Microsoft.Azure.IIoT.Agent.Framework.Agent.Worker+JobProcess] Could not send worker heartbeat. 
System.NullReferenceException: Object reference not set to an instance of an object.
   at Microsoft.Azure.IIoT.OpcUa.Edge.Publisher.Engine.WriterGroupMessageTrigger.get_AuthenticationUsername() in D:\a\1\s\components\opc-ua\src\Microsoft.Azure.IIoT.OpcUa.Edge.Publisher\src\Engine\WriterGroupMessageSource.cs:line 61
   at Microsoft.Azure.IIoT.OpcUa.Edge.Publisher.Engine.DataFlowProcessingEngine.GetDiagnosticInfo() in D:\a\1\s\components\opc-ua\src\Microsoft.Azure.IIoT.OpcUa.Edge.Publisher\src\Engine\DataFlowProcessingEngine.cs:line 171
   at Microsoft.Azure.IIoT.Agent.Framework.Agent.Worker.JobProcess.GetProcessDiagnosticInfo() in D:\a\1\s\common\src\Microsoft.Azure.IIoT.Agent.Framework\src\Agent\Default\Worker.cs:line 374
   at Microsoft.Azure.IIoT.Agent.Framework.Agent.Worker.JobProcess.SendHeartbeatAsync(CancellationToken ct) in D:\a\1\s\common\src\Microsoft.Azure.IIoT.Agent.Framework\src\Agent\Default\Worker.cs:line 465
   at Microsoft.Azure.IIoT.Agent.Framework.Agent.Worker.JobProcess.OnHeartbeatTimerAsync() in D:\a\1\s\common\src\Microsoft.Azure.IIoT.Agent.Framework\src\Agent\Default\Worker.cs:line 514

To Reproduce Steps to reproduce the behavior:

  1. I had subscribed to around 55 new OPC tags with a 500ms poll rate
  2. Updated changes received in the publisher and started scanning the OPC UA server
  3. Started receiving published updates in VS Code event hub monitoring
  4. OPC Publisher logs started throwing the worker heartbeat error
  5. Publisher module crashes after a few minutes
  6. Restarting the IoT Edge host does not restore the publisher module
  7. Running iotedge start publisher re-spawns the module, however it then crashes after a few minutes

Expected behavior I'd expect that:

Application Installation

gregeva commented 2 years ago

Subsequent testing has revealed that rolling back to version 2.8.0 using Helm chart 0.4.0 made this issue dissapear.

cristipogacean commented 2 years ago

@gregeva, thanks for the feedback! this seem to be a regression in 2.8.2 then. Could you give us some details on the job configuration stored in cosmosdb that causes this behavior? Especially on the opc ua connection authentication settings. How did you create the job configuration? was this something that was created with 2.8.2 using the engineering tool? or something different?

I think I have an idea on the issue on the publisher side, so the details above would speed up the investigation.

gregeva commented 2 years ago

You're welcome @cristipogacean . Here is what you're after... let me know what you find jobconfiguration.zip .

cristipogacean commented 2 years ago

@gregeva, thanks, this confirmed my original assumption. I just created the PR #1702 that should fix the issue.

It seems that the following section in your job config was causing this regression:

"user": {
   "type": "None"
}

In our job configuration the above section is left out completely, since it reflects the default setting: Anonymous Authentication. You can temporarily try to remove it or tune the API call to leave it out until 2.8.3 is released.

gregeva commented 2 years ago

Understood. Thanks for making quick work of this @cristipogacean !

I cannot change the API calls as this would require R&D and QA on our side which is not budgeted. I'll inform the customer and we'll hold back with 2.8.0 until 2.8.3 is released.

marcschier commented 2 years ago

In 2.8.3