Azure / Industrial-IoT

Azure Industrial IoT Platform
MIT License

OPC Publisher 2.9.8 fails to start in a Docker container when cgroup v2 is used #2261

Open morti12 opened 2 weeks ago

morti12 commented 2 weeks ago

Describe the bug

On Linux machines that use cgroup v2, an exception is thrown when /sys/fs/cgroup/memory.max contains the default value max:

Unhandled exception. Autofac.Core.DependencyResolutionException: An exception was thrown while activating Azure.IIoT.OpcUa.Publisher.Services.PublisherDiagnosticCollector -> λ:Microsoft.Extensions.Diagnostics.ResourceMonitoring.IResourceMonitor -> Microsoft.Extensions.Diagnostics.ResourceMonitoring.ResourceMonitorService -> Microsoft.Extensions.Diagnostics.ResourceMonitoring.Linux.LinuxUtilizationProvider.
---> Autofac.Core.DependencyResolutionException: An exception was thrown while invoking the constructor 'Void .ctor(Microsoft.Extensions.Options.IOptions`1[Microsoft.Extensions.Diagnostics.ResourceMonitoring.ResourceMonitoringOptions], Microsoft.Extensions.Diagnostics.ResourceMonitoring.Linux.ILinuxUtilizationParser, System.Diagnostics.Metrics.IMeterFactory, System.TimeProvider)' on type 'LinuxUtilizationProvider'.
---> System.InvalidOperationException: Could not parse '/sys/fs/cgroup/memory.max' content. Expected to find available memory in bytes but got 'max
' instead.
   at Microsoft.Shared.Diagnostics.Throw.InvalidOperationException(String message)
   at Microsoft.Extensions.Diagnostics.ResourceMonitoring.Linux.LinuxUtilizationParserCgroupV2.GetAvailableMemoryInBytes()
   at Microsoft.Extensions.Diagnostics.ResourceMonitoring.Linux.LinuxUtilizationProvider..ctor(IOptions`1 options, ILinuxUtilizationParser parser, IMeterFactory meterFactory, TimeProvider timeProvider)
   at lambda_method223(Closure, Object[])
   at Autofac.Core.Activators.Reflection.BoundConstructor.Instantiate()
   --- End of inner exception stack trace ---
   at Autofac.Core.Activators.Reflection.BoundConstructor.Instantiate()
   at Autofac.Core.Activators.Reflection.ReflectionActivator.<>c__DisplayClass14_0.<UseSingleConstructorActivation>b__0(ResolveRequestContext context, Action`1 next)
   at Autofac.Core.Resolving.Middleware.DelegateMiddleware.Execute(ResolveRequestContext context, Action`1 next)
   at Autofac.Core.Resolving.Pipeline.ResolvePipelineBuilder.<>c__DisplayClass14_0.<BuildPipeline>b__1(ResolveRequestContext context)
   at Autofac.Core.Resolving.Middleware.DisposalTrackingMiddleware.Execute(ResolveRequestContext context, Action`1 next)
   at Autofac.Core.Resolving.Pipeline.ResolvePipelineBuilder.<>c__DisplayClass14_0.<BuildPipeline>b__1(ResolveRequestContext context)
   at Autofac.Core.Resolving.Middleware.ActivatorErrorHandlingMiddleware.Execute(ResolveRequestContext context, Action`1 next)
   --- End of inner exception stack trace ---
   at Autofac.Core.Resolving.Middleware.ActivatorErrorHandlingMiddleware.Execute(ResolveRequestContext context, Action`1 next)
   at Autofac.Core.Resolving.Pipeline.ResolvePipelineBuilder.<>c__DisplayClass14_0.<BuildPipeline>b__1(ResolveRequestContext context)
   at Autofac.Core.Pipeline.ResolvePipeline.Invoke(ResolveRequestContext context)
   at Autofac.Core.Resolving.Middleware.RegistrationPipelineInvokeMiddleware.Execute(ResolveRequestContext context, Action`1 next)
   at Autofac.Core.Resolving.Pipeline.ResolvePipelineBuilder.<>c__DisplayClass14_0.<BuildPipeline>b__1(ResolveRequestContext context)
   at Autofac.Core.Resolving.Middleware.SharingMiddleware.<>c__DisplayClass5_0.<Execute>b__0()
   at Autofac.Core.Lifetime.LifetimeScope.CreateSharedInstance(Guid id, Func`1 creator)
   at Autofac.Core.Lifetime.LifetimeScope.CreateSharedInstance(Guid primaryId, Nullable`1 qualifyingId, Func`1 creator)
   at Autofac.Core.Resolving.Middleware.SharingMiddleware.Execute(ResolveRequestContext context, Action`1 next)
   at Autofac.Core.Resolving.Pipeline.ResolvePipelineBuilder.<>c__DisplayClass14_0.<BuildPipeline>b__1(ResolveRequestContext context)
   at Autofac.Core.Resolving.Pipeline.ResolvePipelineBuilder.<>c__DisplayClass14_0.<BuildPipeline>b__1(ResolveRequestContext context)
   at Autofac.Core.Resolving.Middleware.CircularDependencyDetectorMiddleware.Execute(ResolveRequestContext context, Action`1 next)
   at Autofac.Core.Resolving.Pipeline.ResolvePipelineBuilder.<>c__DisplayClass14_0.<BuildPipeline>b__1(ResolveRequestContext context)
   at Autofac.Extensions.DependencyInjection.KeyedServiceMiddleware.Execute(ResolveRequestContext context, Action`1 next)
   at Autofac.Core.Resolving.Pipeline.ResolvePipelineBuilder.<>c__DisplayClass14_0.<BuildPipeline>b__1(ResolveRequestContext context)
   at Autofac.Core.Pipeline.ResolvePipeline.Invoke(ResolveRequestContext context)
   at Autofac.Core.Resolving.ResolveOperation.GetOrCreateInstance(ISharingLifetimeScope currentOperationScope, ResolveRequest& request)
   at Autofac.Core.Resolving.ResolveOperation.ExecuteOperation(ResolveRequest& request)
   at Autofac.Core.Resolving.ResolveOperation.Execute(ResolveRequest& request)
   at Autofac.Core.Lifetime.LifetimeScope.ResolveComponent(ResolveRequest& request)
   at Autofac.Core.Container.ResolveComponent(ResolveRequest& request)
   at Autofac.Core.Container.Autofac.IComponentContext.ResolveComponent(ResolveRequest& request)
   at Autofac.Builder.StartableManager.StartStartableComponents(IDictionary`2 properties, IComponentContext componentContext)
   at Autofac.ContainerBuilder.Build(ContainerBuildOptions options)
   at Autofac.Extensions.DependencyInjection.AutofacServiceProviderFactory.CreateServiceProvider(ContainerBuilder containerBuilder)
   at Microsoft.Extensions.Hosting.Internal.ServiceFactoryAdapter`1.CreateServiceProvider(Object containerBuilder)
   at Microsoft.Extensions.Hosting.HostBuilder.InitializeServiceProvider()
   at Microsoft.Extensions.Hosting.HostBuilder.Build()
   at Azure.IIoT.OpcUa.Publisher.Module.Program.RunAsync(String[] args, CancellationToken ct) in /__w/1/s/Industrial-IoT/src/Azure.IIoT.OpcUa.Publisher.Module/src/Program.cs:line 71
   at Azure.IIoT.OpcUa.Publisher.Module.Program.Main(String[] args) in /__w/1/s/Industrial-IoT/src/Azure.IIoT.OpcUa.Publisher.Module/src/Program.cs:line 59

This blocks the startup of the OPC Publisher.
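
For illustration, here is a minimal Python sketch of a parser that tolerates the default value. It is not the code of Microsoft.Extensions.Diagnostics.ResourceMonitoring; the function name parse_memory_max is hypothetical, and it assumes the file holds either the literal max (no limit configured) or a byte count.

```python
from typing import Optional


def parse_memory_max(content: str) -> Optional[int]:
    """Parse the content of a cgroup v2 memory.max file.

    Returns the limit in bytes, or None when the file holds the literal
    "max", which means that no memory limit is configured.
    """
    value = content.strip()  # the file content ends with a newline
    if value == "max":
        return None
    if not (value.isascii() and value.isdigit()):
        raise ValueError(f"unexpected memory.max content: {content!r}")
    return int(value)
```

With this approach, a caller can treat None as "unlimited" and fall back to another source, such as the total memory of the host, instead of failing at startup.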

We also tried setting a memory limit with the --memory=2g argument for the Docker container. This temporarily works around the memory parsing issue, but then cpu.max throws an error.
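
The cpu.max error itself is not shown here, so the following is only a sketch under the assumption that it has the same cause. In cgroup v2, cpu.max holds two fields, a quota and a period, and the quota defaults to the literal max. The function name parse_cpu_max is hypothetical.

```python
from typing import Optional


def parse_cpu_max(content: str) -> Optional[float]:
    """Parse the content of a cgroup v2 cpu.max file.

    The file holds "<quota> <period>" in microseconds. Returns the number
    of CPUs the group may use, or None when the quota is the literal "max",
    which means that no CPU limit is configured.
    """
    fields = content.split()
    if len(fields) != 2:
        raise ValueError(f"unexpected cpu.max content: {content!r}")
    quota, period = fields
    if quota == "max":
        return None
    return int(quota) / int(period)
```

For example, a file containing "200000 100000" corresponds to a limit of two CPUs, while the default "max 100000" yields None.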

Desktop (please complete the following information):

$ cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="9.3 (Plow)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="9.3"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux 9.3 (Plow)"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_BUGZILLA_PRODUCT_VERSION=9.3
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.3"

Docker

Docker version 26.1.4-1

Additional context

Between OPC Publisher versions 2.9.6 and 2.9.8, new metrics (services.AddResourceMonitoring();) were added, which is actually nice 😊. This was done using a library called Microsoft.Extensions.Diagnostics.ResourceMonitoring. It seems that there is a bug in version 8.6.0, which can't handle the default value max.

I created an issue for the bug in the library: https://github.com/dotnet/extensions/issues/5241

marcschier commented 2 weeks ago

It looks like the workaround is to fall back to 8.4.0. That should be easy until it is fixed in an upcoming release.

marcschier commented 2 weeks ago

I reverted the package. There is a preview build of 2.9.9 with 8.4.0, tagged 2.9.9-preview2 (mcr.microsoft.com/iotedge/opc-publisher:2.9.9-preview2).

@morti12, it would be fantastic if you could try this (ASAP) and let me know whether it fixes your issue. I would like to release 2.9.9 in the middle of this week, and this would help decide which build to go with. Otherwise (or if the older version introduces CVEs) we will go with 8.6.0 for the 2.9.9 release.

marcschier commented 2 weeks ago

Moving to 2.9.10 but waiting for validation on 2.9.9-preview2 that the issue is worked around.

morti12 commented 2 weeks ago

We tried the opc-publisher:2.9.9-preview2 image and we still get an error, but a different one:

Unhandled exception. Autofac.Core.DependencyResolutionException: An exception was thrown while activating Azure.IIoT.OpcUa.Publisher.Services.PublisherDiagnosticCollector -> λ:Microsoft.Extensions.Diagnostics.ResourceMonitoring.IResourceMonitor -> Microsoft.Extensions.Diagnostics.ResourceMonitoring.ResourceMonitorService -> Microsoft.Extensions.Diagnostics.ResourceMonitoring.Linux.LinuxUtilizationProvider.
---> Autofac.Core.DependencyResolutionException: An exception was thrown while invoking the constructor 'Void .ctor(Microsoft.Extensions.Options.IOptions`1[Microsoft.Extensions.Diagnostics.ResourceMonitoring.ResourceMonitoringOptions], Microsoft.Extensions.Diagnostics.ResourceMonitoring.Linux.LinuxUtilizationParser, System.Diagnostics.Metrics.IMeterFactory, System.TimeProvider)' on type 'LinuxUtilizationProvider'.
---> System.IO.DirectoryNotFoundException: Could not find a part of the path '/sys/fs/cgroup/memory/memory.limit_in_bytes'.

We checked version 8.4.0 of Microsoft.Extensions.Diagnostics.ResourceMonitoring, and it does not support cgroup v2. Support was added in version 8.5.0.
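
The two errors point at different file layouts: /sys/fs/cgroup/memory/memory.limit_in_bytes belongs to cgroup v1, while /sys/fs/cgroup/memory.max belongs to cgroup v2. As a rough sketch of how a process could tell them apart (not how the library does it; detect_cgroup_version is a hypothetical name), one can check for the cgroup.controllers file, which exists at the root of a cgroup v2 hierarchy:

```python
from pathlib import Path


def detect_cgroup_version(root: str = "/sys/fs/cgroup") -> int:
    """Return 2 for a cgroup v2 (unified) layout and 1 for a v1 layout.

    The root parameter exists so that the check can be pointed at a
    different directory, for example in tests.
    """
    base = Path(root)
    # cgroup v2 exposes cgroup.controllers at the root of the hierarchy.
    if (base / "cgroup.controllers").is_file():
        return 2
    # cgroup v1 has one directory per controller, e.g. "memory".
    if (base / "memory" / "memory.limit_in_bytes").is_file():
        return 1
    raise RuntimeError(f"no known cgroup layout found under {root}")
```

On the RHEL 9.3 host described above, such a check would be expected to return 2, which matches the missing v1 path in the error.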

Would it be an option to disable resource monitoring via a command-line option until this issue is fixed in the ResourceMonitoring library?

marcschier commented 2 weeks ago

That is a bummer. If you are ok with a switch, I will add this to 2.9.9.

morti12 commented 2 weeks ago

This would be really nice 👍 Thanks! I think the fix in the ResourceMonitoring library will take some time.

marcschier commented 2 weeks ago

@morti12, could you try out the "2.9.9-preview3" tag of the opc-publisher container image, running it with the --dr command line option? It will disable resource monitoring. If it works for you, let me know, and we will then release 2.9.9 this week (barring any unforeseen issues).

I will track this again under 2.9.10 and monitor the linked issue.

morti12 commented 1 week ago

@marcschier, we checked it and it works. The OPC Publisher 2.9.9-preview3 is starting up 💪 thanks 👍

marcschier commented 1 week ago

2.9.9 image has been released.

alxy commented 16 hours ago

I have just tried to upgrade to 2.9.9-preview3 as well, in a fairly standard IoT Edge setup (running on Ubuntu, with IoT Edge installed as per the documentation), and I got the same problem:

2.9.9.3+926a31bdda (.NET 8.0.6/linux-x64/OPC Stack 1.5.374.70)

Unhandled exception. Autofac.Core.DependencyResolutionException: An exception was thrown while activating Azure.IIoT.OpcUa.Publisher.Services.PublisherDiagnosticCollector -> λ:Microsoft.Extensions.Diagnostics.ResourceMonitoring.IResourceMonitor -> Microsoft.Extensions.Diagnostics.ResourceMonitoring.ResourceMonitorService -> Microsoft.Extensions.Diagnostics.ResourceMonitoring.Linux.LinuxUtilizationProvider.
 ---> Autofac.Core.DependencyResolutionException: An exception was thrown while invoking the constructor 'Void .ctor(Microsoft.Extensions.Options.IOptions`1[Microsoft.Extensions.Diagnostics.ResourceMonitoring.ResourceMonitoringOptions], Microsoft.Extensions.Diagnostics.ResourceMonitoring.Linux.ILinuxUtilizationParser, System.Diagnostics.Metrics.IMeterFactory, System.TimeProvider)' on type 'LinuxUtilizationProvider'.
 ---> System.InvalidOperationException: Could not parse '/sys/fs/cgroup/memory.max' content. Expected to find available memory in bytes but got 'max
' instead.
   at Microsoft.Shared.Diagnostics.Throw.InvalidOperationException(String message)
   at Microsoft.Extensions.Diagnostics.ResourceMonitoring.Linux.LinuxUtilizationParserCgroupV2.GetAvailableMemoryInBytes()
   at Microsoft.Extensions.Diagnostics.ResourceMonitoring.Linux.LinuxUtilizationProvider..ctor(IOptions`1 options, ILinuxUtilizationParser parser, IMeterFactory meterFactory, TimeProvider timeProvider)
   at lambda_method229(Closure, Object[])
   at Autofac.Core.Activators.Reflection.BoundConstructor.Instantiate()
   --- End of inner exception stack trace ---
   at Autofac.Core.Activators.Reflection.BoundConstructor.Instantiate()
   at Autofac.Core.Activators.Reflection.ReflectionActivator.<>c__DisplayClass14_0.<UseSingleConstructorActivation>b__0(ResolveRequestContext context, Action`1 next)
   at Autofac.Core.Resolving.Middleware.DelegateMiddleware.Execute(ResolveRequestContext context, Action`1 next)
   at Autofac.Core.Resolving.Pipeline.ResolvePipelineBuilder.<>c__DisplayClass14_0.<BuildPipeline>b__1(ResolveRequestContext context)
   at Autofac.Core.Resolving.Middleware.DisposalTrackingMiddleware.Execute(ResolveRequestContext context, Action`1 next)
   at Autofac.Core.Resolving.Pipeline.ResolvePipelineBuilder.<>c__DisplayClass14_0.<BuildPipeline>b__1(ResolveRequestContext context)
   at Autofac.Core.Resolving.Middleware.ActivatorErrorHandlingMiddleware.Execute(ResolveRequestContext context, Action`1 next)
   --- End of inner exception stack trace ---
   at Autofac.Core.Resolving.Middleware.ActivatorErrorHandlingMiddleware.Execute(ResolveRequestContext context, Action`1 next)
   at Autofac.Core.Resolving.Pipeline.ResolvePipelineBuilder.<>c__DisplayClass14_0.<BuildPipeline>b__1(ResolveRequestContext context)
   at Autofac.Core.Pipeline.ResolvePipeline.Invoke(ResolveRequestContext context)
   at Autofac.Core.Resolving.Middleware.RegistrationPipelineInvokeMiddleware.Execute(ResolveRequestContext context, Action`1 next)
   at Autofac.Core.Resolving.Pipeline.ResolvePipelineBuilder.<>c__DisplayClass14_0.<BuildPipeline>b__1(ResolveRequestContext context)
   at Autofac.Core.Resolving.Middleware.SharingMiddleware.<>c__DisplayClass5_0.<Execute>b__0()
   at Autofac.Core.Lifetime.LifetimeScope.CreateSharedInstance(Guid id, Func`1 creator)
   at Autofac.Core.Lifetime.LifetimeScope.CreateSharedInstance(Guid primaryId, Nullable`1 qualifyingId, Func`1 creator)
   at Autofac.Core.Resolving.Middleware.SharingMiddleware.Execute(ResolveRequestContext context, Action`1 next)
   at Autofac.Core.Resolving.Pipeline.ResolvePipelineBuilder.<>c__DisplayClass14_0.<BuildPipeline>b__1(ResolveRequestContext context)
   at Autofac.Core.Resolving.Pipeline.ResolvePipelineBuilder.<>c__DisplayClass14_0.<BuildPipeline>b__1(ResolveRequestContext context)
   at Autofac.Core.Resolving.Middleware.CircularDependencyDetectorMiddleware.Execute(ResolveRequestContext context, Action`1 next)
   at Autofac.Core.Resolving.Pipeline.ResolvePipelineBuilder.<>c__DisplayClass14_0.<BuildPipeline>b__1(ResolveRequestContext context)
   at Autofac.Extensions.DependencyInjection.KeyedServiceMiddleware.Execute(ResolveRequestContext context, Action`1 next)
   at Autofac.Core.Resolving.Pipeline.ResolvePipelineBuilder.<>c__DisplayClass14_0.<BuildPipeline>b__1(ResolveRequestContext context)
   at Autofac.Core.Pipeline.ResolvePipeline.Invoke(ResolveRequestContext context)
   at Autofac.Core.Resolving.ResolveOperation.GetOrCreateInstance(ISharingLifetimeScope currentOperationScope, ResolveRequest& request)
   at Autofac.Core.Resolving.ResolveOperation.ExecuteOperation(ResolveRequest& request)
   at Autofac.Core.Resolving.ResolveOperation.Execute(ResolveRequest& request)
   at Autofac.Core.Lifetime.LifetimeScope.ResolveComponent(ResolveRequest& request)
   at Autofac.Core.Container.ResolveComponent(ResolveRequest& request)
   at Autofac.Core.Container.Autofac.IComponentContext.ResolveComponent(ResolveRequest& request)
   at Autofac.Builder.StartableManager.StartStartableComponents(IDictionary`2 properties, IComponentContext componentContext)
   at Autofac.ContainerBuilder.Build(ContainerBuildOptions options)
   at Autofac.Extensions.DependencyInjection.AutofacServiceProviderFactory.CreateServiceProvider(ContainerBuilder containerBuilder)
   at Microsoft.Extensions.Hosting.Internal.ServiceFactoryAdapter`1.CreateServiceProvider(Object containerBuilder)
   at Microsoft.Extensions.Hosting.HostBuilder.InitializeServiceProvider()
   at Microsoft.Extensions.Hosting.HostBuilder.Build()
   at Azure.IIoT.OpcUa.Publisher.Module.Program.RunAsync(String[] args, CancellationToken ct) in /__w/1/s/Industrial-IoT/src/Azure.IIoT.OpcUa.Publisher.Module/src/Program.cs:line 71
   at Azure.IIoT.OpcUa.Publisher.Module.Program.Main(String[] args) in /__w/1/s/Industrial-IoT/src/Azure.IIoT.OpcUa.Publisher.Module/src/Program.cs:line 59

Do I need to do something else to get this release running?

morti12 commented 14 hours ago

In 2.9.9-preview3, @marcschier introduced a CLI argument / flag to disable resource monitoring. You just need to add the --dr argument at startup. The changes can be found here: https://github.com/Azure/Industrial-IoT/commit/926a31bddadfc7d6cf6422a16d703e741c3a9cd3
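
To illustrate the general pattern of such an opt-out switch, here is a small Python sketch. It is not the OPC Publisher implementation (see the linked commit for that); only the --dr flag name is taken from this thread, and the service names and the configure_services function are hypothetical.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a switch that disables resource monitoring."""
    parser = argparse.ArgumentParser(description="Hypothetical host process")
    parser.add_argument(
        "--dr",
        dest="disable_resource_monitoring",
        action="store_true",
        help="do not register the resource monitoring service",
    )
    return parser


def configure_services(argv: Optional[List[str]] = None) -> List[str]:
    """Return the names of the services to register for the given arguments."""
    args = build_parser().parse_args(argv)
    services = ["publisher"]
    # Register the optional service only when it has not been switched off,
    # so a failure in its startup code cannot block the whole process.
    if not args.disable_resource_monitoring:
        services.append("resource-monitoring")
    return services
```

Calling configure_services(["--dr"]) leaves out the monitoring service, while calling it with an empty argument list keeps the default behavior.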