dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

rhel8 arm64 throws NullReferenceExceptions #43349

Closed (tmds closed 3 years ago)

tmds commented 4 years ago

In our CI builds, each run on RHEL8 arm64 shows NullReferenceExceptions in the log.

On the same arm64 host with a Fedora 32 VM there are no NullReferenceExceptions. When I build and test on another RHEL8 arm64 machine, NullReferenceExceptions also show up in unexpected places.

Some example stack traces from CI log:

Microsoft.Extensions.Hosting tests

        System.NullReferenceException : Object reference not set to an instance of an object.
        Stack Trace:
          /home/tester/runtime/src/coreclr/src/System.Private.CoreLib/src/System/Array.CoreCLR.cs(521,0): at System.SZArrayHelper.GetEnumerator[T]()
          /home/tester/runtime/src/libraries/System.Linq/src/System/Linq/Single.cs(136,0): at System.Linq.Enumerable.SingleOrDefault[TSource](IEnumerable`1 source, Func`2 predicate)
             at System.Reflection.NetCoreReflectionExtensions.GetConstructor(Type type, BindingFlags bindingAttr, Object binder, Type[] types, Object[] modifiers)
             at Castle.DynamicProxy.Generators.InterfaceProxyWithTargetGenerator.EnsureValidBaseType(Type type)
             at Castle.DynamicProxy.Generators.InterfaceProxyWithTargetGenerator.GenerateCode(Type proxyTargetType, Type[] interfaces, ProxyGenerationOptions options)
             at Castle.DynamicProxy.DefaultProxyBuilder.CreateInterfaceProxyTypeWithoutTarget(Type interfaceToProxy, Type[] additionalInterfacesToProxy, ProxyGenerationOptions options)
             at Castle.DynamicProxy.ProxyGenerator.CreateInterfaceProxyTypeWithoutTarget(Type interfaceToProxy, Type[] additionalInterfacesToProxy, ProxyGenerationOptions options)
             at Castle.DynamicProxy.ProxyGenerator.CreateInterfaceProxyWithoutTarget(Type interfaceToProxy, Type[] additionalInterfacesToProxy, ProxyGenerationOptions options, IInterceptor[] interceptors)
             at Moq.CastleProxyFactory.CreateProxy(Type mockType, IInterceptor interceptor, Type[] interfaces, Object[] arguments)
             at Moq.Mock`1.InitializeInstance()
             at Moq.Mock`1.OnGetObject()
             at Moq.Mock.get_Object()
             at Moq.Mock`1.get_Object()
          /home/tester/runtime/src/libraries/Microsoft.Extensions.Hosting/tests/UnitTests/Internal/HostTests.cs(583,0): at Microsoft.Extensions.Hosting.Internal.HostTests.<>c__DisplayClass22_0.<HostStopAsyncCanBeCancelledEarly>b__3(IServiceCollection services)
          /home/tester/runtime/src/libraries/Microsoft.Extensions.Hosting/src/HostingHostBuilderExtensions.cs(121,0): at Microsoft.Extensions.Hosting.HostingHostBuilderExtensions.<>c__DisplayClass7_0.<ConfigureServices>b__0(HostBuilderContext context, 

System.Linq.Parallel.Tests

        System.NullReferenceException : Object reference not set to an instance of an object.
        Stack Trace:
          /home/tester/runtime/src/libraries/System.Linq.Parallel/src/System/Linq/Parallel/Enumerables/ParallelQuery.cs(104,0): at System.Linq.ParallelQuery`1.Cast[TCastTo]()
          /home/tester/runtime/src/libraries/System.Linq.Parallel/src/System/Linq/ParallelEnumerable.cs(5271,0): at System.Linq.ParallelEnumerable.Cast[TResult](ParallelQuery source)
          /home/tester/runtime/src/libraries/System.Linq.Parallel/tests/QueryOperators/CastTests.cs(105,0): at System.Linq.Parallel.Tests.CastTests.Cast_Empty(Labeled`1 labeled, Int32 count)

System.Text.Json.Serialization.Tests

        System.NullReferenceException : Object reference not set to an instance of an object.
        Stack Trace:
          /home/tester/runtime/src/libraries/System.Text.Json/tests/Serialization/SerializationWrapper.cs(104,0): at System.Text.Json.Serialization.Tests.SerializationWrapper.WriterSerializerWrapper.SerializeWrapper[T](T value, JsonSerializerOptions options)
          /home/tester/runtime/src/libraries/System.Text.Json/tests/Serialization/PolymorphicTests.cs(125,0): at System.Text.Json.Serialization.Tests.PolymorphicTests.ArrayAsRootObject()
          --- End of stack trace from previous location ---

@janvorli I don't know how to debug this. Can you take a look or give me some pointers?

cc @omajid

janvorli commented 3 years ago

@tmds I believe the issue doesn't occur with 4 kB memory pages. Only when the distro uses larger pages does the block with the cookie "leak" into code.
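To tell which case a given machine falls into, the page size can be read from the OS. The following is a minimal sketch, not something taken from the thread: the helper names `page_size_bytes` and `uses_large_pages` are hypothetical, and it assumes a Unix-like system where Python's `os.sysconf` exposes `SC_PAGE_SIZE`.

```python
import os


def page_size_bytes() -> int:
    """Return the memory page size reported by the OS, in bytes."""
    return os.sysconf("SC_PAGE_SIZE")


def uses_large_pages(size: int, baseline: int = 4096) -> bool:
    """Return True when the page size is larger than the 4 kB baseline."""
    return size > baseline


if __name__ == "__main__":
    size = page_size_bytes()
    print(f"page size: {size} bytes, larger than 4 kB: {uses_large_pages(size)}")
```

On a kernel configured with 64 kB pages, `page_size_bytes()` would be expected to report 65536, so `uses_large_pages` would return True.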

mangod9 commented 3 years ago

@tmds @janvorli is any fix required here for .net 6?

tmds commented 3 years ago

is any fix required here for .net 6?

Yes.

The cookie issue still causes our builds to fail from the start. Once that is fixed, and it has rippled into the SDK that gets used from the .dotnet folder, I suspect we'll see the NullReferenceExceptions again.

Our plan is to build .NET 6 for arm64, but this issue needs to be resolved for that.

I've looked at the problem but I couldn't figure out the root cause. I think it is in the kernel. I can ask kernel engineers to have a look, but they'll want a better reproducer.

crummel commented 3 years ago

@janvorli Now that preview6 is wrapping up, any idea on when you'll be able to take another look at this?

janvorli commented 3 years ago

I have created a PR in arcade to fix the rootfs build for Alpine 3.9. After consulting with @mthalman, I am going to get in my original change to the docker images and keep building for Alpine on 3.9 for now, then move to 3.13 after Preview 7. Then I can get in my change to use the lld linker and start looking into the null reference issues. We still have null reference issues on Apple Silicon, so chances are they are related.

janvorli commented 3 years ago

@omajid, @tmds I have tried to run all coreclr pri 1 tests on RHEL 8 with 64kB page size using the latest main and no tests were failing with NullReferenceException anymore. I had to run the tests manually (enumerating all of the related .sh files and running them with an added -coreroot argument), since the Preview 6 SDK / runtime that's normally used to execute xunit doesn't have the fix for the GS cookie mapping issue, which I fixed recently by switching to the lld linker.

Out of all the coreclr pri 1 tests, 10052 succeeded, 29 failed and 3 timed out. 15 of the failures are "Unhandled exception. System.InvalidProgramException: Vararg calling convention not supported."; a few were caused by the testing methodology (some tests can run properly only via xunit); and the remaining failures are of unknown kind (no crashes, just error codes meaning the test didn't pass as expected). So I am closing this issue.
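The manual run described above (enumerate the test .sh files, run each with -coreroot, tally the results) could be scripted roughly as follows. This is a sketch under assumptions, not the procedure actually used: the function names, the default timeout, and the rule that exit code 0 means "passed" are all hypothetical. Only the `-coreroot` argument comes from the comment.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def find_test_scripts(root: Path) -> list[Path]:
    """Collect every .sh file under root, in a stable sorted order."""
    return sorted(p for p in root.rglob("*.sh") if p.is_file())


def build_command(script: Path, core_root: Path, shell: str = "bash") -> list[str]:
    """Build the command line for one test script with -coreroot appended."""
    return [shell, str(script), "-coreroot", str(core_root)]


def run_all(root: Path, core_root: Path, timeout_s: float = 600.0,
            shell: str = "bash") -> dict[str, int]:
    """Run each script and tally results.

    Assumption: exit code 0 means the test passed; anything else is a failure.
    """
    tally = {"passed": 0, "failed": 0, "timed_out": 0}
    for script in find_test_scripts(root):
        try:
            result = subprocess.run(
                build_command(script, core_root, shell),
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            tally["timed_out"] += 1
            continue
        tally["passed" if result.returncode == 0 else "failed"] += 1
    return tally
```

Calling `run_all(tests_dir, core_root_dir)` with the directory holding the built test scripts and the Core_Root directory returns the three counts. Both paths depend on the local build layout, so they are left to the caller.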

omajid commented 3 years ago

Thanks, @janvorli ! Any idea when a fix might land such that building runtime works out of the box? Maybe in a month or so?

tmds commented 3 years ago

I have tried to run all coreclr pri 1 tests on RHEL 8 with 64kB page size using the latest main and no tests were failing with NullReferenceException anymore.

I'm not sure you're running tests in a way that shows the NullReferenceException issue is fixed. When I ran these tests before, none threw NullReferenceException (https://github.com/dotnet/runtime/issues/43349#issuecomment-757922450). The exceptions happened as part of running the library tests.

The NullReferenceExceptions were happening before we started hitting the GSCookie issue. It's clear https://github.com/dotnet/runtime/pull/52244 fixes the GSCookie issue (https://github.com/dotnet/runtime/issues/43349#issuecomment-807867988), but I don't understand how it fixes the NullReferenceExceptions.

janvorli commented 3 years ago

I believe the NullReferenceException was fixed by another change, #53510. That was what was causing those on macOS arm64 and it was not Apple specific.

janvorli commented 3 years ago

Any idea when a fix might land such that building runtime works out of the box?

The fix will be part of RC1, which will come after preview 7.

tmds commented 3 years ago

I believe the NullReferenceException was fixed by another change, #53510. That was what was causing those on macOS arm64 and it was not Apple specific.

Great! Thank you for the reference.