RenderKit / embree

Embree ray tracing kernels repository.
Apache License 2.0

Debug build has alignment issue with AVX512 #342

Open pborsutzki opened 2 years ago

pborsutzki commented 2 years ago

Hi,

I built a Windows debug version of Embree 3.13.1 using these commands:

cmake -G "Visual Studio 16 2019" -T "Intel C++ Compiler 19.2" ^
  -DTBB_ROOT="E:/tmp/embree/tbb2017_20161128oss_win/tbb2017_20161128oss/" ^
  -DEMBREE_ISPC_EXECUTABLE="C:/Program Files/ISPC/ispc-v1.16.1-windows/bin/ispc.exe" ^
  -DEMBREE_STACK_PROTECTOR=ON -DEMBREE_TUTORIALS=ON -DEMBREE_MAX_ISA=AVX512 ^
  -DCMAKE_INSTALL_PREFIX=install ..
cmake --build . --config Debug -- /m /nologo /verbosity:n

Please forgive me for the old tbb version (I doubt it is relevant, but it is complicated to upgrade it in other dependencies of our project).

Running any of the examples on an AVX512 machine (Skylake, Intel(R) Xeon(R) Gold 5120) now consistently crashes, with call stacks ending in vboolf<4> constructors:

>   embree3.dll!embree::vboolf_impl<4>::vboolf_impl(__m128 input) Line 37   C++
    embree3.dll!embree::operator!(embree::vboolf_impl<4> *, const embree::vboolf_impl<4> & a) Line 75   C++
    embree3.dll!embree::operator<=(embree::vboolf_impl<4> *, const embree::vint_impl<4> & a, const embree::vint_impl<4> & b) Line 343   C++
    embree3.dll!embree::avx512::intersectNode(const embree::AABBNode_t<embree::NodeRefPtr<4>, 4> * node, const embree::avx512::TravRay<4, 0> & ray, embree::vfloat_impl<4> & dist) Line 436 C++
    embree3.dll!embree::avx512::BVHNNodeIntersector1<4, 1, 0>::intersect(const embree::NodeRefPtr<4> & node, const embree::avx512::TravRay<4, 0> & ray, float time, embree::vfloat_impl<4> & dist, unsigned __int64 & mask) Line 1213   C++
    embree3.dll!embree::avx512::BVHNIntersector1<4, 1, 0, embree::avx512::ArrayIntersector1<embree::avx512::TriangleMIntersector1Moeller<4, 1> > >::intersect(const embree::Accel::Intersectors * This, embree::RayHitK<1> & ray, embree::IntersectContext * context) Line 87   C++
    embree3.dll!embree::Accel::Intersectors::intersect(RTCRayHit & ray, embree::IntersectContext * context) Line 308    C++
    embree3.dll!rtcIntersect1(RTCSceneTy * hscene, RTCIntersectContext * user_context, RTCRayHit * rayhit) Line 460 C++
    triangle_geometry.exe!embree::renderPixelStandard(const embree::TutorialData & data, int x, int y, int * pixels, const unsigned int width, const unsigned int height, const float time, const embree::Camera::ISPCCamera & camera, embree::RayStats & stats) Line 127   C++
    triangle_geometry.exe!embree::renderTileTask(int taskIndex, int threadIndex, int * pixels, const unsigned int width, const unsigned int height, const float time, const embree::Camera::ISPCCamera & camera, const int numTilesX, const int numTilesY) Line 174 C++
    triangle_geometry.exe!<lambda_0>::operator()(const embree::range<size_t> & range) Line 190  C++
    triangle_geometry.exe!<lambda_11>::operator()(const tbb::blocked_range<size_t> & r) Line 73 C++
    triangle_geometry.exe!tbb::interface9::internal::start_for<tbb::blocked_range<size_t>, lambda [] type at line 459075, col. 73, const tbb::auto_partitioner>::run_body(tbb::blocked_range<size_t> & r) Line 102  C++
    triangle_geometry.exe!tbb::interface9::internal::balancing_partition_type<tbb::interface9::internal::adaptive_mode<tbb::interface9::internal::auto_partition_type> >::work_balance(tbb::interface9::internal::start_for<tbb::blocked_range<size_t>, lambda [] type at line 459075, col. 73, const tbb::auto_partitioner> & start, tbb::blocked_range<size_t> & range) Line 444  C++
    triangle_geometry.exe!tbb::interface9::internal::partition_type_base<tbb::interface9::internal::auto_partition_type>::execute(tbb::interface9::internal::start_for<tbb::blocked_range<size_t>, lambda [] type at line 459075, col. 73, const tbb::auto_partitioner> & start, tbb::blocked_range<size_t> & range) Line 256   C++
    triangle_geometry.exe!tbb::interface9::internal::start_for<tbb::blocked_range<size_t>, lambda [] type at line 459075, col. 73, const tbb::auto_partitioner>::execute() Line 128 C++
    tbb_debug.dll!tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task & parent, tbb::task * child) Line 501  C++
    tbb_debug.dll!tbb::tbb::internal::arena::process(tbb::internal::generic_scheduler & s) Line 159 C++
    tbb_debug.dll!tbb::internal::tbb::internal::market::process(rml::job & j) Line 678  C++
    tbb_debug.dll!tbb::internal::tbb::internal::rml::private_worker::run() Line 271 C++
    tbb_debug.dll!tbb::internal::tbb::internal::rml::private_worker::thread_routine(void * arg) Line 225    C++
    msvcr110d.dll!_callthreadstartex() Line 354 C
    msvcr110d.dll!_threadstartex(void * ptd) Line 337   C
    kernel32.dll!BaseThreadInitThunk() Unknown
    ntdll.dll!RtlUserThreadStart() Unknown

The relevant assembly is this (constructor of `vboolf<4>`):

    __forceinline vboolf(__m128 input) : v(input) {}
 push        rbp  
 sub         rsp,20h  
 lea         rbp,[rsp+20h]  
 mov         qword ptr [rsp],rax  
 mov         rax,1Ch  
 mov         dword ptr [rsp+rax],0CCCCCCCCh  
 sub         rax,4  
 cmp         rax,4  
 jg          embree::vboolf_impl<4>::vboolf_impl+15h (07FFB1A7D835Dh)  
 mov         rax,qword ptr [rsp]  
 mov         dword ptr [rsp],0CCCCCCCCh  
 mov         dword ptr [rsp+4],0CCCCCCCCh  
 mov         qword ptr [this],rcx  
 mov         qword ptr [&input],rdx  
 mov         rax,qword ptr [this]  
 movaps      xmm0,xmmword ptr [rdx]  
 movaps      xmmword ptr [rax],xmm0  // <- read access violation here
 mov         rax,qword ptr [this]  
 lea         rsp,[rbp]  
 pop         rbp  
 ret  

The Visual Studio debugger reports a read access violation. I am not sure why it says "read"; I would have expected a "write": Exception thrown at 0x00007FFB1A7D8390 (embree3.dll) in triangle_geometry.exe: 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF.

In this example the debugger also shows RAX = 0000007D335F1268, meaning the target of the movaps is not 16-byte aligned, which I think is the cause of the crash. The target here should be v, which is the only data member of vboolf: union { __m128 v; int i[4]; }; // data
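To make the alignment argument concrete, here is a minimal sketch that checks the reported RAX value against the 16-byte alignment that movaps requires. The names is_aligned, is_aligned_ptr, kReportedRax and Mask4 are hypothetical and not part of Embree; Mask4 is only a portable stand-in for a type holding an __m128. As far as I know, a misaligned movaps raises a general-protection fault rather than a page fault, and Windows commonly reports such faults as a read of 0xFFFFFFFFFFFFFFFF, which would explain the "read" wording.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if `address` is a multiple of `alignment`.
// Assumption: `alignment` is a power of two.
constexpr bool is_aligned(std::uint64_t address, std::uint64_t alignment) {
    return (address & (alignment - 1)) == 0;
}

// Value of RAX shown by the debugger in this report.
constexpr std::uint64_t kReportedRax = 0x0000007D335F1268ULL;

// The address ends in 0x8: it is 8-byte aligned but not 16-byte aligned,
// so it is not a valid memory operand for movaps.
static_assert(is_aligned(kReportedRax, 8), "reported address is 8-byte aligned");
static_assert(!is_aligned(kReportedRax, 16), "reported address is not 16-byte aligned");

// Hypothetical stand-in for a 4-lane mask type. The real vboolf<4> stores
// an __m128; alignas(16) imitates the alignment that type demands.
struct alignas(16) Mask4 {
    float lanes[4];
};
static_assert(alignof(Mask4) == 16, "stand-in type requires 16-byte alignment");

// Runtime variant of the check for an arbitrary object pointer.
inline bool is_aligned_ptr(const void* p, std::size_t alignment) {
    return is_aligned(reinterpret_cast<std::uintptr_t>(p), alignment);
}

// Example guard: a correctly placed stack object satisfies its alignment.
inline void check_stack_mask_alignment() {
    Mask4 m{};
    assert(is_aligned_ptr(&m, alignof(Mask4)));
}
```

A guard like is_aligned_ptr(this, 16) placed in a debug-only assertion would point at the misaligned object earlier than the faulting store does.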

Machines without AVX512 support do not show this problem. Setting EMBREE_MAX_ISA to AVX2 on an AVX512 machine is also sufficient to work around the issue.

I am not sure why this happens, but it would be nice to have a working debug build. Can you fix this?

svenwoop commented 2 years ago

You are using Win32, which has known issues with 16-byte alignment. We currently only test and support SSE2 with Win32, and we are planning to remove 32-bit support completely anyway. Please build a 64-bit binary by using the "Visual Studio 16 2019 Win64" generator.

pborsutzki commented 2 years ago

Sorry, but I am not using Win32 here. It is Win64: I am on an x64 host, and the Visual Studio 16 2019 cmake generator targets x64 on x64 hosts by default. There is no actual "Visual Studio 16 2019 Win64" generator; according to the cmake docs, you have to use the Visual Studio 16 2019 generator and pass -A x64 to be explicit.

You can easily see that this build is using x64 from the 64-bit addresses printed in the error messages, the used registers with the r prefix or the 64-bit jump address in the assembly.
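The address argument can be checked mechanically. The sketch below, using the hypothetical helpers exceeds_32_bits, pointer_bits and require_64bit_build, confirms that both addresses quoted in this thread cannot be represented in a 32-bit pointer, and shows a guard one could compile into the build to verify the pointer width directly.

```cpp
#include <cassert>
#include <cstdint>

// True if `address` cannot be represented in a 32-bit pointer.
constexpr bool exceeds_32_bits(std::uint64_t address) {
    return address > 0xFFFFFFFFULL;
}

// Pointer width of the current build in bits (64 for an x64 target).
constexpr unsigned pointer_bits() {
    return static_cast<unsigned>(sizeof(void*) * 8);
}

// Faulting instruction address and RAX value quoted in this thread.
static_assert(exceeds_32_bits(0x00007FFB1A7D8390ULL), "instruction address needs more than 32 bits");
static_assert(exceeds_32_bits(0x0000007D335F1268ULL), "RAX value needs more than 32 bits");

// Optional runtime guard; it only passes when compiled for a 64-bit target.
inline void require_64bit_build() {
    assert(pointer_bits() == 64 && "expected a 64-bit build");
}
```

Neither address fits in 32 bits, which is consistent with an x64 build.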

So please have another look. Thanks!

svenwoop commented 2 years ago

I also see problems with Debug build and AVX512 in our CI. We will look into this.