RenderKit / embree

Embree ray tracing kernels repository.
Apache License 2.0

SSE detection fails for CPUs not having XSAVE / XSTOR (v2.7.1) #59

Closed: lbarbieri closed this issue 8 years ago

lbarbieri commented 8 years ago

Hello,

I'm running Embree v2.7.1 on a CPU that supports SSE (SSE2, SSE3, SSE4.1, SSE4.2). However, I get the error message "CPU does not support SSE2", along with an exception that has this stack trace:

#0  rtcore_error (str=..., error=RTC_UNSUPPORTED_CPU, this=0x14dd390) at /home/lbarbieri/foliage3/external_src/embree-external/v2.7.1/src/EmbreeExt/kernels/common/rtcore.h:64
#1  embree::State::verify (this=0x105f408) at /home/lbarbieri/foliage3/external_src/embree-external/v2.7.1/src/EmbreeExt/kernels/common/state.cpp:93
#2  0x00007ffff50ace58 in embree::Device::Device (this=0x105f400, cfg=0x0, singledevice=true) at /home/lbarbieri/foliage3/external_src/embree-external/v2.7.1/src/EmbreeExt/kernels/common/device.cpp:92
#3  0x00007ffff50ca45e in embree::rtcInit (cfg=0x0) at /home/lbarbieri/foliage3/external_src/embree-external/v2.7.1/src/EmbreeExt/kernels/common/rtcore.cpp:61
#4  0x0000000000584f2e in main (argc=3, argv=0x7fffffffe1d8) at /home/lbarbieri/foliage3/apps/raytracer2/raytracer.cpp:3236

However, Embree previously ran fine on this exact hardware as of v2.6.2.

It looks like getCPUFeatures() requires OS-enabled XSAVE/XSTOR to be present before it detects any SSE support. That check fails on my processor (CPUID data given below), but I believe the requirement is unnecessary.

Just to test, I disabled the XSAVE/XSTOR requirement for SSE detection, and things seemed to run OK. Here's what I did:

*** v2.7.1/src/EmbreeExt/common/sys/sysinfo.cpp 2015-10-29 13:46:51.560237773 -0400
--- patch/src/EmbreeExt/common/sys/sysinfo.cpp  2015-10-29 13:01:26.343225827 -0400
***************
*** 250,261 ****
        zmm_enabled = ymm_enabled && ((xcr0 & 0xE0) == 0xE0); /* check if OPMASK state, upper 256-bit of ZMM0-ZMM15 and ZMM16-ZMM31 state are enabled in XCR0 */
      }

!     if (xmm_enabled && cpuid_leaf_1[EDX] & CPU_FEATURE_BIT_SSE   ) cpu_features |= CPU_FEATURE_SSE;
!     if (xmm_enabled && cpuid_leaf_1[EDX] & CPU_FEATURE_BIT_SSE2  ) cpu_features |= CPU_FEATURE_SSE2;
!     if (xmm_enabled && cpuid_leaf_1[ECX] & CPU_FEATURE_BIT_SSE3  ) cpu_features |= CPU_FEATURE_SSE3;
!     if (xmm_enabled && cpuid_leaf_1[ECX] & CPU_FEATURE_BIT_SSSE3 ) cpu_features |= CPU_FEATURE_SSSE3;
!     if (xmm_enabled && cpuid_leaf_1[ECX] & CPU_FEATURE_BIT_SSE4_1) cpu_features |= CPU_FEATURE_SSE41;
!     if (xmm_enabled && cpuid_leaf_1[ECX] & CPU_FEATURE_BIT_SSE4_2) cpu_features |= CPU_FEATURE_SSE42;
      if (               cpuid_leaf_1[ECX] & CPU_FEATURE_BIT_POPCNT) cpu_features |= CPU_FEATURE_POPCNT;
      if (ymm_enabled && cpuid_leaf_1[ECX] & CPU_FEATURE_BIT_AVX   ) cpu_features |= CPU_FEATURE_AVX;
      if (xmm_enabled && cpuid_leaf_1[ECX] & CPU_FEATURE_BIT_F16C  ) cpu_features |= CPU_FEATURE_F16C;
--- 250,261 ----
        zmm_enabled = ymm_enabled && ((xcr0 & 0xE0) == 0xE0); /* check if OPMASK state, upper 256-bit of ZMM0-ZMM15 and ZMM16-ZMM31 state are enabled in XCR0 */
      }

!     if (/*xmm_enabled && */cpuid_leaf_1[EDX] & CPU_FEATURE_BIT_SSE   ) cpu_features |= CPU_FEATURE_SSE;
!     if (/*xmm_enabled && */cpuid_leaf_1[EDX] & CPU_FEATURE_BIT_SSE2  ) cpu_features |= CPU_FEATURE_SSE2;
!     if (/*xmm_enabled && */cpuid_leaf_1[ECX] & CPU_FEATURE_BIT_SSE3  ) cpu_features |= CPU_FEATURE_SSE3;
!     if (/*xmm_enabled && */cpuid_leaf_1[ECX] & CPU_FEATURE_BIT_SSSE3 ) cpu_features |= CPU_FEATURE_SSSE3;
!     if (/*xmm_enabled && */cpuid_leaf_1[ECX] & CPU_FEATURE_BIT_SSE4_1) cpu_features |= CPU_FEATURE_SSE41;
!     if (/*xmm_enabled && */cpuid_leaf_1[ECX] & CPU_FEATURE_BIT_SSE4_2) cpu_features |= CPU_FEATURE_SSE42;
      if (               cpuid_leaf_1[ECX] & CPU_FEATURE_BIT_POPCNT) cpu_features |= CPU_FEATURE_POPCNT;
      if (ymm_enabled && cpuid_leaf_1[ECX] & CPU_FEATURE_BIT_AVX   ) cpu_features |= CPU_FEATURE_AVX;
      if (xmm_enabled && cpuid_leaf_1[ECX] & CPU_FEATURE_BIT_F16C  ) cpu_features |= CPU_FEATURE_F16C;

(I don't know if this is the right way to do this, just wanted to see if things would run.)

Any feedback on this issue would be appreciated.

Thanks for your time!
Lou

Platform information

The following describes the machine I'm running on.

Embree info

Please note that the SSE ISAs only show up because I hacked around this issue (as noted above) so that I could run.

Embree Ray Tracing Kernels 2.7.1 (Oct 30 2015)
  Compiler : GCC 4.9.2
  Platform : Linux (64bit)
  CPU      : Nehalem (GenuineIntel)
  ISA      : SSE SSE2 SSE3 SSSE3 SSE41 SSE42 POPCNT (SSE SSE2 SSE3 SSSE3 SSE41 SSE42 POPCNT )
  Threads  : 24
  MXCSR    : FTZ=1, DAZ=1
  Config   : Release SSE2 SSE4.2 AVX AVX2 internal_tasking_system intersection_filter bufferstride 

CPUID (raw flags)

Of interest is (at least) the line for 0x00000001.

CPU:
   0x00000000 0x00: eax=0x0000000b ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
   0x00000001 0x00: eax=0x000206c2 ebx=0x03200800 ecx=0x029ee3ff edx=0xbfebfbff
   0x00000002 0x00: eax=0x55035a01 ebx=0x00f0b2ff ecx=0x00000000 edx=0x00ca0000
   0x00000003 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x00000004 0x00: eax=0x3c004121 ebx=0x01c0003f ecx=0x0000003f edx=0x00000000
   0x00000004 0x01: eax=0x3c004122 ebx=0x00c0003f ecx=0x0000007f edx=0x00000000
   0x00000004 0x02: eax=0x3c004143 ebx=0x01c0003f ecx=0x000001ff edx=0x00000000
   0x00000004 0x03: eax=0x3c07c163 ebx=0x03c0003f ecx=0x00002fff edx=0x00000002
   0x00000005 0x00: eax=0x00000040 ebx=0x00000040 ecx=0x00000003 edx=0x00001120
   0x00000006 0x00: eax=0x00000007 ebx=0x00000002 ecx=0x00000009 edx=0x00000000
   0x00000007 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x00000008 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x00000009 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x0000000a 0x00: eax=0x07300403 ebx=0x00000004 ecx=0x00000000 edx=0x00000603
   0x0000000b 0x00: eax=0x00000001 ebx=0x00000002 ecx=0x00000100 edx=0x00000003
   0x0000000b 0x01: eax=0x00000005 ebx=0x0000000c ecx=0x00000201 edx=0x00000003
   0x80000000 0x00: eax=0x80000008 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000001 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000001 edx=0x2c100800
   0x80000002 0x00: eax=0x65746e49 ebx=0x2952286c ecx=0x6f655820 edx=0x2952286e
   0x80000003 0x00: eax=0x55504320 ebx=0x20202020 ecx=0x20202020 edx=0x58202020
   0x80000004 0x00: eax=0x30393635 ebx=0x20402020 ecx=0x37342e33 edx=0x007a4847
   0x80000005 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000006 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x01006040 edx=0x00000000
   0x80000007 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000100
   0x80000008 0x00: eax=0x00003028 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80860000 0x00: eax=0x00000001 ebx=0x00000002 ecx=0x00000100 edx=0x00000003
   0xc0000000 0x00: eax=0x00000001 ebx=0x00000002 ecx=0x00000100 edx=0x00000003
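
As a cross-check, the 0x00000001 line above can be decoded by hand. The following is a minimal sketch, assuming the CPUID leaf 1 bit positions documented in the Intel SDM; the identifiers (`kEdxSse2`, `hasBit`, and so on) are made up for illustration and are not Embree's. The register values are copied from the dump.

```cpp
#include <cstdint>

// Feature-bit masks for CPUID leaf 1 (bit positions per the Intel SDM).
// These identifiers are hypothetical; Embree's sysinfo.cpp uses its own names.
constexpr std::uint32_t kEdxSse     = 1u << 25;
constexpr std::uint32_t kEdxSse2    = 1u << 26;
constexpr std::uint32_t kEcxSse3    = 1u << 0;
constexpr std::uint32_t kEcxSsse3   = 1u << 9;
constexpr std::uint32_t kEcxSse41   = 1u << 19;
constexpr std::uint32_t kEcxSse42   = 1u << 20;
constexpr std::uint32_t kEcxPopcnt  = 1u << 23;
constexpr std::uint32_t kEcxXsave   = 1u << 26;
constexpr std::uint32_t kEcxOsxsave = 1u << 27;
constexpr std::uint32_t kEcxAvx     = 1u << 28;

// True if any bit of `mask` is set in `reg`.
constexpr bool hasBit(std::uint32_t reg, std::uint32_t mask) {
  return (reg & mask) != 0;
}

// ECX and EDX from the 0x00000001 line of the raw dump.
constexpr std::uint32_t kLeaf1Ecx = 0x029ee3ffu;
constexpr std::uint32_t kLeaf1Edx = 0xbfebfbffu;

// The hardware reports SSE2 and SSE4.2 ...
static_assert(hasBit(kLeaf1Edx, kEdxSse2),  "SSE2 bit is set");
static_assert(hasBit(kLeaf1Ecx, kEcxSse42), "SSE4.2 bit is set");
// ... but neither XSAVE, OSXSAVE, nor AVX.
static_assert(!hasBit(kLeaf1Ecx, kEcxXsave),   "XSAVE bit is clear");
static_assert(!hasBit(kLeaf1Ecx, kEcxOsxsave), "OSXSAVE bit is clear");
static_assert(!hasBit(kLeaf1Ecx, kEcxAvx),     "AVX bit is clear");
```

The checks run at compile time, so the file only builds if the raw flags really do say "SSE present, XSAVE absent", which is the combination that triggers the failure described in this report.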

CPUID

Of interest: all SSE features are supported, but neither "OS-enabled XSAVE/XSTOR" nor "XSAVE/XSTOR states" is.

CPU:
   vendor_id = "GenuineIntel"
   version information (1/eax):
      processor type  = primary processor (0)
      family          = Intel Pentium Pro/II/III/Celeron/Core/Core 2/Atom, AMD Athlon/Duron, Cyrix M2, VIA C3 (6)
      model           = 0xc (12)
      stepping id     = 0x2 (2)
      extended family = 0x0 (0)
      extended model  = 0x2 (2)
      (simple synth)  = Intel Core i7-900 (Gulftown B1) / Core i7-980X (Gulftown B1) / Xeon Processor 3600 (Westmere-EP B1) / Xeon Processor 5600 (Westmere-EP B1), 32nm
   miscellaneous (1/ebx):
      process local APIC physical ID = 0x3 (3)
      cpu count                      = 0x20 (32)
      CLFLUSH line size              = 0x8 (8)
      brand index                    = 0x0 (0)
   brand id = 0x00 (0): unknown
   feature information (1/edx):
      x87 FPU on chip                        = true
      virtual-8086 mode enhancement          = true
      debugging extensions                   = true
      page size extensions                   = true
      time stamp counter                     = true
      RDMSR and WRMSR support                = true
      physical address extensions            = true
      machine check exception                = true
      CMPXCHG8B inst.                        = true
      APIC on chip                           = true
      SYSENTER and SYSEXIT                   = true
      memory type range registers            = true
      PTE global bit                         = true
      machine check architecture             = true
      conditional move/compare instruction   = true
      page attribute table                   = true
      page size extension                    = true
      processor serial number                = false
      CLFLUSH instruction                    = true
      debug store                            = true
      thermal monitor and clock ctrl         = true
      MMX Technology                         = true
      FXSAVE/FXRSTOR                         = true
      SSE extensions                         = true
      SSE2 extensions                        = true
      self snoop                             = true
      hyper-threading / multi-core supported = true
      therm. monitor                         = true
      IA64                                   = false
      pending break event                    = true
   feature information (1/ecx):
      PNI/SSE3: Prescott New Instructions     = true
      PCLMULDQ instruction                    = true
      64-bit debug store                      = true
      MONITOR/MWAIT                           = true
      CPL-qualified debug store               = true
      VMX: virtual machine extensions         = true
      SMX: safer mode extensions              = true
      Enhanced Intel SpeedStep Technology     = true
      thermal monitor 2                       = true
      SSSE3 extensions                        = true
      context ID: adaptive or shared L1 data  = false
      FMA instruction                         = false
      CMPXCHG16B instruction                  = true
      xTPR disable                            = true
      perfmon and debug                       = true
      process context identifiers             = true
      direct cache access                     = true
      SSE4.1 extensions                       = true
      SSE4.2 extensions                       = true
      extended xAPIC support                  = false
      MOVBE instruction                       = false
      POPCNT instruction                      = true
      time stamp counter deadline             = false
      AES instruction                         = true
      XSAVE/XSTOR states                      = false
      OS-enabled XSAVE/XSTOR                  = false
      AVX: advanced vector extensions         = false
      F16C half-precision convert instruction = false
      RDRAND instruction                      = false
      hypervisor guest status                 = false
   cache and TLB information (2):
      0x5a: data TLB: 2M/4M pages, 4-way, 32 entries
      0x03: data TLB: 4K pages, 4-way, 64 entries
      0x55: instruction TLB: 2M/4M pages, fully, 7 entries
      0xff: cache data is in CPUID 4
      0xb2: instruction TLB: 4K, 4-way, 64 entries
      0xf0: 64 byte prefetching
      0xca: L2 TLB: 4K, 4-way, 512 entries
   processor serial number: 0002-06C2-0000-0000-0000-0000
   deterministic cache parameters (4):
      --- cache 0 ---
      cache type                           = data cache (1)
      cache level                          = 0x1 (1)
      self-initializing cache level        = true
      fully associative cache              = false
      extra threads sharing this cache     = 0x1 (1)
      extra processor cores on this die    = 0xf (15)
      system coherency line size           = 0x3f (63)
      physical line partitions             = 0x0 (0)
      ways of associativity                = 0x7 (7)
      WBINVD/INVD behavior on lower caches = false
      inclusive to lower caches            = false
      complex cache indexing               = false
      number of sets - 1 (s)               = 63
      --- cache 1 ---
      cache type                           = instruction cache (2)
      cache level                          = 0x1 (1)
      self-initializing cache level        = true
      fully associative cache              = false
      extra threads sharing this cache     = 0x1 (1)
      extra processor cores on this die    = 0xf (15)
      system coherency line size           = 0x3f (63)
      physical line partitions             = 0x0 (0)
      ways of associativity                = 0x3 (3)
      WBINVD/INVD behavior on lower caches = false
      inclusive to lower caches            = false
      complex cache indexing               = false
      number of sets - 1 (s)               = 127
      --- cache 2 ---
      cache type                           = unified cache (3)
      cache level                          = 0x2 (2)
      self-initializing cache level        = true
      fully associative cache              = false
      extra threads sharing this cache     = 0x1 (1)
      extra processor cores on this die    = 0xf (15)
      system coherency line size           = 0x3f (63)
      physical line partitions             = 0x0 (0)
      ways of associativity                = 0x7 (7)
      WBINVD/INVD behavior on lower caches = false
      inclusive to lower caches            = false
      complex cache indexing               = false
      number of sets - 1 (s)               = 511
      --- cache 3 ---
      cache type                           = unified cache (3)
      cache level                          = 0x3 (3)
      self-initializing cache level        = true
      fully associative cache              = false
      extra threads sharing this cache     = 0x1f (31)
      extra processor cores on this die    = 0xf (15)
      system coherency line size           = 0x3f (63)
      physical line partitions             = 0x0 (0)
      ways of associativity                = 0xf (15)
      WBINVD/INVD behavior on lower caches = false
      inclusive to lower caches            = true
      complex cache indexing               = false
      number of sets - 1 (s)               = 12287
   MONITOR/MWAIT (5):
      smallest monitor-line size (bytes)       = 0x40 (64)
      largest monitor-line size (bytes)        = 0x40 (64)
      enum of Monitor-MWAIT exts supported     = true
      supports intrs as break-event for MWAIT  = true
      number of C0 sub C-states using MWAIT    = 0x0 (0)
      number of C1 sub C-states using MWAIT    = 0x2 (2)
      number of C2 sub C-states using MWAIT    = 0x1 (1)
      number of C3 sub C-states using MWAIT    = 0x1 (1)
      number of C4 sub C-states using MWAIT    = 0x0 (0)
      number of C5 sub C-states using MWAIT    = 0x0 (0)
      number of C6 sub C-states using MWAIT    = 0x0 (0)
      number of C7 sub C-states using MWAIT    = 0x0 (0)
   Thermal and Power Management Features (6):
      digital thermometer                     = true
      Intel Turbo Boost Technology            = true
      ARAT always running APIC timer          = true
      PLN power limit notification            = false
      ECMD extended clock modulation duty     = false
      PTM package thermal management          = false
      digital thermometer thresholds          = 0x2 (2)
      ACNT/MCNT supported performance measure = true
      ACNT2 available                         = false
      performance-energy bias capability      = true
   extended feature flags (7):
      FSGSBASE instructions                    = false
      IA32_TSC_ADJUST MSR supported            = false
      BMI instruction                          = false
      HLE hardware lock elision                = false
      AVX2: advanced vector extensions 2       = false
      SMEP supervisor mode exec protection     = false
      BMI2 instructions                        = false
      enhanced REP MOVSB/STOSB                 = false
      INVPCID instruction                      = false
      RTM: restricted transactional memory     = false
      QM: quality of service monitoring        = false
      deprecated FPU CS/DS                     = false
      intel memory protection extensions       = false
      AVX512F: AVX-512 foundation instructions = false
      RDSEED instruction                       = false
      ADX instructions                         = false
      SMAP: supervisor mode access prevention  = false
      Intel processor trace                    = false
      AVX512PF: prefetch instructions          = false
      AVX512ER: exponent & reciprocal instrs   = false
      AVX512CD: conflict detection instrs      = false
      SHA instructions                         = false
      PREFETCHWT1                              = false
   Direct Cache Access Parameters (9):
      PLATFORM_DCA_CAP MSR bits = 0
   Architecture Performance Monitoring Features (0xa/eax):
      version ID                               = 0x3 (3)
      number of counters per logical processor = 0x4 (4)
      bit width of counter                     = 0x30 (48)
      length of EBX bit vector                 = 0x7 (7)
   Architecture Performance Monitoring Features (0xa/ebx):
      core cycle event not available           = false
      instruction retired event not available  = false
      reference cycles event not available     = true
      last-level cache ref event not available = false
      last-level cache miss event not avail    = false
      branch inst retired event not available  = false
      branch mispred retired event not avail   = false
   Architecture Performance Monitoring Features (0xa/edx):
      number of fixed counters    = 0x3 (3)
      bit width of fixed counters = 0x30 (48)
   x2APIC features / processor topology (0xb):
      --- level 0 (thread) ---
      bits to shift APIC ID to get next = 0x1 (1)
      logical processors at this level  = 0x2 (2)
      level number                      = 0x0 (0)
      level type                        = thread (1)
      extended APIC ID                  = 3
      --- level 1 (core) ---
      bits to shift APIC ID to get next = 0x5 (5)
      logical processors at this level  = 0xc (12)
      level number                      = 0x1 (1)
      level type                        = core (2)
      extended APIC ID                  = 3
   extended feature flags (0x80000001/edx):
      SYSCALL and SYSRET instructions        = true
      execution disable                      = true
      1-GB large page support                = true
      RDTSCP                                 = true
      64-bit extensions technology available = true
   Intel feature flags (0x80000001/ecx):
      LAHF/SAHF supported in 64-bit mode     = true
      LZCNT advanced bit manipulation        = false
      3DNow! PREFETCH/PREFETCHW instructions = false
   brand = "Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz"
   L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
      instruction # entries     = 0x0 (0)
      instruction associativity = 0x0 (0)
      data # entries            = 0x0 (0)
      data associativity        = 0x0 (0)
   L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
      instruction # entries     = 0x0 (0)
      instruction associativity = 0x0 (0)
      data # entries            = 0x0 (0)
      data associativity        = 0x0 (0)
   L1 data cache information (0x80000005/ecx):
      line size (bytes) = 0x0 (0)
      lines per tag     = 0x0 (0)
      associativity     = 0x0 (0)
      size (Kb)         = 0x0 (0)
   L1 instruction cache information (0x80000005/edx):
      line size (bytes) = 0x0 (0)
      lines per tag     = 0x0 (0)
      associativity     = 0x0 (0)
      size (Kb)         = 0x0 (0)
   L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
      instruction # entries     = 0x0 (0)
      instruction associativity = L2 off (0)
      data # entries            = 0x0 (0)
      data associativity        = L2 off (0)
   L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
      instruction # entries     = 0x0 (0)
      instruction associativity = L2 off (0)
      data # entries            = 0x0 (0)
      data associativity        = L2 off (0)
   L2 unified cache information (0x80000006/ecx):
      line size (bytes) = 0x40 (64)
      lines per tag     = 0x0 (0)
      associativity     = 8-way (6)
      size (Kb)         = 0x100 (256)
   L3 cache information (0x80000006/edx):
      line size (bytes)     = 0x0 (0)
      lines per tag         = 0x0 (0)
      associativity         = L2 off (0)
      size (in 512Kb units) = 0x0 (0)
   Advanced Power Management Features (0x80000007/edx):
      temperature sensing diode      = false
      frequency ID (FID) control     = false
      voltage ID (VID) control       = false
      thermal trip (TTP)             = false
      thermal monitor (TM)           = false
      software thermal control (STC) = false
      100 MHz multiplier control     = false
      hardware P-State control       = false
      TscInvariant                   = true
   Physical Address and Linear Address Size (0x80000008/eax):
      maximum physical address bits         = 0x28 (40)
      maximum linear (virtual) address bits = 0x30 (48)
      maximum guest physical address bits   = 0x0 (0)
   Logical CPU cores (0x80000008/ecx):
      number of CPU cores - 1 = 0x0 (0)
      ApicIdCoreIdSize        = 0x0 (0)
   (multi-processing synth): multi-core (c=6), hyper-threaded (t=2)
   (multi-processing method): Intel leaf 0xb
   (APIC widths synth): CORE_width=5 SMT_width=1
   (APIC synth): PKG_ID=0 CORE_ID=1 SMT_ID=1
   (synth) = Intel Xeon Processor 3600 (Westmere-EP B1) / Xeon Processor 5600 (Westmere-EP B1), 32nm

svenwoop commented 8 years ago

OK, this is a bug. We should assume xmm_enabled=true when the XSAVE feature is not present.

Could you please verify that it also works on your machine when you initialize "bool xmm_enabled=true;" at its declaration?
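
The intended logic can be sketched roughly as follows. This illustrates the rule described in this comment and is not the actual Embree code: all identifiers are hypothetical, and the XCR0 value is passed in as a parameter rather than read with XGETBV. It assumes the usual Intel SDM layout (CPUID.1:ECX bit 27 = OSXSAVE, XCR0 bit 1 = SSE state, XCR0 bit 2 = AVX state).

```cpp
#include <cstdint>

// Hypothetical names for this sketch (not Embree's).
constexpr std::uint32_t kEcxOsxsave = 1u << 27; // CPUID.1:ECX, OS enabled XSAVE
constexpr std::uint32_t kEcxAvx     = 1u << 28; // CPUID.1:ECX, AVX
constexpr std::uint32_t kEdxSse2    = 1u << 26; // CPUID.1:EDX, SSE2
constexpr std::uint64_t kXcr0Sse    = 1u << 1;  // XCR0, XMM state enabled
constexpr std::uint64_t kXcr0Avx    = 1u << 2;  // XCR0, YMM state enabled

// XMM state: when OSXSAVE is clear, XCR0 cannot be queried, so default to
// "enabled" (the xmm_enabled=true default suggested above). Only when
// OSXSAVE is set does XCR0 decide.
constexpr bool xmmEnabled(std::uint32_t leaf1Ecx, std::uint64_t xcr0) {
  return (leaf1Ecx & kEcxOsxsave) == 0 || (xcr0 & kXcr0Sse) != 0;
}

// YMM state: there is no safe default, so require OSXSAVE plus both the
// SSE and AVX state bits in XCR0.
constexpr bool ymmEnabled(std::uint32_t leaf1Ecx, std::uint64_t xcr0) {
  return (leaf1Ecx & kEcxOsxsave) != 0 &&
         (xcr0 & (kXcr0Sse | kXcr0Avx)) == (kXcr0Sse | kXcr0Avx);
}

// SSE2 is usable if the CPU reports it and XMM state is (assumed) enabled.
constexpr bool sse2Usable(std::uint32_t leaf1Ecx, std::uint32_t leaf1Edx,
                          std::uint64_t xcr0) {
  return xmmEnabled(leaf1Ecx, xcr0) && (leaf1Edx & kEdxSse2) != 0;
}

// AVX is usable only if the CPU reports it and the OS enabled YMM state.
constexpr bool avxUsable(std::uint32_t leaf1Ecx, std::uint64_t xcr0) {
  return ymmEnabled(leaf1Ecx, xcr0) && (leaf1Ecx & kEcxAvx) != 0;
}
```

With the leaf 1 values from the report (ecx=0x029ee3ff, edx=0xbfebfbff), `sse2Usable` returns true and `avxUsable` returns false, because OSXSAVE is clear. The asymmetry is deliberate: SSE falls back to "enabled", while AVX is never assumed without confirmation from XCR0.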

lbarbieri commented 8 years ago

Hi Sven - thanks for your response.

Yes - making that change worked.

svenwoop commented 8 years ago

Thanks for reporting the issue. I also verified that it works on an older machine. I updated the v2.7.1 release with that fix, as the issue is quite serious. Just do a git pull and you should get the updated v2.7.1 tag.

lbarbieri commented 8 years ago

Thanks very much!