RenderKit / ospray

An Open, Scalable, Portable, Ray Tracing Based Rendering Engine for High-Fidelity Visualization
http://ospray.org
Apache License 2.0

Spontaneous Reboots - Running ospTutorialBouncingSpheres.exe #383

Closed Micros-DJB closed 4 years ago

Micros-DJB commented 4 years ago

We have an Intel® Core™ i5-8365UE (Whiskey Lake) Processor system board that will spontaneously reboot when running Intel OSPRay (“ospTutorialBouncingSpheres.exe”). It can take a pretty long time to fail - typically over 12 hours, sometimes as long as a couple of days.

Windows 10 Enterprise LTSC 2019 Event Logger reports an Event 41: (The system has rebooted without cleanly shutting down first.)

The same board also fails in this manner with the Whiskey Lake i3 and i7 CPUs (Intel® Core™ i3-8145UE Processor, Intel® Core™ i7-8665UE Processor).

The Celeron CPUs from the same Whiskey Lake processor family do not seem to be affected. (Intel® Celeron® Processor 4305UE)

We do not see any problems with some other processors that we test similarly. Those processors which do not fail in this fashion so far include Broadwell i5 (i5-5350U), Apollo Lake Atom (x5-E3930), and another variant of the Whiskey Lake family (i5-8265U).

So to this point, failures (spontaneous reboots) are only occurring with the Whiskey Lake Core Embedded family, i3-8145UE, i5-8365UE, and i7-8665UE (excluding the Celeron 4305UE mentioned above).

I have submitted a case at Intel Premier Support (IPS), and the reply is: “OSPRay is under active development and has its own Support path though https://github.com/ospray/OSPRay/issues and not IPS. Please use the github site to report your issue to the OSPRay development team.”

After I provided the IPS team with a bit more information, the follow up reply is still that I should try to “work with the OSPRay team.” The IPS engineer stated further that “If there is a CPU issue the OSPRay team will work thru our internal channels to resolve any issues. If you wish, please provide a tracking number from the OSPRay team and we will record and track their work if you provide us updates. Unfortunately, we don’t have visibility into their development system.”

I can advise that since our last input to IPS, we have updated our BIOS with the latest microcode and also updated our Windows 10 with the latest Microsoft updates. Several systems are now in test; some have been running for over 24 hours so far, and some for more than 48 hours. Maybe this is positive.

Anyways, any advice, suggestions or comments on this are appreciated.

Thanks.

jeffamstutz commented 4 years ago

I'm not sure what to do about this: we don't do anything special wrt what CPU is being used. We only activate code paths based on what ISA features are available (e.g. AVX/AVX2/AVX512), so I don't think there's anything we can do here. The app you are using (ospTutorialBouncingSpheres) was intended as a sample way of using the OSPRay API, as OSPRay is a library intended for application developers who want to embed interactive 3D rendering in their application. I would also imagine that 12+ hour rendering jobs (i.e. running our tutorial for that long) are better suited for Xeon-class server CPUs (w/ ECC memory, for example), but I am not an expert on those kinds of details.
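For readers unfamiliar with feature-based dispatch, here is a minimal C++ sketch of the general idea described above: the code path is chosen from ISA feature flags only, not from the CPU model. This is not OSPRay's actual code. The names `Isa`, `selectIsa`, `detectIsa`, and `isaName` are hypothetical, and the detection step assumes a GCC or Clang compiler on x86, where the `__builtin_cpu_supports` builtin is available.

```cpp
// Hypothetical ISA levels for illustration; these names are not OSPRay's.
enum class Isa { Scalar, AVX, AVX2, AVX512 };

// Pick the widest code path whose required features are all present.
// The decision depends only on feature flags, never on the CPU model name.
constexpr Isa selectIsa(bool hasAvx, bool hasAvx2, bool hasAvx512f)
{
  if (hasAvx && hasAvx2 && hasAvx512f) return Isa::AVX512;
  if (hasAvx && hasAvx2) return Isa::AVX2;
  if (hasAvx) return Isa::AVX;
  return Isa::Scalar;
}

// Query the running CPU. __builtin_cpu_supports is a GCC/Clang builtin for
// x86 targets; on other compilers this sketch falls back to the scalar path.
inline Isa detectIsa()
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  return selectIsa(__builtin_cpu_supports("avx"),
                   __builtin_cpu_supports("avx2"),
                   __builtin_cpu_supports("avx512f"));
#else
  return Isa::Scalar;
#endif
}

// Human-readable name for logging which path was selected.
inline const char *isaName(Isa isa)
{
  switch (isa) {
  case Isa::AVX512: return "AVX512";
  case Isa::AVX2:   return "AVX2";
  case Isa::AVX:    return "AVX";
  default:          return "Scalar";
  }
}
```

Under a scheme like this, two CPUs that report the same feature flags (for example an i5-8265U and an i5-8365UE) would take the same code path, which is consistent with the point that the library itself does not distinguish between them.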

johguenther commented 4 years ago

Just an idea: it could be a hardware issue, such as an insufficient power supply (triggered by heavy CPU use) or faulty memory. You could try https://en.wikipedia.org/wiki/Prime95#Use_for_stress_testing or https://www.memtest86.com/ to double check.

Micros-DJB commented 4 years ago

To jeffamstutz - thanks for the insights into what can or can't be done within this app/group. I was not really expecting that there would be much that could be done, but was following up on the Intel Premier Support group's suggestion.

We did in fact find a glitch in the IMVP8 power supply when operating under extreme switching conditions, and made adjustments to the power supply chipset compensation network which solved the spontaneous reboot. So the comments by johguenther were definitely in the ballpark.

Sorry for not updating sooner, and thanks for considering the issue and for the suggestions provided.