cornelisnetworks / opa-psm2


initialization failure when running with gprof #28

Closed lee218llnl closed 1 year ago

lee218llnl commented 6 years ago

When running a code compiled with -pg, I see a lot of these messages:

opal3.94176psmi_context_open: hfi_userinit: failed, trying again (3/3)
opal2.184559hfi_userinit: assign_context command failed: Interrupted system call
opal2.184559PSM2 can't open hfi unit: -1 (err=23)

Is there any way to make the PSM2 library more tolerant to interrupts?

To complicate matters, this behaviour also appears to depend on scale and OS: it occurs more frequently at larger scales. I also tried increasing the max retry count. On one of our clusters I was able to run with a max retry count of 10, but on another it still failed even with the count raised to 1000. If it helps, here are the OS versions (it works with an increased retry count on opal and fails on quartz):

[lee218@quartz2300:opa-psm2]$ uname -a
Linux quartz2300 3.10.0-693.17.1.1chaos.ch6.x86_64 #1 SMP Fri Jan 26 13:23:01 PST 2018 x86_64 x86_64 x86_64 GNU/Linux

[lee218@opal186:opa-psm2]$ uname -a
Linux opal186 3.10.0-862.2.3.1chaos.ch6.x86_64 #1 SMP Wed May 9 18:12:50 PDT 2018 x86_64 x86_64 x86_64 GNU/Linux

rwmcguir commented 6 years ago

You point out a critical piece here about interrupts. The profiling interrupt is telling the driver to exit resource allocation while it is attempting to carve out resources on a PCIe device, which is not fast. The fact that we had to implement retries at all suggests this is something the driver needs to fix, potentially by disabling interrupts during a large section of the resource allocation. PSM2 can't really influence interrupts here, nor can it change how long an IOCTL takes. We also tried a higher retry count, and you can now see why we limited it to 3: higher values give little payback. A pause (timeout) between calls might help, but from a scalability perspective this may not be good. Could you try pinning the process to a different core? That may limit interrupts from other kernel sources, though I'm not sure how much of that is happening. I have not looked deeply, but perhaps we can find a pragma to put around this call so the profiler does not interrupt it, or slow down the profiling interrupt rate to give the IOCTL time to complete.
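As a rough illustration of the "pause between calls" idea, here is a minimal sketch of a retry loop that waits briefly after an attempt fails with EINTR. It is not PSM2 code: `attempt_fn`, `retry_on_eintr`, and `demo_flaky_attempt` are hypothetical names, and the callback merely stands in for whatever call is being retried. It assumes the attempt returns 0 on success and sets `errno` on failure.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: one attempt, returns 0 on success, -1 with errno set. */
typedef int (*attempt_fn)(void *arg);

/* Retry `attempt` up to max_tries times, but only while it fails with EINTR.
 * Sleeps delay_ns nanoseconds between attempts (no sleep after the last one). */
int retry_on_eintr(attempt_fn attempt, void *arg, int max_tries, long delay_ns)
{
    int rc = -1;
    int i;

    assert(attempt != NULL);

    for (i = 0; i < max_tries; i++) {
        errno = 0;
        rc = attempt(arg);
        if (rc == 0 || errno != EINTR)
            break;                      /* success, or a failure retrying won't fix */

        if (i + 1 < max_tries && delay_ns > 0) {
            struct timespec ts;
            ts.tv_sec = delay_ns / 1000000000L;
            ts.tv_nsec = delay_ns % 1000000000L;
            /* The sleep can itself be cut short by a signal; nanosleep writes
             * the remaining time back into ts, so keep sleeping until done. */
            while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
                ;
        }
    }
    return rc;
}

/* Stand-in attempt for demonstration only: fails with EINTR until the
 * counter pointed to by arg reaches zero, then succeeds. */
int demo_flaky_attempt(void *arg)
{
    int *failures_left = arg;
    if (*failures_left > 0) {
        (*failures_left)--;
        errno = EINTR;
        return -1;
    }
    return 0;
}
```

With a profiling timer firing at a fixed rate, a pause does not by itself guarantee that the next attempt completes uninterrupted, and every rank pays the delay, which is the scalability concern raised above.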

For completeness, can you post the versions of libpsm2 and the hfi1 driver used in these cases?

lee218llnl commented 6 years ago

I cloned the opa-psm2 git repo. This is commit 0f9213e7af8d32c291d4657ff4a3279918de1e60.

[lee218@opal186:tests]$ rpm -qa | grep hfi1
hfi1-firmware-0.9-46.1.ch6.x86_64

lee218llnl commented 6 years ago

This may be a hack, but I found that if I block SIGPROF before the retry loop and unblock it afterwards (similar to http://www.linuxprogrammingblog.com/code-examples/blocking-signals-with-sigprocmask, but without adding a signal handler for SIGPROF), then I can run OK with gprof. I'm not sure if this is something you want in the production code, but at least it has me moving forward in the meantime.
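A minimal sketch of this kind of workaround, for illustration only (it is not the actual change made to opa-psm2). The names `open_attempt_fn`, `open_with_sigprof_blocked`, `sigprof_is_blocked`, and `demo_attempt` are hypothetical, and the callback stands in for one pass of the retry loop, assumed to return 0 on success. Only standard POSIX signal calls are used.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical callback type: one open attempt, returns 0 on success. */
typedef int (*open_attempt_fn)(void *arg);

/* Run the retry loop with SIGPROF blocked, then restore the previous mask.
 * A SIGPROF raised while blocked stays pending and is delivered on restore. */
int open_with_sigprof_blocked(open_attempt_fn attempt, void *arg, int max_retries)
{
    sigset_t block_set, saved_set;
    int rc = -1;
    int saved_errno;
    int i;

    assert(attempt != NULL);

    sigemptyset(&block_set);
    sigaddset(&block_set, SIGPROF);

    /* Block SIGPROF and remember the caller's original signal mask. */
    if (sigprocmask(SIG_BLOCK, &block_set, &saved_set) != 0)
        return -1;

    for (i = 0; i < max_retries; i++) {
        rc = attempt(arg);
        if (rc == 0)
            break;
    }

    /* Restore the original mask without clobbering errno from the attempt. */
    saved_errno = errno;
    sigprocmask(SIG_SETMASK, &saved_set, NULL);
    errno = saved_errno;
    return rc;
}

/* Returns 1 if SIGPROF is currently blocked in the calling thread, else 0. */
int sigprof_is_blocked(void)
{
    sigset_t current;
    sigprocmask(SIG_BLOCK, NULL, &current);   /* NULL set: query only */
    return sigismember(&current, SIGPROF);
}

/* Stand-in attempt for demonstration only: records whether SIGPROF was
 * blocked while it ran, then reports success. */
int demo_attempt(void *arg)
{
    *(int *)arg = sigprof_is_blocked();
    return 0;
}
```

Restoring the saved mask, rather than unconditionally unblocking SIGPROF, leaves things as they were if the caller had already blocked the signal. In a multithreaded process, `pthread_sigmask` is the portable call for changing the calling thread's mask; `sigprocmask` is used here because it is what the linked example uses.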

ddalessa commented 1 year ago

This has been lingering for 4+ years now. If it's still an issue please let us know. This issue seems to be inherited from when we were still part of Intel.