clangupc / clang-upc

Clang UPC Front-End
https://clangupc.github.io/

Runtime timeouts w/ over-commit on BSD systems #62

Open PHHargrove opened 10 years ago

PHHargrove commented 10 years ago

I have clang-upc tests running twice per week on NetBSD-6.1 on both amd64 and x86 targets. On each platform I test both the libupc runtime and upcr. Nenad has access to these systems.

A debug build of the Berkeley test suite runs in approximately 2hr 15min with upcr while the libupc build is killed by the test harness script after 15hr. Other than the compiler.spec and paths, the arguments to the harness are identical.

You can see the "clean" results of cupc+upcr at: https://upc-bugs.lbl.gov/upc_tests/chistory.php?date=2014-07-08&start_date=2014-07-01&cr=run&config=netbsd6-amd64-cupc-upcr&optdbg=dbg

And the incomplete results for cupc+libupc at: https://upc-bugs.lbl.gov/upc_tests/chistory.php?date=2014-07-08&start_date=2014-07-01&cr=run&config=netbsd6-amd64-cupc&optdbg=dbg

One can replace "amd64" with "i386" in those URLs to see the x86 target results.

Based on the specific tests I see failing with timeouts, I'd guess the most likely problem is some issue in the collectives.

These are runs with 4 UPC threads on 2 CPU cores, which might be a contributing factor.

I also have testers on FreeBSD where I am not seeing these problems. However, those testers are running 2 UPC threads on 2 CPU cores and are therefore not directly comparable. I am trying to gather comparable results now.

PHHargrove commented 10 years ago

I've not waited through the whole suite, but picking one or two of the tests which time out on NetBSD and trying equivalent runs (4 UPC threads on 2 CPU cores) shows the same issue on FreeBSD. Again, there is no problem with the Berkeley runtime, only with clang-upc's libupc.

I will try Linux with a 2-CPU VM as time allows, and Mac OS X when I locate an appropriate system.

PHHargrove commented 10 years ago

I have tested 2-core Solaris and Mac OS X systems, where the problem does not occur.

I have so far been unable to build clang on the 2-core Linux VM I set up (OOM while linking clang), and need to start over with a release build instead of a debug build.

PHHargrove commented 10 years ago

Now have results showing that no problem exists on Linux.

PHHargrove commented 10 years ago

I am now slightly less certain about some of my data.

I stand by the fact that on the original NetBSD platform there is a HUGE difference between "cupc" and "cupc-upcr". The following shows that cupc is at least 169 times slower, if not just plain hung:

$ grep try_reduce_nc_st cupc*/runtime/log/harness/run.rpt 
cupc-upcr/runtime/log/harness/run.rpt:[collectives/try_reduce_nc_st04]   8sec  20140704_115243  SUCCESS
cupc/runtime/log/harness/run.rpt:[collectives/try_reduce_nc_st04]   1353sec  20140707_123838  FAILED (TIME/NEW)
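
The "at least 169 times" figure is just the ratio of the two elapsed times in that output (1353 s vs. 8 s). As a rough sketch of that arithmetic, assuming the `run.rpt` line layout matches the two lines shown above; the names `parse_line` and `slowdown` are hypothetical helpers for illustration, not part of the test harness:

```python
import re

# Assumed layout, taken from the grep output above:
#   <config>/.../run.rpt:[suite/test]   <N>sec  <timestamp>  <STATUS...>
LINE_RE = re.compile(
    r"^(?P<path>[^:]+):\[(?P<test>[^\]]+)\]\s+"
    r"(?P<secs>\d+)sec\s+(?P<stamp>\S+)\s+(?P<status>.+)$"
)

def parse_line(line):
    """Return (config, test, seconds, status), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # The first path component is the configuration directory (cupc, cupc-upcr).
    config = m.group("path").split("/", 1)[0]
    return config, m.group("test"), int(m.group("secs")), m.group("status").strip()

def slowdown(lines, baseline="cupc-upcr", candidate="cupc"):
    """Map each test name to candidate_time / baseline_time."""
    times = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            config, test, secs, _status = parsed
            times[(config, test)] = secs
    ratios = {}
    for (config, test), secs in times.items():
        if config == candidate and (baseline, test) in times:
            ratios[test] = secs / times[(baseline, test)]
    return ratios

# The two report lines quoted in this comment.
SAMPLE = [
    "cupc-upcr/runtime/log/harness/run.rpt:[collectives/try_reduce_nc_st04]   8sec  20140704_115243  SUCCESS",
    "cupc/runtime/log/harness/run.rpt:[collectives/try_reduce_nc_st04]   1353sec  20140707_123838  FAILED (TIME/NEW)",
]

if __name__ == "__main__":
    for test, ratio in slowdown(SAMPLE).items():
        print(f"{test}: {ratio:.1f}x slower")
```

Because the cupc run was killed by the harness at the time limit, the ratio is only a lower bound on the real slowdown.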

Running the cupc configuration with 4 UPC threads on 2-core systems running Linux (VM same as netbsd6-amd64 system), Mac OS X (my laptop) and Solaris (same h/w as netbsd6-i386 system), I did not reproduce the problem. The tests that time out after 22min on NetBSD instead run in tens of seconds or less.

I had reported earlier that FreeBSD showed the problem on cupc but not on cupc-upcr. However, I am now seeing both of those configurations time out after 22min. I need to investigate further.