ExpectationMax / opencurrent

OpenCurrent library for solving PDEs using CUDA (code.google.com/p/opencurrent)
Apache License 2.0

Consistent RayleighTest failure ("unspecified launch failure") #2

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
When running the RayleighTest unit test I encounter a consistent
"unspecified launch failure".  This error is observed for both release and
debug versions.

What steps will reproduce the problem?
1. Run the RayleighTest unit test: ./utest -gpu 0 RayleighTest
The failure is observed for the "run resolution 32" case. See rt_output.txt for sample output.

What version of the product are you using? On what operating system?
Running the current release version obtained via hg clone.

Running on CentOS 5.4 x86-64 system (uname -a):

Linux manzano 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64
x86_64 x86_64 GNU/Linux

Using GeForce GTX 260 MaxCore device (216 cores / 27 multiprocessors).  See
dq_output.txt for more details.

Original issue reported on code.google.com by doug.enr...@gmail.com on 31 Oct 2009 at 5:44

Attachments:

GoogleCodeExporter commented 8 years ago
please respond to dpephd1@gmail.com or dpephd-nvidia@yahoo.com

Original comment by dpep...@gmail.com on 31 Oct 2009 at 6:22

GoogleCodeExporter commented 8 years ago
Additional information about the driver, from /proc/nvidia/version:

NVRM version: NVIDIA UNIX x86_64 Kernel Module  190.18  Wed Jul 22 15:36:09 PDT 2009
GCC version:  gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)

From doug.enright@gmail / dpephd1@gmail.com - please use dpephd1@gmail.com in replies.

Original comment by dpep...@gmail.com on 31 Oct 2009 at 6:32

GoogleCodeExporter commented 8 years ago
I tried running this test on Windows XP Pro (32-bit) with VC 8 (Visual Studio 2005 Express Edition).

The test still failed, but not due to an "unspecified launch failure".  

Now I get the following:

================================================================================
run resolution 32
deltaT = 657.000000
init min/max t = 10.265625 646.734375
[WARNING] Sol_MultigridPressure3DBase::do_fmg - Failed to converge, error after: L2 = 0.141615 (0.374520x), Linf = 3.481782 (0.973917x)
[WARNING] Sol_MultigridPressure3DBase::solve - do_fmg did not converge, retrying with zeroed initial search vector
[WARNING] Sol_MultigridPressure3DBase::do_fmg - Failed to converge, error after: L2 = 0.048482 (12.194238x), Linf = 1.842807 (1.782636x)
[WARNING] Sol_MultigridPressure3DBase::solve - do_fmg failed
[WARNING] Failure: Sol_ProjectDivergence3DDevice::solve - could not solve for pressure

[ERROR] Equation::advance - failed to advance equation
[ASSERT] RayleighTest::assert_true at ..\..\src\tests\rayleigh.cpp line 239
[FAILED] RayleighTest

There were failures.
================================================================================

This error appears to be numerical in nature rather than a runtime/hardware issue.
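
For reference, here is a minimal host-side sketch of how L2 and Linf residual norms are conventionally computed for a 3D Poisson problem on a uniform grid; the function name, the 7-point stencil, and the array layout are my own illustration for cross-checking purposes, not OpenCurrent code.

// Hypothetical illustration (not OpenCurrent code): L2 and Linf norms of the
// residual r = b - A*p for a 7-point Laplacian on an n^3 uniform grid with
// spacing h, interior points only.
#include <algorithm>
#include <cmath>
#include <vector>

void residual_norms(const std::vector<double>& p, const std::vector<double>& b,
                    int n, double h, double* l2, double* linf)
{
  auto idx = [n](int i, int j, int k) { return (i * n + j) * n + k; };
  double sum_sq = 0.0, max_abs = 0.0;
  for (int i = 1; i < n - 1; ++i)
    for (int j = 1; j < n - 1; ++j)
      for (int k = 1; k < n - 1; ++k) {
        double lap = (p[idx(i+1,j,k)] + p[idx(i-1,j,k)] +
                      p[idx(i,j+1,k)] + p[idx(i,j-1,k)] +
                      p[idx(i,j,k+1)] + p[idx(i,j,k-1)] -
                      6.0 * p[idx(i,j,k)]) / (h * h);
        double r = b[idx(i,j,k)] - lap;                  // residual at this cell
        sum_sq += r * r;
        max_abs = std::max(max_abs, std::fabs(r));
      }
  *l2 = std::sqrt(sum_sq);
  *linf = max_abs;
}

Comparing host-side norms like these against what the device solver reports is one way to tell a genuine convergence problem apart from corrupted device memory.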

I would appreciate any guidance in understanding both of the issues observed with the RayleighTest unit test results.

Thanks,

dpe

Original comment by dpep...@gmail.com on 3 Nov 2009 at 7:30

Attachments:

GoogleCodeExporter commented 8 years ago
Some additional data when running under CentOS.

I tried running RayleighTest again this morning, and the first time I ran it the test never completed: it stalled at the start of the "run resolution 32" case. By "stalled" I mean it made no progress at a particular time step after ~10 minutes, at least that is how I interpret the "Log ratio: " output.

I Ctrl-C'd the run, restarted, and obtained the launch failure noted above. It seems likely that the Windows result, i.e. the failure-to-converge warning, is related to this behavior. Ctrl-C does not appear to terminate the program cleanly, and subsequent restarts result in a runtime/hardware failure.
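
As an aside on the Ctrl-C behavior: below is a minimal sketch (my own, not OpenCurrent code) of trapping SIGINT, leaving the time-stepping loop cleanly, and tearing down the CUDA context before exit, so a subsequent run starts from a clean state. It assumes the current CUDA runtime API; older toolkits used cudaThreadExit() instead of cudaDeviceReset().

// Hypothetical sketch: set a flag on SIGINT, exit the stepping loop at the
// next opportunity, and destroy the CUDA context explicitly before exiting.
#include <csignal>
#include <cstdio>
#include <cuda_runtime.h>

static volatile std::sig_atomic_t g_interrupted = 0;
static void on_sigint(int) { g_interrupted = 1; }  // async-signal-safe: only sets a flag

int main()
{
  std::signal(SIGINT, on_sigint);

  for (int step = 0; step < 10000 && !g_interrupted; ++step) {
    // ... launch kernels / advance the equation one time step ...
  }

  cudaDeviceReset();  // cudaThreadExit() on CUDA 2.x/3.x toolkits
  std::printf(g_interrupted ? "interrupted, exited cleanly\n" : "done\n");
  return 0;
}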

For the opencurrent developers: why would the warning noted above be issued under Windows but not CentOS?

dpe

Original comment by dpep...@gmail.com on 3 Nov 2009 at 4:11

GoogleCodeExporter commented 8 years ago
Additional Windows driver and system information.

Original comment by dpep...@gmail.com on 4 Nov 2009 at 9:55

Attachments:

GoogleCodeExporter commented 8 years ago
Some additional unit test information (CentOS), for both the sm13-rel and sm13-dbg builds.

The sm13-rel run silently hung during unit test 17 and had to be killed with Ctrl-C.

The sm13-dbg run finished to completion, with test failures noted.

Original comment by dpep...@gmail.com on 4 Nov 2009 at 3:57

Attachments:

GoogleCodeExporter commented 8 years ago
Update: I was able to get utest to compile under the CUDA 2.3 SDK for further testing and to run it under emulation mode.

Original comment by dpep...@gmail.com on 15 Nov 2009 at 8:46

GoogleCodeExporter commented 8 years ago
I ran the unit tests compiled with both debug (dbg) and release (rel) flags. Note that the debug and release flags used when compiling under the SDK differ from those used with CMake.

Only NSTest failed when compiled with debug flags. The warning/error messages issued are:

================================================================================
[INFO] Running on GPU 0
Running tests: NSTest
running NSTest
Frame 0
[WARNING] Sol_MultigridPressure3DBase::do_fmg - Failed to converge, error after: L2 = 0.000000000000 (nanx), Linf = 0.000061035156 (16056.320312x)
[WARNING] Sol_MultigridPressure3DBase::solve - do_fmg did not converge, retrying with zeroed initial search vector
[WARNING] Sol_MultigridPressure3DBase::do_fmg - Failed to converge, error after: L2 = 0.000000000000 (nanx), Linf = 0.000061035156 (16056.320312x)
[WARNING] Sol_MultigridPressure3DBase::solve - do_fmg failed
[WARNING] Failure: Sol_ProjectDivergence3DDevice::solve - could not solve for pressure

[ERROR] Equation::advance - failed to advance equation
================================================================================

MultigridMixedTest issued a warning about failure to converge:

================================================================================
[INFO] Running on GPU 0
Running tests: MultigridMixedTest
running MultigridMixedTest
[WARNING] Sol_MultigridPressure3DBase::do_fmg - Failed to converge, error after: L2 = 0.020313601941 (4447.862672x), Linf = 0.000000000000 (nanx)
================================================================================

More unit tests failed under the release compilation, namely: MultigridDoubleTest, NSTest, RayleighNoSlipTest, RayleighTest, and RayleighTimingTest.

In addition, warnings were issued for Advection3DDoubleSwirlTest, LockExDoubleTest, and LockExTest. All warnings were due to failure of do_fmg (the multigrid solver, I assume) to converge, e.g. for Advection3DDoubleSwirlTest:

================================================================================
Advection3DDoubleSwirlTest_rel.out:[WARNING] Sol_MultigridPressure3DBase::do_fmg - Failed to converge, error after: L2 = 0.000000000000 (nanx), Linf = 1449.953977296822 (0.021035x)
Advection3DDoubleSwirlTest_rel.out:[WARNING] Sol_MultigridPressure3DBase::solve - do_fmg did not converge, retrying with zeroed initial search vector
Advection3DDoubleSwirlTest_rel.out:[WARNING] Sol_MultigridPressure3DBase::do_fmg - Failed to converge, error after: L2 = 0.000000000000 (nanx), Linf = 926.607348723281 (0.032916x)
Advection3DDoubleSwirlTest_rel.out:[WARNING] Sol_MultigridPressure3DBase::solve - do_fmg failed
Advection3DDoubleSwirlTest_rel.out:[WARNING] Failure: Sol_ProjectDivergence3DDevice::solve - could not solve for pressure
================================================================================

Errors were issued primarily due to "unspecified launch failure" CUDA error messages, e.g. for MultigridDoubleTest:

================================================================================
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD_relax(256) - CUDA error "unspecified launch failure"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD_relax(256) - CUDA error "unspecified launch failure"
MultigridDoubleTest_rel.out:[ERROR] kernel_apply_3d_boundary_conditions_level1_nocorners(256) - CUDA error "invalid resource handle"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDevice::apply_boundary_conditions - failed at level 0
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD_relax(256) - CUDA error "unspecified launch failure"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD_relax(256) - CUDA error "unspecified launch failure"
MultigridDoubleTest_rel.out:[ERROR] kernel_apply_3d_boundary_conditions_level1_nocorners(256) - CUDA error "invalid resource handle"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDevice::apply_boundary_conditions - failed at level 0
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDevice::relax - failed at level 0
MultigridDoubleTest_rel.out:[ERROR] Grid3DDeviceT::clear_zero - cudaMemset failed
MultigridDoubleTest_rel.out:[ERROR] kernel_apply_3d_boundary_conditions_level1_nocorners(128) - CUDA error "invalid resource handle"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDevice::apply_boundary_conditions - failed at level 1
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD::bind_tex_calculate_residual - Could not bind texture U
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD_calculate_residual(256) - CUDA error "unspecified launch failure"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD_restrict(128) - CUDA error "unspecified launch failure"
MultigridDoubleTest_rel.out:[ERROR] Grid1DDeviceF::init - cudaMalloc failed
MultigridDoubleTest_rel.out:[ERROR] reduce_kernel - CUDA error "invalid resource handle"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDevice::restrict_residuals - failed at level 0 -> 1
MultigridDoubleTest_rel.out:[ERROR] Grid3DDeviceT::clear_zero - cudaMemset failed
MultigridDoubleTest_rel.out:[ERROR] kernel_apply_3d_boundary_conditions_level1_nocorners(256) - CUDA error "invalid resource handle"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDevice::apply_boundary_conditions - failed at level 0
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD::bind_tex_calculate_residual - Could not bind texture U
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDeviceD_calculate_residual(256) - CUDA error "unspecified launch failure"
MultigridDoubleTest_rel.out:[ERROR] Grid1DDeviceF::init - cudaMalloc failed
MultigridDoubleTest_rel.out:[ERROR] reduce_kernel - CUDA error "invalid resource handle"
MultigridDoubleTest_rel.out:[ERROR] Sol_MultigridPressure3DDevice::restrict_residuals - failed at level 0 -> 0
MultigridDoubleTest_rel.out:[ERROR] Grid1DDeviceF::init - cudaMalloc failed
MultigridDoubleTest_rel.out:[ERROR] reduce_kernel - CUDA error "invalid resource handle"
================================================================================
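
One way to localize where the first "unspecified launch failure" actually occurs (the error code is sticky, so later calls keep re-reporting it) is to check the status immediately after every kernel launch. A generic diagnostic macro along these lines, not part of OpenCurrent, might look like the following; it assumes the current runtime API (older toolkits use cudaThreadSynchronize() instead of cudaDeviceSynchronize()).

// Hypothetical diagnostic sketch (not OpenCurrent API): report the file/line of
// the first failing launch rather than a later, unrelated call.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK_LAUNCH(msg)                                                  \
  do {                                                                          \
    cudaError_t err_ = cudaGetLastError();           /* launch-time errors */   \
    if (err_ == cudaSuccess) err_ = cudaDeviceSynchronize(); /* async errors */ \
    if (err_ != cudaSuccess) {                                                  \
      std::fprintf(stderr, "%s failed at %s:%d: %s\n",                          \
                   msg, __FILE__, __LINE__, cudaGetErrorString(err_));          \
      std::exit(EXIT_FAILURE);                                                  \
    }                                                                           \
  } while (0)

// Usage after any kernel launch:
//   my_kernel<<<grid, block>>>(args);
//   CUDA_CHECK_LAUNCH("my_kernel");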

I had to modify some of the tests to use smaller grids, since my GTX 260 has only 866 MB of onboard device memory, as opposed to the 4 GB available on the Tesla C1060 line.
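
For the memory sizing, a quick way to pick grid sizes at run time is to query free/total device memory before allocating. Below is a generic sketch using the runtime API (not something OpenCurrent provides, and the half-of-free-memory budget is an arbitrary assumption).

// Hypothetical sketch: query device memory and estimate the largest cubic
// double-precision scalar field that fits in roughly half of the free memory.
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  size_t free_bytes = 0, total_bytes = 0;
  if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
    std::fprintf(stderr, "cudaMemGetInfo failed\n");
    return 1;
  }
  std::printf("device memory: %.1f MB free of %.1f MB\n",
              free_bytes / 1048576.0, total_bytes / 1048576.0);

  double cells = 0.5 * free_bytes / sizeof(double);        // budget: half of free memory
  int n = static_cast<int>(std::floor(std::cbrt(cells)));  // side of the largest cubic grid
  std::printf("largest cubic double grid under that budget: ~%d^3\n", n);
  return 0;
}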

Any assistance in resolving these unit test failure messages would be 
appreciated.

dpe

Original comment by dpep...@gmail.com on 25 Nov 2009 at 3:29

GoogleCodeExporter commented 8 years ago
Additional comments about CMake and gdb debugging can be found here:

http://forums.nvidia.com/index.php?s=&showtopic=105027&view=findpost&p=960293

Original comment by dpep...@gmail.com on 5 Dec 2009 at 11:34

GoogleCodeExporter commented 8 years ago
I am now able to successfully run the unit tests (the make test target); however, this required an extensive investigation of the various optimization flags and of directing those flags to specific parts of the CUDA compilation trajectory (nvopencc and ptxas). CMake 2.8 with the updated FindCUDA.cmake capability was required to enable cuda-gdb use. I am also running the CUDA 3.0 beta with the new driver (195.17).

I am only able to run at the -O0 and -O1 optimization levels. There is minimal to no performance difference between those two levels, but there is a significant performance difference between -O0 and the debug flags (-g -G). The various unit test performance results are attached.

I consider the issue closed, but I would appreciate it if someone knowledgeable about what the specific optimization levels do could comment on this behavior and why it may be happening. I am still observing unit test failures when trying to use optimization level -O2, and I do not know whether -O2 would provide any additional performance benefit. If someone could comment on this, I would appreciate it.

dpe

Original comment by dpep...@gmail.com on 15 Dec 2009 at 5:06

Attachments: