neiljamieso commented 11 years ago

Hi,

I tried to build the examples. This failed due to not finding a definition of "note" in Benchmark.hs. This was solved by adding import Criterion.IO.Printf to the import list.

tmcdonell commented 11 years ago

This should be fixed by AccelerateHS/accelerate-examples@91250cad40bcc5dc29a8297d9e567188986be11b. Can you confirm this?

neiljamieso commented 11 years ago

Yes. Built fine. Lots of "fails" in running with the CUDA backend. I'm using Cuda 5 - not sure if this breaks stuff. Do you want to see the list?

neiljamieso commented 11 years ago

Most of the fails were of the form...

: Failed: **\* Internal error in package accelerate *** **\* Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA.hs:153 (unhandled): CUDA Exception: unspecified launch failure fold-sum and fold-2D-sum also failed but differently fold-sum: Failed: > > > () : (-317.71545,-725.824) fold-2d-sum: Failed: > > > 0 : (10.181486,10.786726) > > > 3 : (-12.445062,-5.0869923) > > > 4 : (-22.740108,-34.520443) > > > 5 : (7.2517667,5.283786) > > > 6 : (-7.7953305,-19.361605) > > > 7 : (16.353685,16.106562) > > > 8 : (4.841938,3.6077766) > > > 9 : (6.7518387,2.3453445) > > > 10 : (-14.926775,-22.0668) > > > 12 : (-8.844832,0.86133194) > > > 13 : (31.82425,42.47469) > > > 15 : (-12.590198,-8.076189) > > > 16 : (2.4275239,-1.1079388) > > > 18 : (-4.6298413,10.507795) > > > 19 : (-5.7560434,-24.80141) > > > 20 : (-27.520971,-58.446945) > > > 21 : (-10.380567,-17.262444) > > > 24 : (-5.6269426,-3.59577) > > > 34 : (18.326572,15.697114) > > > 36 : (-21.652311,-20.826466) > > > 37 : (-0.69646883,-14.07115) > > > 39 : (-1.9313966,-1.646287) > > > 40 : (-15.114215,-4.3450966) > > > 41 : (9.819355,4.646344) > > > 42 : (-13.3020315,-18.24121) > > > 43 : (-4.780798,-11.156574) > > > 45 : (-13.909897,-19.179947) > > > 46 : (-24.877073,-25.394434) > > > 48 : (-10.917168,-7.612333) > > > 49 : (8.59276,-10.744858) > > > 50 : (-43.603035,-53.99748) > > > 53 : (17.853306,21.356565) > > > 55 : (-2.121977,6.3397703) > > > 58 : (-4.2652583,-2.5864878) > > > 59 : (-4.6431007,-3.1721497) > > > 60 : (14.112302,15.44854) > > > 61 : (-28.66971,-50.8897) > > > 63 : (-14.38963,-20.192778) > > > 67 : (-29.752752,-29.051735) > > > 70 : (18.686342,27.751282) > > > 76 : (-11.068267,-3.157248) > > > 77 : (-30.1085,-35.691612) > > > 78 : (17.865221,33.37813) > > > 79 : (12.610696,10.771452) > > > 80 : (13.958698,14.737689) > > > 83 : (-51.858498,-58.283985) > > > 85 : (12.039097,14.588022) > > > 86 : (-14.114648,-17.417624) > > > 89 : (23.790989,25.472948) > > > 90 : (-18.82345,-17.08065) > > > 91 : (2.463029,5.9130898) > > > 92 : (4.0238266,5.5120225) > > > 93 : (-8.8636265,-8.364969) > > > 95 : (-16.640343,-13.33732) > > > 96 : (10.943283,20.977047) > > > 97 : (-2.759805,-10.179357) > > > 99 : (7.461958,4.374811) > > > 101 : (6.5351143,10.87258) > > > 102 : (-8.328936,-3.353552) > > > 103 : (-8.919393,-10.651541) > > > 104 : (-8.599477,-32.173218) > > > 105 : (-3.4648807,-12.457461) > > > 107 : (-9.112293,-10.76436) > > > 109 : (10.36928,19.196201) > > > 111 : (-0.74972934,-8.263916) > > > 112 : (-1.4251958,-1.3936005) > > > 114 : (-5.7750616,-6.656393) > > > 115 : (-4.1570673,-5.0010214) > > > 118 : (-14.588455,-5.8673525) > > > 122 : (-3.905911,1.3459797) > > > 124 : (11.671464,13.249651) > > > 128 : (24.242702,31.903507) > > > 130 : (-5.312511,-8.756293) > > > 131 : (-17.744507,-24.541887) > > > 133 : (-3.0010543,-7.737555) > > > 136 : (8.380546,11.387158) > > > 138 : (11.308516,11.967691) > > > 139 : (-17.7391,-29.652555) > > > 141 : (-25.26024,-34.264626) > > > 145 : (-11.910921,-14.598899) > > > 147 : (18.361284,8.458666) > > > 148 : (-2.0598116,9.742126) > > > 151 : (-1.5615535,-6.330538) > > > 155 : (-14.633401,-24.910007) > > > 158 : (1.7897742,-3.3920808) > > > 160 : (7.98956,9.146147) > > > 161 : (-21.875072,-25.081263) > > > 162 : (5.615722,20.186003) > > > 163 : (9.19277,14.405633) > > > 166 : (-4.6076007,0.6831827) > > > 167 : (-10.567481,-4.0725036) > > > 169 : (0.4859029,-6.1355286) > > > 170 : (19.870667,19.815443) > > > 172 : (6.0666904,7.6584425) > > > 173 : (8.849107,0.12496734) > > > 175 : (-11.274898,-16.4241) > > > 177 : (-27.324623,-33.917286) > > > 178 : (0.21815288,3.8251867) > > > 179 : (-6.1652923,-4.998172) > > > 180 : (-14.112642,-19.027935) > > > 181 : (-2.080636,6.853819e-3) > > > 183 : (3.6447208,-6.9173365) > > > 187 : (-27.273254,-38.26088) > > > 189 : (-9.826919,-14.5337925) > > > 190 : (1.3126237,0.9761648) > > > 191 : (-4.1650763,-1.852829) > > > 193 : (18.610937,22.746304) > > > 194 : (-4.691451,-0.86483383) > > > 196 : (-4.7458477,-23.575771) > > > 197 : (-2.7342944,-10.165984) > > > 199 : (-11.298469,-18.151875) > > > 200 : (5.3247147,-4.0813465) > > > 201 : (14.916756,23.434582) > > > 203 : (-0.1067512,4.8686438) > > > 204 : (-14.124139,-4.513797) > > > 206 : (-7.185062,-0.58614635) > > > 207 : (-19.701935,-20.333096) > > > 208 : (-11.467451,-7.518866) > > > 210 : (31.49854,38.85581) > > > 212 : (-16.014204,-17.766535) > > > 216 : (-18.965578,-29.654585) > > > 220 : (-0.17519975,-5.1846743) > > > 225 : (16.0454,19.740955) > > > 226 : (-0.67587143,1.3499918) > > > 229 : (-21.621109,-23.055359) > > > 231 : (1.533406,0.9220514) > > > 232 : (1.5521168,-2.942934) > > > 235 : (-26.18992,-28.304138) > > > 237 : (-12.360111,-14.813786) > > > 244 : (-26.788136,-26.856113) > > > 245 : (-11.375093,-6.4627395) > > > 249 : (-14.0135765,-18.813738) > > > 251 : (-28.578781,-39.254063) > > > 261 : (23.480045,28.535007) > > > 263 : (-20.27542,-30.240715) > > > 264 : (1.0410566,5.445823) > > > 265 : (-12.174866,-11.87295) > > > 270 : (-2.2434764,1.3028297) > > > 271 : (-5.3730717,-7.069026) > > > 272 : (-32.547344,-40.939163) > > > 273 : (-11.036853,-14.617073) > > > 274 : (1.5726653,7.1989527) > > > 276 : (13.667664,-4.6318626) > > > 277 : (-19.315035,-14.617573) > > > 279 : (0.14692748,6.2511544) > > > 281 : (-0.6385382,0.5433495) > > > 282 : (0.13369226,-2.5549994) > > > 285 : (-25.613811,-23.304722) > > > 286 : (11.909087,6.9073195) > > > 287 : (11.177615,14.907998) > > > 289 : (8.337317,10.699486) > > > 291 : (-6.394571,-2.2123995) > > > 293 : (-12.401189,-4.961336) > > > 294 : (20.566023,22.415432) > > > 299 : (6.3981833,14.163654) > > > 301 : (-15.557607,-12.6597595) > > > 308 : (3.6762142,9.144186) > > > 310 : (0.26484996,-7.8996334) > > > 312 : (3.0426567,7.4979715) > > > 314 : (14.884919,14.266132)

tmcdonell commented 11 years ago

Hmm... what card are you running on, and what compute capability is it? The internal error especially is a bit worrying; I haven't seen that one in a while. The fold errors at least should be easier to debug.

neiljamieso commented 11 years ago

On 12/05/13 17:26, Trevor L. McDonell wrote:

Hmm... what card are you running on, and what compute capability is it? The internal error especially is a bit worrying; I haven't seen that one in a while. The fold errors at least should be easier to debug.

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-17772910.

K1000M and using optirun to do the switching. It works fine with all the CUDA examples from nvidia. Ah! But maybe it is not working if you de-attach from the primary caller (as I suspect you do in the async functions). I will check that out - rings a bell from the bumblebee documentation.

Neil

tmcdonell commented 11 years ago

Yes, we do need to push and pop the CUDA context; I thought that that was enough, but my reading of the CUDA docs might be incorrect (and; I had not even heard of optirun before now!)

neiljamieso commented 11 years ago

Optirun is part of the bumblebee project to allow use of the Optimus GPUs under Linux. As it is not provided by nvidia it it possible it brings it's own issues. As I say though all the nvidia examples seem to run fine under it.

On 12/05/13 17:52, Trevor L. McDonell wrote:

Yes, we do need to push and pop the CUDA context; I thought that that was enough, but my reading of the CUDA docs might be incorrect (and; I had not even heard of |optirun| before now!)

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-17773100.

tmcdonell commented 11 years ago

Actually, does optimus aim to allow dynamic switching between a pair of low/high power GPUs? I have a similar problem with this dynamic switching (usually) not working under Mac OS X (#67), even with the NVIDIA drivers, although it does seem to work with the NVIDIA examples.

Does it work if you disable the switching and only use the fast GPU?

neiljamieso commented 11 years ago

Hi Trevor,

Yes. The Optimus is an NVIDIA design which uses the onboard Intel graphics most of the time. The display is ALWAYS done by the Intel unit, but the rendering is directed to the NVIDIA card on a switchable basis. Bumblebee is an opensource module to allow this switching on Linux (as NVIDIA neglect to provide this themselves). Choosing to run a programme using the NVIDIA card is done by running the programme under Optirun, so I enter: $ optirun ./accelerate-examples

I have been thinking, and the errors I am getting now are of language errors from the CUDA system. As I say ALL the NVIDIA code runs fine run under optirun - so I wonder if this is about changes to the CUDA language with CUDA 5. Have you had success with CUDA 5 and accelerate in other hardwares?

Cheers, Neil

On Mon, May 13, 2013 at 11:10 AM, Trevor L. McDonell < notifications@github.com> wrote:

Actually, does optimus aim to allow dynamic switching between a pair of low/high power GPUs? I have a similar problem with this dynamic switching (usually) not working under Mac OS X (#67https://github.com/AccelerateHS/accelerate/issues/67), even with the NVIDIA drivers, although it does seem to work with the NVIDIA examples.

Does it work if you disable the switching and only use the fast GPU?

— Reply to this email directly or view it on GitHubhttps://github.com/AccelerateHS/accelerate/issues/92#issuecomment-17787198 .

tmcdonell commented 11 years ago

Hi Neil,

I am using CUDA 5 and it has worked for me --- this is on Mac OS X and Ubuntu. It might make a difference if you're on a different linux distribution?

What do you mean by language errors from the CUDA system? Different errors from the earlier "unspecified launch failure" ?

Try changing this from forkOS to forkOn 0 and let me know what happens? https://github.com/AccelerateHS/accelerate-cuda/blob/master/Data/Array/Accelerate/CUDA/Async.hs#L36

neiljamieso commented 11 years ago

Sorry Trevor. The "language" error was a language error of my own - due to wrapping at the edge of the terminal window. :-((

Will try your suggestion when I get home.

Neil

On Mon, May 13, 2013 at 9:28 PM, Trevor L. McDonell < notifications@github.com> wrote:

Hi Neil,

I am using CUDA 5 and it has worked for me --- this is on Mac OS X and Ubuntu. It might make a difference if you're on a different linux distribution?

What do you mean by language errors from the CUDA system? Different errors from the earlier "unspecified launch failure" ?

Try changing this from forkOS to forkOn 0 and let me know what happens?

https://github.com/AccelerateHS/accelerate-cuda/blob/master/Data/Array/Accelerate/CUDA/Async.hs#L36

— Reply to this email directly or view it on GitHubhttps://github.com/AccelerateHS/accelerate/issues/92#issuecomment-17801317 .

neiljamieso commented 11 years ago

Hi Trevor,

I have had another thought. Debian Wheezy (my OS) comes with gcc 4.7 as standard. CUDA 5 only works with gcc 4.6 (I tried 4.7). I thought of this last night and rebuild accelerate-cuda and accelerate-examples with gcc pointing to 4.6 (and g++ the same). This didn't make any difference, but I wonder if I need to rebuild the whole of haskell on gcc 4.6.

What is the default version of gcc on your OS?

Cheers Neil

tmcdonell commented 11 years ago

On my Mac it is gcc-4.2, but this is Apple's own version so I am not sure if that is comparable. The Ubuntu 12.04 box uses gcc-4.6.3.

Adding the flag -ddump-gc will give rather chatty output of whenever it tries to do memory allocations. Since this is quite fine grained, it might give a few more indications about what is going on (failed on the first attempt, worked for a while and then failed, etc)

neiljamieso commented 11 years ago

Hi Trevor,

Did make a difference. This is the output:

neil@debian-neil:~/.cabal/bin$ optirun bash
neil@debian-neil:~/.cabal/bin$ ./accelerate-examples --cuda -k
running with CUDA backend
to display available options, rerun with '--help'

map-abs: Ok
map-plus: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

map-square: Ok
zip: Ok
zipWith-plus: Ok
fold-sum: Failed:
 >>> () : (-284.77808,-299.1781)

fold-product: Ok
fold-maximum: Ok
fold-minimum: Ok
fold-2d-sum: Failed:
 >>> 4 : (3.5994253,4.1811037)
 >>> 5 : (-9.134442,-18.42069)
 >>> 6 : (-3.1958194,-7.940378)
 >>> 7 : (5.7998953,19.296043)
 >>> 9 : (14.701389,23.411243)
 >>> 12 : (26.571274,38.411602)
 >>> 14 : (23.842213,23.651949)
 >>> 15 : (-9.196165,-0.6621127)
 >>> 19 : (43.498287,45.15085)
 >>> 20 : (13.474283,14.403748)
 >>> 21 : (-11.930797,-9.856017)
 >>> 22 : (8.05154,8.746618)
 >>> 23 : (25.273453,25.40266)
 >>> 24 : (5.668702,7.682753)
 >>> 25 : (-23.540642,-24.084503)
 >>> 26 : (7.730505,3.7250352)
 >>> 28 : (-13.682002,-22.007523)
 >>> 30 : (22.153667,29.94641)
 >>> 32 : (3.9512172,4.8625793)
 >>> 34 : (-20.773705,-23.154194)
 >>> 35 : (14.610652,17.81879)
 >>> 36 : (-6.893841,-5.690979)
 >>> 38 : (3.470799,9.4239e-2)
 >>> 40 : (17.497482,27.669067)
 >>> 41 : (-3.0244708e-2,-3.8516002)
 >>> 43 : (19.843216,26.517456)
 >>> 44 : (5.050486,8.706543)
 >>> 47 : (-5.443891,-5.188139)
 >>> 49 : (-8.316223,-12.395588)
 >>> 51 : (5.367283,23.022243)
 >>> 52 : (11.321204,6.605723)
 >>> 53 : (16.014208,17.675938)
 >>> 57 : (-22.71127,-28.897242)
 >>> 60 : (-2.7958093,3.0328588)
 >>> 61 : (14.372042,10.27017)
 >>> 63 : (-13.966523,-16.551018)
 >>> 65 : (-2.3377113,-8.886295)
 >>> 66 : (0.41673332,4.9110966)
 >>> 67 : (-3.150734,1.390254)
 >>> 68 : (-9.262151,-4.612889)
 >>> 70 : (1.1192223,-0.87473106)
 >>> 71 : (-16.735855,-13.542116)
 >>> 72 : (-2.7853413,-3.259285)
 >>> 75 : (-0.42108774,12.822178)
 >>> 76 : (37.315483,58.080196)
 >>> 77 : (21.378624,24.565968)
 >>> 79 : (17.399918,11.301307)
 >>> 80 : (6.1325307,-3.117681)
 >>> 82 : (-25.688484,-23.890837)
 >>> 84 : (-29.327036,-46.779266)
 >>> 85 : (-12.640158,-17.59966)
 >>> 86 : (14.217806,22.999573)
 >>> 87 : (3.0769944,0.67498803)
 >>> 88 : (14.598545,13.440449)
 >>> 94 : (-18.738943,0.6576848)
 >>> 98 : (-1.2732513,-9.02783)
 >>> 100 : (14.017002,22.866009)
 >>> 102 : (10.585675,-0.76270866)
 >>> 103 : (-22.687687,-24.832624)
 >>> 105 : (13.726986,8.545394)
 >>> 108 : (18.212643,22.956026)
 >>> 110 : (-14.852369,-22.597391)
 >>> 111 : (2.3865306,5.926875)
 >>> 112 : (3.0377512,-1.880888)
 >>> 114 : (-10.134539,-9.8238)
 >>> 115 : (-4.3836536,3.3319654)
 >>> 116 : (-5.7152805,-14.443269)
 >>> 117 : (8.012011,7.6332164)
 >>> 118 : (-17.265642,-15.1257715)
 >>> 119 : (12.728009,14.087517)
 >>> 120 : (-18.342087,-23.154064)
 >>> 121 : (-21.715904,-17.897583)
 >>> 123 : (-13.022339,-12.231892)
 >>> 124 : (16.29696,30.115715)
 >>> 126 : (8.191839,16.790535)
 >>> 127 : (7.316367,14.373995)
 >>> 128 : (23.410019,22.88608)
 >>> 129 : (10.068765,-24.64301)
 >>> 131 : (-26.669355,-26.25417)
 >>> 132 : (2.4118686,-3.5020428)
 >>> 133 : (-13.115518,-21.87509)
 >>> 134 : (12.896856,12.63337)
 >>> 136 : (13.352133,12.780149)
 >>> 137 : (24.687658,17.437037)
 >>> 140 : (4.4784513,-8.002885)
 >>> 141 : (19.64967,21.850222)
 >>> 142 : (-17.395033,-11.799833)
 >>> 144 : (4.605325,9.768799)
 >>> 149 : (-27.127146,-31.195862)
 >>> 150 : (-20.15325,-38.91357)
 >>> 151 : (-11.284405,-7.634466)
 >>> 153 : (1.4470301,2.2499762)
 >>> 155 : (17.06059,23.061432)
 >>> 157 : (13.256235,9.830044)
 >>> 158 : (8.65885e-2,15.133558)
 >>> 161 : (19.461996,30.09988)
 >>> 162 : (8.695209e-2,1.2758055)
 >>> 164 : (0.23431987,-5.4021072)
 >>> 165 : (-8.806317,-7.660516)
 >>> 167 : (2.9375281,-1.7019806)
 >>> 168 : (4.8822374,1.7404442)
 >>> 169 : (-6.0983124,-6.616735)
 >>> 170 : (-10.859095,-24.070465)
 >>> 171 : (-30.173882,-38.876015)
 >>> 172 : (7.5324316,10.573803)
 >>> 173 : (-7.9830656,-0.61189365)
 >>> 174 : (3.8499007,2.8259583)
 >>> 175 : (9.863973,18.671043)
 >>> 176 : (1.5010693,7.730674)
 >>> 177 : (-19.172495,-15.866618)
 >>> 178 : (10.258595,11.646437)
 >>> 179 : (-36.72372,-32.991608)
 >>> 180 : (4.0878096,4.3566303)
 >>> 183 : (-16.212082,-12.850005)
 >>> 186 : (20.656956,44.957047)
 >>> 187 : (9.899384,8.580212)
 >>> 188 : (24.487984,24.992609)
 >>> 194 : (16.086586,6.133008)
 >>> 195 : (-12.79052,-14.317617)
 >>> 200 : (4.5302505,8.308535)
 >>> 201 : (-10.723634,-23.400677)
 >>> 202 : (-4.187149,-15.145685)
 >>> 203 : (-15.959601,-16.193207)
 >>> 204 : (27.673164,32.605988)
 >>> 205 : (-22.693754,-33.882385)
 >>> 206 : (-0.7072872,-1.9263825)
 >>> 208 : (-2.4695814,-0.21775436)
 >>> 209 : (-7.441179,-7.886807)
 >>> 216 : (-26.625347,-34.00032)
 >>> 217 : (-12.935532,-12.696256)
 >>> 219 : (10.233142,16.826408)
 >>> 223 : (-20.659527,-19.133957)
 >>> 225 : (4.6232724,-5.518243)
 >>> 226 : (-3.6734939e-3,-0.32396984)
 >>> 228 : (31.582458,35.58126)
 >>> 229 : (-0.7545265,-10.300518)
 >>> 231 : (12.414625,15.020456)
 >>> 234 : (10.174679,19.857052)
 >>> 235 : (-13.687687,-11.906177)
 >>> 239 : (-16.81191,-17.177837)
 >>> 241 : (5.6338625,7.43606)
 >>> 246 : (-6.5156856,-9.638809)
 >>> 247 : (-0.42078322,-4.191985)
 >>> 249 : (11.335211,10.828511)
 >>> 252 : (-0.8734268,-16.709965)
 >>> 253 : (2.7642574,5.442359)
 >>> 255 : (-15.736735,-13.98167)
 >>> 257 : (5.946913,2.0609694)
 >>> 258 : (-6.6435785,-8.290497)
 >>> 259 : (13.248286,15.020397)
 >>> 260 : (40.213238,62.449997)
 >>> 261 : (-1.8538256,-4.91119)
 >>> 266 : (10.244856,6.945044)
 >>> 268 : (-13.880142,-21.150314)
 >>> 269 : (14.314802,14.349737)
 >>> 270 : (-27.502745,-33.003326)
 >>> 271 : (10.64012,6.457108)
 >>> 272 : (-16.236614,-21.558899)
 >>> 273 : (20.561716,24.363443)
 >>> 274 : (-10.97512,-6.042589)
 >>> 280 : (-12.273643,-13.009692)
 >>> 283 : (3.3773353,8.302713)
 >>> 286 : (-1.6639676,-3.079587)
 >>> 287 : (-21.63964,-23.37448)
 >>> 290 : (-14.440636,-24.584656)
 >>> 291 : (0.17262441,-1.6445827)
 >>> 294 : (19.45585,29.862196)
 >>> 298 : (2.3329654,8.237259)
 >>> 303 : (15.277465,12.724495)
 >>> 304 : (-10.626967,-18.734402)
 >>> 309 : (-11.389035,-6.8129835)
 >>> 310 : (-7.8077154,-9.264032)
 >>> 311 : (3.3524702,-7.6005263)
 >>> 313 : (22.357534,21.090479)
 >>> 314 : (14.302358,4.895173)
 >>> 315 : (-32.722397,-41.946712)

fold-2d-product: Ok
fold-2d-maximum: Ok
fold-2d-minimum: Ok
foldseg-sum: Ok
scanseg-sum: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

stencil-1D: Ok
stencil-2D: Ok
stencil-3D: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

stencil-3x3-cross: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

stencil-3x3-pair: Ok
stencil2-2D: Ok
permute-hist: Ok
backpermute-reverse: Ok
backpermute-transpose: Ok
init: Ok
tail: Ok
take: Ok
drop: Ok
slit: Ok
gather: Ok
gather-if: Ok
scatter: Ok
scatter-if: Ok
sasum: Failed:
 >>> () : (50137.895,63516.633)

saxpy: Ok
dotp: Failed:
 >>> () : (120.643745,144.3627)

filter: Ok
smvm: Ok
black-scholes: Ok
radixsort: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

io: test: fromPtr Int
test: fromPtr (Int,Double)
test: toPtr Int16
test: toPtr Int32
test: toPtr Int64
test: fromArray Int
Ok
io: +++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
Ok
canny: Failed: no image file specified
integral-image: Failed: no image file specified
slices: Ok
slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

slices: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at 
https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:246 (unhandled): CUDA Exception: invalid 
context handle

slices: Ok
slices: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
sharing-recovery: Ok
bound-variables: Ok

On 13/05/13 21:28, Trevor L. McDonell wrote:

Hi Neil,

I am using CUDA 5 and it has worked for me --- this is on Mac OS X and Ubuntu. It might make a difference if you're on a different linux distribution?

What do you mean by language errors from the CUDA system? Different errors from the earlier "unspecified launch failure" ?

Try changing this from |forkOS| to |forkOn 0| and let me know what happens? https://github.com/AccelerateHS/accelerate-cuda/blob/master/Data/Array/Accelerate/CUDA/Async.hs#L36

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-17801317.

neiljamieso commented 11 years ago

Hullo Trevor,

SUCCESS!

I rebuilt all the accelerate packages (with the change to forkOn in place) and the accelerate-examples all run perfectly!

Not sure how to interpret the benchmarks but am very pleased to have it going.

May I also say that the code is beautiful. Don't understand it all yet, but it is very aesthetically pleasing what I have read.

Neil

neiljamieso commented 11 years ago

OOPS! Duh! I didn't turn on --cuda, so of course they all looked ok.

Sorry. No change with cuda backend. :-((

All this regarding accelerate-examples of course.

Neil

tmcdonell commented 11 years ago

Neil, could you try again with the latest version? I managed to create a setup that threw an invalid context error, so the fix for that might help in your situation as well.

neiljamieso commented 11 years ago

Will do Trevor.

On 28/05/13 23:00, Trevor L. McDonell wrote:

Neil, could you try again with the latest version? I managed to create a setup that threw an invalid context error, so the fix for that might help in your situation as well.

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-18548549.

neiljamieso commented 11 years ago

I got this error in trying to compile the examples:

[ 6 of 12] Compiling Test.IndexSpace ( examples/nofib/Test/IndexSpace.hs, dist/build/accelerate-nofib/accelerate-nofib-tmp/Test/IndexSpace.o )

examples/nofib/Test/IndexSpace.hs:170:71: Ambiguous occurrence even' It could refer to eitherP.even', imported from Prelude' at examples/nofib/Test/IndexSpace.hs:6:1-60 (and originally defined inGHC.Real') or A.even', imported fromData.Array.Accelerate' at examples/nofib/Test/IndexSpace.hs:20:1-60 (and originally defined in `accelerate-0.14.0.0:Data.Array.Accelerate.Language')

I'll have a look and change to A.even as I assume that's what you meant.

Neil.

On 28/05/13 23:00, Trevor L. McDonell wrote:

Neil, could you try again with the latest version? I managed to create a setup that threw an invalid context error, so the fix for that might help in your situation as well.

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-18548549.

neiljamieso commented 11 years ago

This fixed it: -- gatherIfAcc even' mapv maskv defaultv xs .==. gatherIfRef even mapv maskv defaultv xs gatherIfAcc even' mapv maskv defaultv xs .==. gatherIfRef P.even mapv maskv defaultv xs

On 28/05/13 23:00, Trevor L. McDonell wrote:

Neil, could you try again with the latest version? I managed to create a setup that threw an invalid context error, so the fix for that might help in your situation as well.

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-18548549.

neiljamieso commented 11 years ago

Another one: [18 of 36] Compiling Gather ( examples/tests/primitives/Gather.hs, dist/build/accelerate-examples/accelerate-examples-tmp/Gather.o )

examples/tests/primitives/Gather.hs:41:11: Ambiguous occurrence even' It could refer to eitherAcc.even', imported from Data.Array.Accelerate' at examples/tests/primitives/Gather.hs:9:1-48 (and originally defined in accelerate-0.14.0.0:Data.Array.Accelerate.Language') or P.even', imported fromPrelude' at examples/tests/primitives/Gather.hs:10:1-33 (and originally defined in `GHC.Real') Failed to install accelerate-examples-0.14.0.0 c

On 28/05/13 23:00, Trevor L. McDonell wrote:

Neil, could you try again with the latest version? I managed to create a setup that threw an invalid context error, so the fix for that might help in your situation as well.

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-18548549.

neiljamieso commented 11 years ago

And: [19 of 36] Compiling Scatter ( examples/tests/primitives/Scatter.hs, dist/build/accelerate-examples/accelerate-examples-tmp/Scatter.o )

examples/tests/primitives/Scatter.hs:52:11: Ambiguous occurrence even' It could refer to eitherP.even', imported from Prelude' at examples/tests/primitives/Scatter.hs:16:1-44 (and originally defined inGHC.Real') or Acc.even', imported fromData.Array.Accelerate' at examples/tests/primitives/Scatter.hs:17:1-59 (and originally defined in `accelerate-0.14.0.0:Data.Array.Accelerate.Language')

On 28/05/13 23:00, Trevor L. McDonell wrote:

Neil, could you try again with the latest version? I managed to create a setup that threw an invalid context error, so the fix for that might help in your situation as well.

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-18548549.

neiljamieso commented 11 years ago

Both fixed same way, and all now compile...Lets see how they run!

On 28/05/13 23:00, Trevor L. McDonell wrote:

Neil, could you try again with the latest version? I managed to create a setup that threw an invalid context error, so the fix for that might help in your situation as well.

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-18548549.

neiljamieso commented 11 years ago

This is the output...I use Ctrl-C during the 4th slices as it seemed to hang.

neil@debian-neil:~/.cabal/bin$ optirun --no-xorg ./accelerate-examples --cuda -k running with CUDA backend to display available options, rerun with '--help'

map-abs: Ok map-plus: Ok map-square: Ok zip: Ok zipWith-plus: Ok fold-sum: Failed:

() : (-21.361326,32.731934)

fold-product: Ok fold-maximum: Ok fold-minimum: Ok fold-2d-sum: Failed:

1 : (3.9905946,12.912712) 3 : (4.3853145,6.6357403) 4 : (8.841903,3.564476) 5 : (22.403717,22.856863) 6 : (-7.058512,-0.8101158) 7 : (-13.209917,-14.428578) 8 : (3.6516001,3.9791288) 9 : (1.5006628,1.639061) 10 : (8.085807,10.32614) 11 : (11.110486,13.271563) 12 : (11.344211,24.075565) 13 : (5.494232,6.638853) 15 : (-18.813566,-28.975445) 17 : (-10.612726,-11.403031) 19 : (30.455154,48.39125) 21 : (0.6439582,-8.811903e-2) 23 : (0.44115293,1.8800209) 25 : (-1.081647,4.433939) 29 : (3.649135,0.9225111) 30 : (-3.5161483,0.26748943) 31 : (6.247751,4.1066437) 33 : (-19.144558,-21.607367) 34 : (14.241796,-0.3949709) 35 : (-6.4786077,-4.0578346) 36 : (1.6614412,9.047534) 37 : (-9.929752e-2,-8.920741) 38 : (-0.5181453,-14.03962) 41 : (17.492886,5.483637) 42 : (1.5826802,1.5337367) 43 : (-22.710932,-26.35552) 45 : (7.819425,8.852381) 47 : (3.8250275,-1.1689825) 48 : (31.711973,36.747433) 49 : (5.4925137,10.268168) 52 : (-10.457833,-12.00074) 53 : (22.555317,34.491005) 54 : (-13.917394,-17.875317) 57 : (3.446729,-6.599143) 60 : (-9.107978,3.4590158) 62 : (-24.056997,-29.912) 63 : (2.436757,3.1981812) 64 : (-1.2618066,1.2730389) 68 : (28.439875,36.614067) 70 : (0.5847907,2.280851) 74 : (-2.3531268,-4.332817) 75 : (4.663379,8.0118885) 78 : (7.195462,14.593959) 79 : (-5.2660117,-12.242489) 80 : (-15.816689,-18.658928) 81 : (12.112614,8.826111) 82 : (14.143523,18.241121) 83 : (-25.847208,-30.473446) 84 : (11.379544,4.809246) 86 : (15.708036,29.72469) 87 : (4.8327255,3.5389404) 91 : (-12.49356,-7.1337805) 92 : (-3.2796116,-1.6790586) 93 : (-8.711067,-17.377827) 94 : (-21.488873,-14.433965) 97 : (2.3867311,-1.5279217) 98 : (5.4814205,-2.1296844) 99 : (-3.566555,-5.9053173) 100 : (13.362963,10.301908) 101 : (1.7501,-3.362393) 102 : (-1.8447578,-11.294733) 103 : (3.365004,10.789146) 105 : (-34.87906,-40.80436) 106 : (-12.686344,-15.8895645) 107 : (9.183949,7.9775457) 110 : (-22.573433,-12.892656) 112 : (12.944003,17.68826) 113 : (-20.14838,-21.692518) 114 : (-0.13564283,9.673411) 117 : (-34.568615,-36.956146) 118 : (-9.420436,-4.6167736) 125 : (-3.868143,-6.226729) 126 : (-24.039621,-25.80162) 127 : (-3.2252026,-4.7092633) 128 : (-9.503313,-5.4460926) 133 : (3.8282223,-1.7425342) 134 : (14.974166,34.86072) 135 : (-19.844137,-21.048025) 137 : (23.145348,28.191246) 139 : (3.5891905,9.721224) 142 : (0.5852886,1.3669834) 144 : (-5.7431865,5.893752) 145 : (13.187965,12.4972515) 147 : (-2.4032655,-9.138004) 149 : (22.993021,24.544422) 157 : (-5.1877947,-6.014868) 159 : (-17.272867,-16.517113) 160 : (-29.876955,-40.23668) 161 : (-16.822813,-12.472164) 162 : (-0.6595129,0.25787354) 164 : (35.51503,35.609394) 165 : (-23.43607,-30.415709) 166 : (9.842515,2.944377) 167 : (24.214361,29.503002) 168 : (-23.579342,-39.842453) 170 : (11.822997,18.28223) 171 : (16.668018,21.228556) 173 : (-18.572968,-19.739588) 174 : (5.4933777,-0.5577693) 175 : (1.9450028,4.1181507) 177 : (-19.47439,-19.676298) 179 : (-12.430883,-16.573708) 182 : (-4.7336774,-9.151844) 184 : (-2.7646563,9.710753) 185 : (22.779469,20.718946) 187 : (-25.819782,-30.222664) 188 : (18.511953,21.633574) 189 : (-19.708344,-23.975298) 191 : (17.08098,24.394087) 193 : (-3.0513897,-0.6075697) 195 : (-8.187313,-5.181074) 197 : (33.65944,40.2564) 198 : (-0.64326054,-4.086837) 199 : (-10.554681,-12.706717) 200 : (18.93743,29.3177) 202 : (-5.301973,-15.005705) 208 : (-7.2508016,-14.100331) 209 : (-19.64536,-23.58665) 211 : (-3.6678975,4.9338455) 214 : (-4.1849194,-7.2833357) 215 : (-1.1494977,-7.4395123) 217 : (-2.6624355,11.72216) 218 : (-6.4984765,-9.903734) 222 : (0.2119419,-2.0705266) 226 : (-4.751293,11.307108) 231 : (13.396966,13.482294) 232 : (-10.148484,-9.455285) 233 : (-11.613926,-30.141973) 235 : (-4.1457195,-11.701864) 236 : (22.841429,27.695446) 237 : (20.703121,28.321404) 238 : (2.2251,-9.911165) 240 : (4.6583896,13.250011) 242 : (0.56912243,1.7683926) 248 : (-13.757292,-6.036418) 250 : (-2.0742264,-11.74327) 251 : (-22.361734,-21.731167) 252 : (-4.5171075,-6.9133253) 258 : (-15.887733,-15.204248) 259 : (13.085469,7.5854363) 260 : (17.63313,21.100315) 261 : (7.1418476,0.2580099) 262 : (-14.919332,-23.728527) 263 : (24.858322,28.005262) 266 : (-0.1598835,1.6914234) 267 : (-11.6540985,-19.327158) 270 : (-9.534692,-15.585428) 273 : (23.928104,34.40332) 276 : (12.787605,5.514979) 279 : (0.36071712,-6.126135) 281 : (-6.324025,-4.401108) 284 : (4.8829827,6.8221273) 285 : (-20.047634,-17.415882) 287 : (-6.266363,-7.5843716) 292 : (31.943773,28.52203) 294 : (4.4730716,17.863426) 295 : (-24.903772,-31.832272) 296 : (23.457853,27.188269) 298 : (-5.066526e-2,3.090138) 299 : (-12.440723,-12.220831) 300 : (10.800417,2.0174663) 302 : (21.627502,25.618221) 304 : (-19.292229,-21.6833) 307 : (-7.7303686,-6.4778433) 308 : (16.438334,17.45433) 309 : (18.270615,16.974281) 313 : (-18.940536,-14.294319) 315 : (1.1139888,-9.944632)

fold-2d-product: Ok fold-2d-maximum: Ok fold-2d-minimum: Ok foldseg-sum: Ok scanseg-sum: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

stencil-1D: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

stencil-2D: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

stencil-3D: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

stencil-3x3-cross: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

stencil-3x3-pair: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

stencil2-2D: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

permute-hist: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

backpermute-reverse: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

backpermute-transpose: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

init: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

tail: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

take: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

drop: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

slit: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

gather: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

gather-if: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

scatter: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

scatter-if: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

sasum: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

saxpy: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

dotp: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

filter: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

smvm: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

black-scholes: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

radixsort: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

io: test: fromPtr Int test: fromPtr (Int,Double) test: toPtr Int16 test: toPtr Int32 test: toPtr Int64 test: fromArray Int Ok io: +++ OK, passed 100 tests. +++ OK, passed 100 tests. +++ OK, passed 100 tests. +++ OK, passed 100 tests. +++ OK, passed 100 tests. +++ OK, passed 100 tests. +++ OK, passed 100 tests. Ok canny: Failed: no image file specified integral-image: Failed: no image file specified slices: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

slices: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

slices: ^C[ 3364.241184] [WARN]Received Interrupt signal. Failed: user interrupt slices: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: unspecified launch failure

sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok bound-variables: Ok accelerate-examples: forkOS_entry: interrupted neil@debian-neil:~/.cabal/bin$

On 28/05/13 23:00, Trevor L. McDonell wrote:

Neil, could you try again with the latest version? I managed to create a setup that threw an invalid context error, so the fix for that might help in your situation as well.

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-18548549.

tmcdonell commented 11 years ago

Oops, sorry for all the compilation failures with even. I'm currently hacking on nofib to produce this test case for the context bug, but the local changes aren't ready to push upstream.

Are these the same errors you had initially? This looks more like what we had after the hack to replace forkOS with forkOn 0? That would at least be some progress!

For the "unspecified launch failure errors", we might be trying to launch a kernel that requires more resources than your card provides. Since I haven't tested on an Optimus card before, there might be bugs in the occupancy calculator code.

Try the following?

import Prelude                          as P
import Data.Array.Accelerate            as A
import Data.Array.Accelerate.CUDA

import System.Environment

xs, ys :: Acc (Vector Float)
xs = use $ fromList (Z:.10) [0..]
ys = use $ fromList (Z:.10) [2,4..]

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys
  = A.fold (+) 0
  $ A.zipWith (*) xs ys

main :: IO ()
main
  = withArgs ["-ddump-cc", "-ddump-gc", "-ddump-exec", "-dverbose"]
  $ print
  $ run (dotp xs ys)

You'll need to have installed accelerate-cuda with the -fdebug flag, or just run in ghci using the script in the utils directory (you might have to edit it a bit depending on where you have checked out the individual repositories).

neiljamieso commented 11 years ago

Thanks Trevor,

I'll try that. The fix you did for the invalid context (last email) - was that in cuda or accelerate-cuda? I only rebuilt accelerate-cuda (and dependencies).

The "unspecified launch failure errors" were in the "forkOS" version.
The "forkOn 0" version had the context errors.

I did put the forkOn back in, but not sure I rebuilt the whole sequence properly afterwards.

Cheers, Neil

On 30/05/13 10:54, Trevor L. McDonell wrote:

Oops, sorry for all the compilation failures with |even|. I'm currently hacking on |nofib| to produce this test case for the context bug, but the local changes aren't ready to push upstream.

Are these the same errors you had initially? This looks more like what we had after the hack to replace |forkOS| with |forkOn 0|? That would at least be some progress!

For the "unspecified launch failure errors", we might be trying to launch a kernel that requires more resources than your card provides. Since I haven't tested on an Optimus card before, there might be bugs in the occupancy calculator code.

Try the following?

import Prelude as P import Data.Array.Accelerate as A import Data.Array.Accelerate.CUDA

import System.Environment

xs, ys :: Acc (Vector Float) xs = use $ fromList (Z:.10) [0..] ys = use $ fromList (Z:.10) [2,4..]

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float) dotp xs ys = A.fold (+) 0 $ A.zipWith (*) xs ys

main :: IO () main = withArgs ["-ddump-cc", "-ddump-gc", "-ddump-exec", "-dverbose"] $ print $ run (dotp xs ys)

You'll need to have installed |accelerate-cuda| with the |-fdebug| flag, or just run in |ghci| using the script in the |utils| directory (you might have to edit it a bit depending on where you have checked out the individual repositories).

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-18655607.

neiljamieso commented 11 years ago

Hi Trevor,

This is the output. Are you able to make sense of it? Certainly seems to have worked!

neil@debian-neil:~/.cabal/bin$ optirun ./accelerate-examples --cuda -k -- -fdump-cc 0.03:gc: initialise default context 0.07:gc: initialise context #0x00007f8f1c00b4f0 Device 0: Quadro K1000M (compute capatability 3.0) 1 multiprocessors @ 850.50 MHz (192 cores), 2 GB global memory 0.07:gc: push context: #0x00007f8f1c00b4f0 0.07:gc: initialise CUDA state 0.07:gc: initialise memory table 0.07:cc: initialise kernel table 0.07:cc: persist/restore: 39 entries 0.08:gc: lookup/not found: Array #32 0.08:gc: useArray/malloc: 40 B 0.08:gc: malloc/new 0.08:gc: insert: Array #32 0.08:gc: lookup/not found: Array #31 0.08:gc: useArray/malloc: 40 B 0.08:gc: malloc/new 0.08:gc: insert: Array #31 0.08:cc: (3.0,"\178\140cp$\ACK\226\229\195l\184eF`f3")

include

extern "C" global void foldAll(const DIM1 shIn0, const float* restrict arrIn0_a0, const DIM1 shIn1, const float* restrict arrIn1_a0, const DIM0 shOut, float* restrict arrOut_a0) { extern volatile shared float sdata0[]; float x0; float y0; const Int64 sh0 = min((Int64) shIn0, (Int64) shIn1); const int shapeSize = sh0; const int gridSize = blockDim.x * gridDim.x; int ix = blockDim.x * blockIdx.x + threadIdx.x;

if (ix < shapeSize) { const Int64 v2 = ix; const int v3 = toIndex(shIn0, shape(v2)); const int v4 = toIndex(shIn1, shape(v2));

y0 = arrIn0_a0[v3] * arrIn1_a0[v4]; for (ix += gridSize; ix < shapeSize; ix += gridSize) { const Int64 v2 = ix; const int v3 = toIndex(shIn0, shape(v2)); const int v4 = toIndex(shIn1, shape(v2));

x0 = arrIn0_a0[v3] * arrIn1_a0[v4]; y0 = x0 + y0; } } sdata0[threadIdx.x] = y0; syncthreads(); ix = min(shapeSize - blockIdx.x * blockDim.x, blockDim.x); if (threadIdx.x + 512 < ix) { x0 = sdata0[threadIdx.x + 512]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } syncthreads(); if (threadIdx.x + 256 < ix) { x0 = sdata0[threadIdx.x + 256]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } syncthreads(); if (threadIdx.x + 128 < ix) { x0 = sdata0[threadIdx.x + 128]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } syncthreads(); if (threadIdx.x + 64 < ix) { x0 = sdata0[threadIdx.x + 64]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } __syncthreads(); if (threadIdx.x < 32) { if (threadIdx.x + 32 < ix) { x0 = sdata0[threadIdx.x + 32]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } if (threadIdx.x + 16 < ix) { x0 = sdata0[threadIdx.x + 16]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } if (threadIdx.x + 8 < ix) { x0 = sdata0[threadIdx.x + 8]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } if (threadIdx.x + 4 < ix) { x0 = sdata0[threadIdx.x + 4]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } if (threadIdx.x + 2 < ix) { x0 = sdata0[threadIdx.x + 2]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } if (threadIdx.x + 1 < ix) { x0 = sdata0[threadIdx.x + 1]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } } if (threadIdx.x == 0) { if (shapeSize > 0) { if (gridDim.x == 1) { x0 = 0.0f; y0 = x0 + y0; } arrOut_a0[blockIdx.x] = y0; } else { arrOut_a0[blockIdx.x] = 0.0f; } } }

0.08:cc: (3.0,"\209\181\149\254\136cnX\DEL\171\b\219\160\133\133:")

include

extern "C" global void foldAll(const DIM1 shIn0, const float* restrict arrIn0_a0, const DIM1 shIn1, const float* restrict arrIn1_a0, const DIM0 shOut, float* restrict arrOut_a0, const DIM1 shRec, const float* restrict arrRec_a0) { extern volatile shared float sdata0[]; float x0; float y0; const Int64 sh0 = shRec; const int shapeSize = sh0; const int gridSize = blockDim.x * gridDim.x; int ix = blockDim.x * blockIdx.x + threadIdx.x;

if (ix < shapeSize) { y0 = arrRec_a0[ix]; for (ix += gridSize; ix < shapeSize; ix += gridSize) { x0 = arrRec_a0[ix]; y0 = x0 + y0; } } sdata0[threadIdx.x] = y0; syncthreads(); ix = min(shapeSize - blockIdx.x * blockDim.x, blockDim.x); if (threadIdx.x + 512 < ix) { x0 = sdata0[threadIdx.x + 512]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } syncthreads(); if (threadIdx.x + 256 < ix) { x0 = sdata0[threadIdx.x + 256]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } syncthreads(); if (threadIdx.x + 128 < ix) { x0 = sdata0[threadIdx.x + 128]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } syncthreads(); if (threadIdx.x + 64 < ix) { x0 = sdata0[threadIdx.x + 64]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } __syncthreads(); if (threadIdx.x < 32) { if (threadIdx.x + 32 < ix) { x0 = sdata0[threadIdx.x + 32]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } if (threadIdx.x + 16 < ix) { x0 = sdata0[threadIdx.x + 16]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } if (threadIdx.x + 8 < ix) { x0 = sdata0[threadIdx.x + 8]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } if (threadIdx.x + 4 < ix) { x0 = sdata0[threadIdx.x + 4]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } if (threadIdx.x + 2 < ix) { x0 = sdata0[threadIdx.x + 2]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } if (threadIdx.x + 1 < ix) { x0 = sdata0[threadIdx.x + 1]; y0 = y0 + x0; sdata0[threadIdx.x] = y0; } } if (threadIdx.x == 0) { if (shapeSize > 0) { if (gridDim.x == 1) { x0 = 0.0f; y0 = x0 + y0; } arrOut_a0[blockIdx.x] = y0; } else { arrOut_a0[blockIdx.x] = 0.0f; } } }

0.09:cc: waiting for nvcc... 0.09:cc: queue: 19.000 µs, execute: 1.316 s ... /usr/bin/nvcc -I /home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits -arch=sm_30 -cubin -o /tmp/accelerate-cuda-12605/dragon12606.cubin -O3 -m64 /tmp/accelerate-cuda-12605/dragon12606.cu 0.09:cc: queue: 32.000 µs, execute: 1.319 s ... /usr/bin/nvcc -I /home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits -arch=sm_30 -cubin -o /tmp/accelerate-cuda-12605/dragon12605.cubin -O3 -m64 /tmp/accelerate-cuda-12605/dragon12605.cu 0.09:cc: persist/save: /home/neil/.accelerate/accelerate-cuda-0.14.0.0/cache/3.0/z33Ufz60UFezr184lzr195zr229zr226zrACKzdpczr140zr178 0.09:cc: entry function 'foldAll' used 11 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem ... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks 0.09:gc: lookup/not found: Array #25 0.09:gc: mallocArray: 4 B 0.09:gc: malloc/new 0.09:gc: insert: Array #25 0.09:gc: lookup/found: Array #32 0.09:gc: lookup/found: Array #31 0.09:gc: lookup/found: Array #25 0.09:exec: foldAll<<< 1, 1024, 4096 >>> gpu: 48.128 µs, cpu: 0.000 s 0.09:gc: lookup/found: Array #25 0.09:gc: peekArray: 4 B 0.09:gc: pop context: #0x00007f8f1c00b4f0 Array (Z) [660.0] neil@debian-neil:~/.cabal/bin$

Cheers, Neil

On 30/05/13 10:54, Trevor L. McDonell wrote:

import Prelude as P import Data.Array.Accelerate as A import Data.Array.Accelerate.CUDA

import System.Environment

xs, ys :: Acc (Vector Float) xs = use $ fromList (Z:.10) [0..] ys = use $ fromList (Z:.10) [2,4..]

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float) dotp xs ys = A.fold (+) 0 $ A.zipWith (*) xs ys

main :: IO () main = withArgs ["-ddump-cc", "-ddump-gc", "-ddump-exec", "-dverbose"] $ print $ run (dotp xs ys)

tmcdonell commented 11 years ago

Hi Neil,

hmm, it does indeed seem to have worked. Okay, a couple more things to try, if you don't mind:

Could you run the deviceQueryDrv program from the CUDA SDK examples and show me the output?
I've not encountered an Optimus/Quadro device before, so my current thought is that something is wrong with the calculations that determine how many threads to launch. Try increasing the vector sizes for the test program I sent last time and find the point where it fails. Multiples of 1024 are probably a good increment. Feel free to comment out the line beginning withArgs so that it is less chatty.

Thanks!

tmcdonell commented 11 years ago

Oh, also, did you need to edit Async.hs to use forkOn 0 after I pushed the latest patches, or were the previous results with a clean checkout?

neiljamieso commented 11 years ago

The forkOn 0 no longer makes any difference - i.e all now fail as it did with forkOS.

I'll try the suggestion about cranking up the size of the vectors and get back.

Neil

On 31/05/13 16:05, Trevor L. McDonell wrote:

Oh, also, did you need to edit |Async.hs| to use |forkOn 0| after I pushed the latest patches, or were the previous results with a clean checkout?

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-18724824.

neiljamieso commented 11 years ago

Here's the deviceQueryDrv output:

neil@debian-neil:~/Documents/Computing/CUDASamples/1_Utilities/deviceQueryDrv$ optirun ./deviceQueryDrv ./deviceQueryDrv Starting...

CUDA Device Query (Driver API) statically linked version Detected 1 CUDA Capable device(s)

Device 0: "Quadro K1000M" CUDA Driver Version: 5.0 CUDA Capability Major/Minor version number: 3.0 Total amount of global memory: 2048 MBytes (2147287040 bytes) ( 1) Multiprocessors x (192) CUDA Cores/MP: 192 CUDA Cores GPU Clock rate: 851 MHz (0.85 GHz) Memory Clock rate: 900 Mhz Memory Bus Width: 128-bit L2 Cache Size: 262144 bytes Max Texture Dimension Sizes 1D=(65536) 2D=(65536,65536) 3D=(4096,4096,4096) Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048 Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 2048 Maximum number of threads per block: 1024 Maximum sizes of each dimension of a block: 1024 x 1024 x 64 Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535 Texture alignment: 512 bytes Maximum memory pitch: 2147483647 bytes Concurrent copy and kernel execution: Yes with 1 copy engine(s) Run time limit on kernels: Yes Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Concurrent kernel execution: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Disabled Device supports Unified Addressing (UVA): Yes Device PCI Bus ID / PCI location ID: 1 / 0 Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > neil@debian-neil:~/Documents/Computing/CUDASamples/1_Utilities/deviceQueryDrv$

On 31/05/13 16:03, Trevor L. McDonell wrote:

Hi Neil,

hmm, it does indeed seem to have worked. Okay, a couple more things to try, if you don't mind:

*
Could you run the |deviceQueryDrv| program from the CUDA SDK
examples and show me the output?
*
I've not encountered an Optimus/Quadro device before, so my
current thought is that something is wrong with the calculations
that determine how many threads to launch. Try increasing the
vector sizes for the test program I sent last time and find the
point where it fails. Multiples of 1024 are probably a good
increment. Feel free to comment out the line beginning |withArgs|
so that it is less chatty.
Thanks!

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-18724768.

neiljamieso commented 11 years ago

Remarkably durable...

This is the code

import Prelude as P import Data.Array.Accelerate as A import Data.Array.Accelerate.CUDA

import System.Environment

xs, ys :: Acc (Vector Float) xs = use $ fromList (Z:.1000000) [0..] ys = use $ fromList (Z:.1000000) [2,4..]

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float) dotp xs ys = A.fold (+) 0 $ A.zipWith (*) xs ys

main :: IO () main = withArgs ["-ddump-cc"{--, "-ddump-gc", "-ddump-exec", "-dverbose"--}] $ print $ run (dotp xs ys)

And this is the output

neil@debian-neil:~/.cabal/bin$ optirun ./accelerate-examples --cuda -k 0.12:cc: initialise kernel table 0.12:cc: persist/restore: 41 entries 0.18:cc: found/persistent 0.18:cc: found/persistent 0.18:cc: entry function 'foldAll' used 11 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem ... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks 0.18:cc: entry function 'foldAll' used 8 registers, 0 bytes smem, 0 bytes lmem, 0 bytes cmem ... multiprocessor occupancy 100.0% : 2048 threads over 64 warps in 2 blocks Array (Z) [6.666666e17] neil@debian-neil:~/.cabal/bin$

On 31/05/13 16:03, Trevor L. McDonell wrote:

Hi Neil,

hmm, it does indeed seem to have worked. Okay, a couple more things to try, if you don't mind:

*
Could you run the |deviceQueryDrv| program from the CUDA SDK
examples and show me the output?
*
I've not encountered an Optimus/Quadro device before, so my
current thought is that something is wrong with the calculations
that determine how many threads to launch. Try increasing the
vector sizes for the test program I sent last time and find the
point where it fails. Multiples of 1024 are probably a good
increment. Feel free to comment out the line beginning |withArgs|
so that it is less chatty.
Thanks!

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-18724768.

neiljamieso commented 11 years ago

Changing vector sizes to this...

xs = use $ fromList (Z:.1000000000) [0..] ys = use $ fromList (Z:.1000000000) [2,4..]

Lead to a perfectly reasonable... neil@debian-neil:~/.cabal/bin$ optirun ./accelerate-examples --cuda -k 39.85:cc: initialise kernel table 39.85:cc: persist/restore: 41 entries accelerate-examples: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: out of memory

neil@debian-neil:~/.cabal/bin$

On 31/05/13 16:03, Trevor L. McDonell wrote:

Hi Neil,

hmm, it does indeed seem to have worked. Okay, a couple more things to try, if you don't mind:

*
Could you run the |deviceQueryDrv| program from the CUDA SDK
examples and show me the output?
*
I've not encountered an Optimus/Quadro device before, so my
current thought is that something is wrong with the calculations
that determine how many threads to launch. Try increasing the
vector sizes for the test program I sent last time and find the
point where it fails. Multiples of 1024 are probably a good
increment. Feel free to comment out the line beginning |withArgs|
so that it is less chatty.
Thanks!

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-18724768.

neiljamieso commented 11 years ago

Does the simple dotp example exercise the Async module? This seems to be the source of the crashes.

Cheers, Neil

neiljamieso commented 11 years ago

Oops, Sorry a mis-type there. They are with forkIO (not forkOS). I tried with forkOn 0 and got the same results. Previously forkOn 0 gave more successes and failed with "bad context" message rather than "launch failed". "launch failed" has always happened with forkIO.

On 31/05/13 16:05, Trevor L. McDonell wrote:

Oh, also, did you need to edit |Async.hs| to use |forkOn 0| after I pushed the latest patches, or were the previous results with a clean checkout?

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-18724824.

neiljamieso commented 11 years ago

Hi Trevor,

I mentioned this before, but it may have been lost, and is more of a worry... The fourth slice example fails to terminate (after 40sec). I have to use Ctrl-C to kill it. I'm not sure why this has changed.

Cheers, neil

neiljamieso commented 11 years ago

Hi Trevor,

I thought you might be interested in this. Running the regression test script seems to work - no crashing, no stalling on the slices!

neil@debian-neil:~/Downloads/Accelerate_130529/accelerate-examples-master/accelerate-examples-master$ optirun ./regression_test.sh --cuda

First the main battery of tests:

running with CUDA backend to display available options, rerun with '--help'

map-abs: Ok map-plus: Ok map-square: Ok zip: Ok zipWith-plus: Ok fold-sum: Ok fold-product: Ok fold-maximum: Ok fold-minimum: Ok fold-2d-sum: Ok fold-2d-product: Ok fold-2d-maximum: Ok fold-2d-minimum: Ok foldseg-sum: Ok scanseg-sum: Failed:

0 : (0.0,NaN) 1 : (-0.6929801,-4.2535293e37) 2 : (-1.2756131,NaN) 3 : (-0.9977418,NaN) 4 : (-1.1877143,NaN) 5 : (-1.4590598,NaN) 6 : (-1.465081,NaN) 7 : (-1.5335276,NaN) 8 : (-1.8964667,NaN) 9 : (-2.429172,NaN) 11 : (0.9855077,0.0) 12 : (1.2848983,0.0) 14 : (0.9586575,0.0) 15 : (0.8935447,0.0) 16 : (0.55967414,0.0) 17 : (0.7870643,0.0) 18 : (0.38397616,0.0) 19 : (0.5038597,0.0) 20 : (1.0932949,0.0) 22 : (-0.7802813,0.0) 23 : (-0.90180016,0.0) 24 : (-1.1760286,0.0) 25 : (-0.66521347,0.0) 27 : (0.8123276,0.0) 28 : (1.6648452,0.0) 29 : (1.8714409,0.0) 30 : (1.5091901,0.0) 31 : (2.096872,0.0) 32 : (2.3554232,0.0) 34 : (-0.82877505,0.0) 35 : (-1.8104537,0.0) 36 : (-1.8511171,0.0) 37 : (-1.4023463,0.0) 38 : (-2.062095,0.0) 39 : (-1.5179899,0.0) 40 : (-0.57485485,0.0) 41 : (-1.3017156,0.0) 43 : (-0.56559163,0.0) 44 : (-0.8005209,0.0) 45 : (-0.26718092,0.0) 47 : (-0.42379427,0.0) 48 : (-0.6211059,0.0) 49 : (-1.3470457,0.0) 50 : (-2.2204418,0.0) 51 : (-1.9068379,0.0) 52 : (-2.0748498,0.0) 53 : (-1.0756776,0.0) 54 : (-1.121619,0.0) 55 : (-1.9701061,0.0) 57 : (-0.3139459,0.0) 58 : (-0.46075392,0.0) 59 : (0.50402975,0.0) 60 : (-0.27072406,0.0) 61 : (-0.49237812,0.0) 62 : (-1.2419014,0.0) 63 : (-2.084043,0.0)

stencil-1D: Ok stencil-2D: Ok stencil-3D: Ok stencil-3x3-cross: Ok stencil-3x3-pair: Ok stencil2-2D: Ok permute-hist: Ok backpermute-reverse: Ok backpermute-transpose: Ok init: Ok tail: Ok take: Ok drop: Ok slit: Ok gather: Ok gather-if: Ok scatter: Ok scatter-if: Ok sasum: Ok saxpy: Ok dotp: Ok filter: Ok smvm: Ok black-scholes: Ok radixsort: Ok io: test: fromPtr Int test: fromPtr (Int,Double) test: toPtr Int16 test: toPtr Int32 test: toPtr Int64 test: fromArray Int Ok io: +++ OK, passed 100 tests. +++ OK, passed 100 tests. +++ OK, passed 100 tests. +++ OK, passed 100 tests. +++ OK, passed 100 tests. +++ OK, passed 100 tests. +++ OK, passed 100 tests. Ok canny: Failed: no image file specified integral-image: Failed: no image file specified slices: Ok slices: Ok slices: Ok slices: Ok slices: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok bound-variables: Ok

Next, additional application tests, beginning with mandelbrot:

accelerate-mandelbrot (c) [2011..2013] The Accelerate Team

Usage: accelerate-mandelbrot [OPTIONS]

Available backends: interpreter reference implementation (sequential)

cuda implementation for NVIDIA GPUs (parallel)

Runtime usage: arrows translate display z ; zoom in x q zoom out f single precision calculations d double precision calculations (if supported)

Error: unrecognized option `--size=64'

Run "accelerate-mandelbrot --help" for usage information neil@debian-neil:~/Downloads/Accelerate_130529/accelerate-examples-master/accelerate-examples-master$

neiljamieso commented 11 years ago

Hi Trevor,

I hope you don't mind me sending lots, but I am on a roll at the moment. Using the regressions script I saw the --size option and tried it out. The accelerate-examples work with --size=1024, and fail with --size=2048 (with the "launch failure" message). So this seems to be a size problem rather than some basic fault in the context or launch process. I suspect your thoughts about the calculations for memory usage being wrong are correct.

Actually I can be more specific... 1024 works, 1025 fails.

There are numerous (hundreds of) "fails" in the results not matching the interpreter result in scanseg-sum (but it ran!).

:-) Neil

Also fluid, mandrebrot, etc all run fine. Haven't tried the hashcat.

smoothlife chokes on the default settings. I get a decent animation with the following, but it still slows down and misses frames as the animation progresses. I assume this is a result of having a low end GPU.

neil@debian-neil:~/.cabal/bin$ ./accelerate-smoothlife --cuda --size=64 --sigmode=2 --sigtype=Smooth --framerate=5

Pretty happy now!

Neil

All this with the

neiljamieso commented 11 years ago

Hi Trevor,

I hope you don't mind me sending lots, but I am on a roll at the moment. Using the regressions script I saw the --size option and tried it out. The accelerate-examples work with --size=1024, and fail with --size=2048 (with the "launch failure" message). So this seems to be a size problem rather than some basic fault in the context or launch process. I suspect your thoughts about the calculations for memory usage being wrong are correct.

Actually I can be more specific... 1024 works, 1025 fails.

There are numerous (hundreds of) "fails" in the results not matching the interpreter result in scanseg-sum (but it ran!).

:-) Neil

Also fluid, mandrebrot, etc all run fine. Haven't tried the hashcat.

smoothlife chokes on the default settings. I get a decent animation with the following, but it still slows down and misses frames as the animation progresses. I assume this is a result of having a low end GPU.

neil@debian-neil:~/.cabal/bin$ ./accelerate-smoothlife --cuda --size=64 --sigmode=2 --sigtype=Smooth --framerate=5

Pretty happy now!

Neil

All this with forkIO

tmcdonell commented 11 years ago

The forkOn 0 no longer makes any difference - i.e all now fail as it did with forkOS.

Okay, that's great! I made some changes elsewhere tried to do the same thing but not fixed to CPU zero, so am glad that that works. One problem down!

tmcdonell commented 11 years ago

Does the simple dotp example exercise the Async module? This seems to be the source of the crashes.

Yes, all run invocations will go via Async. I think we fixed the problem there, and the failures now are related to the kernel launches.

tmcdonell commented 11 years ago

Hi Neil,

I hope you don't mind me sending lots, but I am on a roll at the moment.

Not at all, it is all very useful information (:

Using the regressions script I saw the --size option and tried it out. The accelerate-examples work with --size=1024, and fail with --size=2048 (with the "launch failure" message). So this seems to be a size problem rather than some basic fault in the context or launch process. I suspect your thoughts about the calculations for memory usage being wrong are correct.

Actually I can be more specific... 1024 works, 1025 fails.

Ah, that is very helpful, thanks! I'll play around and see if I can dig up anymore leads to follow.

There are numerous (hundreds of) "fails" in the results not matching the interpreter result in scanseg-sum (but it ran!).

A little worrying, but at least it runs! We'll get to that one later (:

Also fluid, mandrebrot, etc all run fine. Haven't tried the hashcat.

Great!

For hashcat you'll need to find a list of plain text words to feed it, and then a bunch of MD5 digests guess. You can use a standard dictionary like /usr/share/dict/english, although for a bit of fun Google for the rockyou list and a list of unknown md5's (:

smoothlife chokes on the default settings. I get a decent animation with the following, but it still slows down and misses frames as the animation progresses. I assume this is a result of having a low end GPU.

neil@debian-neil:~/.cabal/bin$ ./accelerate-smoothlife --cuda --size=64 --sigmode=2 --sigtype=Smooth --framerate=5

I think it depends on whether or not accelerate-fft built against the fast CUDA FFT library implementation. I don't think there is an easy way to check whether this happened or not, aside from just running and measuring the speed. Try:

cabal install accelerate-fft -fcuda

Or just install it after the accelerate-cuda package is already installed. This should probably have better documentation!

-Trev

neiljamieso commented 11 years ago

On 03/06/13 16:04, Trevor L. McDonell wrote:

cabal install accelerate-fft -fcuda Worked! Smoothlife now works beautifully. Amazing speedup in processing.

tmcdonell commented 10 years ago

@neiljamieso does everything work fine now? Some recent fixes to the fold kernel means that those tests should pass now. Do you still have any problems here?

neiljamieso commented 10 years ago

Hi Trev,

How recent a download from Github do I need?

Neil

On 15/11/13 16:02, Trevor L. McDonell wrote:

@neiljamieso https://github.com/neiljamieso does everything work fine now? Some recent fixes to the fold kernel means that those tests should pass now. Do you still have any problems here?

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-28543984.

neiljamieso commented 10 years ago

Hi Trev,

I tried installing the latest accelerate stuff from githib.

The latest accelerate-cuda depends on cuda-1.5.1.1 - the latest cuda in github is 1.5.1.0

On 15/11/13 16:02, Trevor L. McDonell wrote:

@neiljamieso https://github.com/neiljamieso does everything work fine now? Some recent fixes to the fold kernel means that those tests should pass now. Do you still have any problems here?

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-28543984.

mchakravarty commented 10 years ago

@neiljamieso Trev probably forgot to push the version bump. Just change the version in cuda.cabal to 1.5.1.1 and it'll work.

neiljamieso commented 10 years ago

No working so well. I have attached the outputs (with my command line at the front) for standard and verbose outputs.

Neil

On 16/11/13 23:41, Manuel M T Chakravarty wrote:

@neiljamieso https://github.com/neiljamieso Trev probably forgot to push the version bump. Just change the version in |cuda.cabal| to 1.5.1.1 and it'll work.

— Reply to this email directly or view it on GitHub https://github.com/AccelerateHS/accelerate/issues/92#issuecomment-28624089.

neil@debian-neil:~/.cabal/bin$ optirun ./accelerate-examples --cuda --size=1024 -v > verbose_test_131117 /home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits/accelerate_cuda_shape.h:287: int toIndex(Shape, Shape) [with Shape = int]: block: [0,0,0], thread: [28,0,0] Assertion ix >= 0 && ix < sh failed. /home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits/accelerate_cuda_shape.h:287: int toIndex(Shape, Shape) [with Shape = int]: block: [0,0,0], thread: [29,0,0] Assertion ix >= 0 && ix < sh failed. /home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits/accelerate_cuda_shape.h:287: int toIndex(Shape, Shape) [with Shape = int]: block: [0,0,0], thread: [30,0,0] Assertion ix >= 0 && ix < sh failed. /home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits/accelerate_cuda_shape.h:287: int toIndex(Shape, Shape) [with Shape = int]: block: [0,0,0], thread: [31,0,0] Assertion ix >= 0 && ix < sh failed. accelerate-examples: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

neil@debian-neil:~/.cabal/bin$

***_OUTPUT_****

running with CUDA backend to display available options, rerun with '--help'

map-abs: Ok map-plus: Ok map-square: Ok zip: Ok zipWith-plus: Ok fold-sum: Ok fold-product: Ok fold-maximum: Ok fold-minimum: Ok fold-2d-sum: Ok fold-2d-product: Ok fold-2d-maximum: Ok fold-2d-minimum: Ok foldseg-sum: Ok scanseg-sum: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-1D: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-2D: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-3D: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-3x3-cross: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil-3x3-pair: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

stencil2-2D: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

permute-hist: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

backpermute-reverse: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

backpermute-transpose: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

init: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

tail: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

take: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

drop: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

slit: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

gather: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

gather-if: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

scatter: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

scatter-if: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

sasum: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

saxpy: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

dotp: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

filter: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

smvm: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

black-scholes: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

radixsort: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

io: test: fromPtr Int test: fromPtr (Int,Double) test: toPtr Int16 test: toPtr Int32 test: toPtr Int64 test: fromArray Int Ok io: +++ OK, passed 100 tests. +++ OK, passed 100 tests. +++ OK, passed 100 tests. +++ OK, passed 100 tests. +++ OK, passed 100 tests. +++ OK, passed 100 tests. +++ OK, passed 100 tests. Ok canny: Failed: no image file specified integral-image: Failed: no image file specified slices: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

slices: Failed: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok sharing-recovery: Ok bound-variables: Ok

warming up estimating clock resolution... mean is 4.154538 us (160001 iterations) found 1231 outliers among 159999 samples (0.8%) 1094 (0.7%) high severe estimating cost of a clock call... mean is 83.69922 ns (32 iterations) found 4 outliers among 32 samples (12.5%) 3 (9.4%) low mild 1 (3.1%) high mild

benchmarking map-abs

neil@debian-neil:~/.cabal/bin$ optirun ./accelerate-examples --cuda --size=1024 > bare_test_131117 /home/neil/.cabal/share/accelerate-cuda-0.14.0.0/cubits/accelerate_cuda_shape.h:287: int toIndex(Shape, Shape) [with Shape = int]: block: [0,0,0], thread: [31,0,0] Assertion ix >= 0 && ix < sh failed. accelerate-examples: * Internal error in package accelerate * *\ Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues ./Data/Array/Accelerate/CUDA/State.hs:87 (unhandled): CUDA Exception: device-side assert triggered

neil@debian-neil:~/.cabal/bin$

*OUTPUT***