HannoSpreeuw / Kernel-tuning-for-Sagecal

Individual kernels from Sagecal are tuned here

tune_kernel_coherencies sometimes gives out of memory error #8

Open benvanwerkhoven opened 6 years ago

benvanwerkhoven commented 6 years ago

@HannoSpreeuw

./tune_kernel_coherencies.py often gives ........ pycuda._driver.MemoryError: cuModuleLoadData failed: out of memory - PyCUDA WARNING: a clean-up operation failed (dead context maybe?)

We need to find what exactly the problem is here. What version of Kernel Tuner are you using at the moment?

HannoSpreeuw commented 6 years ago

0.1.6

benvanwerkhoven commented 6 years ago

Hmm that's odd, I'll see if I can reproduce. Any clues on when this occurs?

HannoSpreeuw commented 6 years ago

Especially with a block size of 1024, as far as I can remember. That is why I had Kernel Tuner use smaller block sizes, roughly as in the sketch below.
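
A minimal, hypothetical sketch of restricting the block sizes that Kernel Tuner explores, so the 1024-thread configuration is never attempted. The kernel name, source file, and argument list are placeholders, not the actual contents of tune_kernel_coherencies.py:

```python
# Hypothetical sketch: keep 1024 out of the tunable block sizes.
# Kernel name, source file and argument list are placeholders.
import numpy as np
import kernel_tuner

with open("kernel_coherencies.cu") as f:   # placeholder source file
    kernel_string = f.read()

n = np.int32(1024 * 1024)                  # placeholder problem size
output = np.zeros(int(n), dtype=np.float32)
args = [output, n]                         # placeholder argument list

tune_params = dict()
tune_params["block_size_x"] = [32, 64, 128, 256, 512]   # 1024 deliberately left out
tune_params["use_kernel"] = [0, 1]

results = kernel_tuner.tune_kernel("kernel_coherencies", kernel_string,
                                   int(n), args, tune_params)
```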

benvanwerkhoven commented 6 years ago

I'm getting the following:

./tune_kernel_coherencies.py
N 61 B 366000 T 200 K 150 F 10
Using: GeForce GTX TITAN X
With slave kernel:
Using: GeForce GTX TITAN X
use_kernel=1, block_size_x=32, time=7349.65849609
best performing configuration: use_kernel=1, block_size_x=32, time=7349.65849609
Without slave kernel:
Using: GeForce GTX TITAN X
block_size_x=32, use_kernel=0, time=53.7468734741
block_size_x=64, use_kernel=0, time=53.6964294434
block_size_x=128, use_kernel=0, time=53.7446594238
block_size_x=256, use_kernel=0, time=65.9320968628
block_size_x=512, use_kernel=0, time=66.2752319336
skipping config kernel_coherencies_1024_0 reason: too many resources requested for launch
best performing configuration: block_size_x=64, use_kernel=0, time=53.6964294434

It seems block size 1024 can't run because it uses too many registers, but that's not the same as the out-of-memory error you showed me earlier.
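
A hedged sketch of how the compiled kernel's resource usage could be inspected with PyCUDA, to confirm why a 1024-thread block is rejected with "too many resources requested for launch". The file name and kernel name are placeholders:

```python
# Sketch: query per-thread register and shared memory usage of the compiled kernel.
import pycuda.autoinit
from pycuda.compiler import SourceModule

with open("kernel_coherencies.cu") as f:            # placeholder source file
    module = SourceModule(f.read())

func = module.get_function("kernel_coherencies")    # placeholder kernel name

print("registers per thread:        ", func.num_regs)
print("static shared memory (bytes):", func.shared_size_bytes)
print("max threads per block:       ", func.max_threads_per_block)
# On a GeForce GTX TITAN X (Maxwell) a thread block can use at most 65536 registers,
# so e.g. 65 registers per thread already rules out 1024-thread blocks while
# 512-thread blocks still launch fine.
```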

HannoSpreeuw commented 6 years ago

I think you are referring to my email on July 7th. I was running tune_kernel_array_beam_slave_sincos.py. Your fix worked. I replied: "Thanks! In the meantime I had already tried all those options, except removing extern before shared. That was it!" I don't think it had to do with kernel_coherencies.py.
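
For context, a toy PyCUDA example (not the actual Sagecal kernel) illustrating the difference between extern __shared__ (dynamic shared memory, size supplied per launch via shared=<bytes>) and a statically sized __shared__ array, which needs no launch-time size:

```python
# Toy example: dynamic vs. static shared memory in a kernel launched with PyCUDA.
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale_dynamic(float *out, const float *in, float f)
{
    extern __shared__ float buf[];   // size chosen at launch time
    int i = threadIdx.x;
    buf[i] = in[i];
    __syncthreads();
    out[i] = f * buf[i];
}

__global__ void scale_static(float *out, const float *in, float f)
{
    __shared__ float buf[256];       // size fixed at compile time
    int i = threadIdx.x;
    buf[i] = in[i];
    __syncthreads();
    out[i] = f * buf[i];
}
""")

n = 256
a = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

dyn = mod.get_function("scale_dynamic")
dyn(drv.Out(out), drv.In(a), np.float32(2.0),
    block=(n, 1, 1), grid=(1, 1), shared=n * 4)   # dynamic shared size in bytes

sta = mod.get_function("scale_static")
sta(drv.Out(out), drv.In(a), np.float32(2.0),
    block=(n, 1, 1), grid=(1, 1))                 # no shared size argument needed
```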

benvanwerkhoven commented 6 years ago

I was actually following up on what you said on Slack on the 14th of September, but I can't seem to reproduce a memory error with tune_kernel_coherencies.py.

Is it possible that you were running multiple things on that GPU at the same time?
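
A quick sketch of how to check whether another process is already occupying device memory before tuning starts:

```python
# Report free and total device memory (bytes) via the CUDA driver API.
import pycuda.autoinit
import pycuda.driver as drv

free, total = drv.mem_get_info()
print("GPU memory free: %.2f GiB of %.2f GiB" % (free / 2.0**30, total / 2.0**30))
```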

HannoSpreeuw commented 6 years ago

Ah, sorry, I'll check what I posted on Slack.

HannoSpreeuw commented 6 years ago

I am having a hard time reproducing that warning. I spent an hour or two on node050 trying to do so. As I mentioned, it does not occur consistently, even with the same settings.

Unfortunately I did not freeze the tune_kernel_coherencies.py script when it occurred.