IDSIA / brainstorm

Fast, flexible and fun neural networks.

PyCudaHandler is broken #38

Closed flukeskywalker closed 8 years ago

flukeskywalker commented 8 years ago

The recently added functionality seems to have broken the PyCudaHandler (including its tests). Running the tests results in:

Traceback (most recent call last):
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/_pytest/config.py", line 543, in importconftest
    mod = conftestpath.pyimport()
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/py/_path/local.py", line 650, in pyimport
    __import__(modname)
  File "/home/arkade/Dropbox/codes/brainstorm/test/conftest.py", line 7, in <module>
    from brainstorm.structure.architecture import (
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/__init__.py", line 5, in <module>
    from brainstorm.structure import *
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/__init__.py", line 4, in <module>
    from brainstorm.structure.network import Network
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/network.py", line 11, in <module>
    from brainstorm.structure.buffers import BufferManager
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/buffers.py", line 6, in <module>
    from brainstorm.handlers import default_handler
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/__init__.py", line 10, in <module>
    from brainstorm.handlers.pycuda_handler import PyCudaHandler
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 40, in <module>
    class PyCudaHandler(Handler):
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 79, in PyCudaHandler
    array_type = pycuda.gpuarray.GPUArray
NameError: name 'pycuda' is not defined

Changing line 79 in pycuda_handler.py to array_type = gpuarray.GPUArray gets past the NameError (a sketch of the change is below).
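The change itself is tiny; a minimal sketch, assuming pycuda_handler.py already imports gpuarray from pycuda near the top (the surrounding code may of course differ):

# Sketch of the one-line fix at pycuda_handler.py line 79; assumes
# `from pycuda import gpuarray` is already present in the module.
array_type = gpuarray.GPUArray  # was: pycuda.gpuarray.GPUArray

With that change in place, running the tests hits a bigger problem: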

Traceback (most recent call last):
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/_pytest/config.py", line 543, in importconftest
    mod = conftestpath.pyimport()
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/py/_path/local.py", line 650, in pyimport
    __import__(modname)
  File "/home/arkade/Dropbox/codes/brainstorm/test/conftest.py", line 7, in <module>
    from brainstorm.structure.architecture import (
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/__init__.py", line 5, in <module>
    from brainstorm.structure import *
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/__init__.py", line 4, in <module>
    from brainstorm.structure.network import Network
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/network.py", line 11, in <module>
    from brainstorm.structure.buffers import BufferManager
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/buffers.py", line 6, in <module>
    from brainstorm.handlers import default_handler
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/__init__.py", line 10, in <module>
    from brainstorm.handlers.pycuda_handler import PyCudaHandler
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 627, in <module>
    _mod = SourceModule(__softmax_kernel_code)
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/pycuda/compiler.py", line 262, in __init__
    self.module = module_from_buffer(cubin)
LogicError: cuModuleLoadDataEx failed: invalid device context -
untom commented 8 years ago

Totally my fault :D Should be fixed now. Added bonus: you can now choose which GPU to use, and GPU memory now goes through a memory manager.
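For reference (not the actual PyCudaHandler code): with PyCUDA, picking a specific GPU and routing allocations through a memory pool usually looks roughly like the sketch below; the device index 1 and the 4 KiB size are purely illustrative.

import pycuda.driver as drv
from pycuda.tools import DeviceMemoryPool

drv.init()
ctx = drv.Device(1).make_context()  # make GPU #1 the current device
pool = DeviceMemoryPool()           # recycles freed blocks instead of
                                    # calling cudaMalloc/cudaFree every time

buf = pool.allocate(4 * 1024)       # 4 KiB device allocation from the pool
buf.free()                          # returned to the pool, not the driver

ctx.pop()                           # keep the context stack clean at exit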

flukeskywalker commented 8 years ago

@untom, thanks for the quick fix! It looks like you forgot to push the regression layer, though, since now I get:

Traceback (most recent call last):
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/_pytest/config.py", line 543, in importconftest
    mod = conftestpath.pyimport()
  File "/home/arkade/venv/py2/local/lib/python2.7/site-packages/py/_path/local.py", line 650, in pyimport
    __import__(modname)
  File "/home/arkade/Dropbox/codes/brainstorm/test/conftest.py", line 7, in <module>
    from brainstorm.structure.architecture import (
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/__init__.py", line 5, in <module>
    from brainstorm.structure import *
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/__init__.py", line 4, in <module>
    from brainstorm.structure.network import Network
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/network.py", line 8, in <module>
    from brainstorm.structure.architecture import (
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/structure/architecture.py", line 12, in <module>
    from brainstorm.layers.base_layer import get_layer_class_from_typename
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/layers/__init__.py", line 13, in <module>
    from brainstorm.layers.regression_layer import Regression
ImportError: No module named regression_layer
untom commented 8 years ago

Ahh, crap... please remove that line in the meantime; the RegressionLayer doesn't currently pass all tests. Sorry for having committed that.

flukeskywalker commented 8 years ago

No worries about that, but commenting that out reveals a deeper problem:

========================================================= FAILURES ==========================================================
____________________________________________ test_limit_incoming_weights_squared ____________________________________________

    def test_limit_incoming_weights_squared():

        for orig in (np.random.rand(4, 5), np.random.randn(3, 5, 4, 6)):
            for limit in [0.00001, 1, 10, 10000]:
                x = orig.reshape(orig.shape[0], orig.size / orig.shape[0]).copy()
                divisor = (x * x).sum(axis=1, keepdims=True) ** 0.5 / limit
                divisor[divisor < 1] = 1
                out = (x / divisor).reshape(orig.shape)

                y = orig.copy()
                mod = ConstrainL2Norm(limit)
                mod(default_handler, y)
                assert np.allclose(y, out)

                handler = PyCudaHandler()
                y = handler.create_from_numpy(orig)
>               mod(handler, y)

test/test_weight_modifiers.py:25: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
brainstorm/value_modifiers.py:74: in __call__
    handler.mult_st(1 / self.limit, sq_norm ** 0.5, out=divisor)
brainstorm/handlers/pycuda_handler.py:210: in mult_st
    mult_st_kernel(a, b, out)
../../../venv/py2/local/lib/python2.7/site-packages/pycuda/elementwise.py:236: in __call__
    func.prepared_async_call(grid, block, stream, *invocation_args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

func = <pycuda._driver.Function object at 0x7f1d72660398>, grid = (1, 1), block = (32, 1, 1), stream = None
args = (99999.99999999999, <pycuda._driver.DeviceAllocation object at 0x7f1d725cebb0>, <pycuda._driver.DeviceAllocation object at 0x7f1d725ce980>, 4)
kwargs = {}

    def function_prepared_async_call(func, grid, block, stream, *args, **kwargs):
        if isinstance(block, tuple):
>           func._set_block_shape(*block)
E           LogicError: cuFuncSetBlockShape failed: invalid resource handle

../../../venv/py2/local/lib/python2.7/site-packages/pycuda/driver.py:492: LogicError
=========================================== 1 failed, 890 passed in 18.77 seconds ===========================================
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 35, in _shutdown
    _pycuda_context.pop()
AttributeError: 'NoneType' object has no attribute 'pop'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 35, in _shutdown
    _pycuda_context.pop()
AttributeError: 'NoneType' object has no attribute 'pop'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 35, in _shutdown
    _pycuda_context.pop()
AttributeError: 'NoneType' object has no attribute 'pop'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 35, in _shutdown
    _pycuda_context.pop()
AttributeError: 'NoneType' object has no attribute 'pop'
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 35, in _shutdown
    _pycuda_context.pop()
AttributeError: 'NoneType' object has no attribute 'pop'
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------
Aborted (core dumped)
untom commented 8 years ago

Ahh crap, I thought I'd fixed that :D

untom commented 8 years ago

Damn, that was hard to debug... I couldn't find a way to have more than one PyCudaHandler instantiated. I still don't know what to do about it; the only workaround appears to be to change that test case so that only one PyCudaHandler is instantiated.
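For reference, the context handling that the tracebacks point at (the atexit-registered _shutdown in pycuda_handler.py) presumably follows the usual PyCUDA recipe; a minimal sketch, not the actual module code. The crux is that compiled kernels and device allocations are tied to the context that was current when they were created, which is what makes a second handler (and hence a second context) awkward.

import atexit
import pycuda.driver as drv

drv.init()
_pycuda_context = drv.Device(0).make_context()  # pushed as the current context

def _shutdown():
    # Pop the context at exit so PyCUDA's own cleanup finds an empty stack
    # (avoids the "context stack was not empty" abort shown above).
    global _pycuda_context
    if _pycuda_context is not None:
        _pycuda_context.pop()
        _pycuda_context = None

atexit.register(_shutdown)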

flukeskywalker commented 8 years ago

That did not fix it though...

============================================================= FAILURES =============================================================
_______________________________________________ test_limit_incoming_weights_squared ________________________________________________

    def test_limit_incoming_weights_squared():
        handler = PyCudaHandler()
        for orig in (np.random.rand(4, 5), np.random.randn(3, 5, 4, 6)):
            for limit in [0.00001, 1, 10, 10000]:
                x = orig.reshape(orig.shape[0], orig.size / orig.shape[0]).copy()
                divisor = (x * x).sum(axis=1, keepdims=True) ** 0.5 / limit
                divisor[divisor < 1] = 1
                out = (x / divisor).reshape(orig.shape)

                y = orig.copy()
                mod = ConstrainL2Norm(limit)
                mod(default_handler, y)
                assert np.allclose(y, out)

                y = handler.create_from_numpy(orig)
>               mod(handler, y)

test/test_weight_modifiers.py:24: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
brainstorm/value_modifiers.py:74: in __call__
    handler.mult_st(1 / self.limit, sq_norm ** 0.5, out=divisor)
brainstorm/handlers/pycuda_handler.py:210: in mult_st
    mult_st_kernel(a, b, out)
../../../venv/py2/local/lib/python2.7/site-packages/pycuda/elementwise.py:236: in __call__
    func.prepared_async_call(grid, block, stream, *invocation_args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

func = <pycuda._driver.Function object at 0x7f0b26429500>, grid = (1, 1), block = (32, 1, 1), stream = None
args = (99999.99999999999, <pycuda._driver.DeviceAllocation object at 0x7f0b10057bb0>, <pycuda._driver.DeviceAllocation object at 0x7f0b10057980>, 4)
kwargs = {}

    def function_prepared_async_call(func, grid, block, stream, *args, **kwargs):
        if isinstance(block, tuple):
>           func._set_block_shape(*block)
E           LogicError: cuFuncSetBlockShape failed: invalid resource handle

../../../venv/py2/local/lib/python2.7/site-packages/pycuda/driver.py:492: LogicError
============================================== 1 failed, 890 passed in 17.69 seconds ===============================================
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 35, in _shutdown
    _pycuda_context.pop()
AttributeError: 'NoneType' object has no attribute 'pop'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 35, in _shutdown
    _pycuda_context.pop()
AttributeError: 'NoneType' object has no attribute 'pop'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 35, in _shutdown
    _pycuda_context.pop()
AttributeError: 'NoneType' object has no attribute 'pop'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 35, in _shutdown
    _pycuda_context.pop()
AttributeError: 'NoneType' object has no attribute 'pop'
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/arkade/Dropbox/codes/brainstorm/brainstorm/handlers/pycuda_handler.py", line 35, in _shutdown
    _pycuda_context.pop()
AttributeError: 'NoneType' object has no attribute 'pop'
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------
Aborted (core dumped)
flukeskywalker commented 8 years ago

If I understand correctly, the motivation for the recent changes is to allow instantiating multiple handler instances. If so, we can also keep this as a TODO for the future, since it's probably not usually needed. Is there a use case for this functionality?

untom commented 8 years ago

No, the motivation is to specify the GPU device you want to use. Say you have 4 GPUs in one server and you want to work on GPU #2 (e.g. because the others are already occupied by some other process). There is currently no way to do this.

But you're right: since we apparently can't get this working cleanly right now, it's probably best to revert this change for now.

flukeskywalker commented 8 years ago

Umm... I've been doing this simply by running CUDA_DEVICE=1 python mnist.py for some time.
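For reference, PyCUDA's default-context helpers (pycuda.autoinit / pycuda.tools.make_default_context) read the CUDA_DEVICE environment variable; a minimal standalone sketch (independent of brainstorm) showing the same device selection from inside Python:

import os
os.environ.setdefault('CUDA_DEVICE', '1')  # must be set before CUDA is initialized

import pycuda.autoinit                     # creates a context on that device
print(pycuda.autoinit.device.name())       # confirm which GPU was picked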

untom commented 8 years ago

How did I not know about that? ^^'

flukeskywalker commented 8 years ago

:D Haha you're supposed to be our PyCUDA guy!

flukeskywalker commented 8 years ago

Anyway, unless this can be fixed easily, I guess you can roll back these changes to a working state for now, since device selection is possible in other ways.

untom commented 8 years ago

Did this test case ever work? Even after reverting the changes, it still fails on my machines. (Also, is Jabber down today, or is it just me?)

flukeskywalker commented 8 years ago

It passes at ac70bfdc06781cdaaf23c55a28c45f154acc3cc3. I am able to join the room, at least... not sure if chats actually work. This is the only thing I use Jabber for :P

flukeskywalker commented 8 years ago

Closing now that things are stable at 866457126a86f09ffef74fe948b1669c26c92d2a.