tornadomeet opened this issue 7 years ago
v0.7.0 and v0.8.0 are OK; master brings this error.
I may have hit a similar problem:
[11:09:02] src/nnvm/legacy_json_util.cc:153: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[11:09:03] /data00/tiger/.jenkins/workspace/lab_mxnet/dmlc-core/include/dmlc/./logging.h:300:
[11:09:03] /data00/tiger/.jenkins/workspace/lab_mxnet/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: initialization error

Stack trace returned 6 entries:
[bt] (0) /opt/tiger/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x29) [0x7f241b51d2b9]
[bt] (1) /opt/tiger/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow9SetDeviceINS_3gpuEEEvi+0xb8) [0x7f241bfeb078]
[bt] (2) /opt/tiger/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x20) [0x7f241bfee840]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb6970) [0x7f24afe93970]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4) [0x7f24b400e0a4]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f24b342062d]

terminate called after throwing an instance of 'dmlc::Error'
  what(): [11:09:03] /data00/tiger/.jenkins/workspace/lab_mxnet/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: initialization error

Stack trace returned 6 entries:
[bt] (0) /opt/tiger/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x29) [0x7f241b51d2b9]
[bt] (1) /opt/tiger/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow9SetDeviceINS_3gpuEEEvi+0xb8) [0x7f241bfeb078]
[bt] (2) /opt/tiger/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x20) [0x7f241bfee840]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb6970) [0x7f24afe93970]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4) [0x7f24b400e0a4]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f24b342062d]
The only way to reliably use CUDA with multiprocessing is to import mxnet only after creating the subprocesses.
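A minimal sketch of that ordering, assuming the default fork start method on Linux (the worker function and shapes here are only illustrative, not from this thread):

import numpy as np
from multiprocessing import Process

def worker():
    # mxnet is imported only inside the child, so each child initializes
    # its own CUDA context instead of inheriting a forked one
    import mxnet as mx
    a = mx.nd.array(np.zeros((4, 4)), mx.gpu(0))
    print(a.asnumpy().sum())

if __name__ == '__main__':
    procs = [Process(target=worker) for _ in range(2)]
    for p in procs:
        p.start()

    # the parent is free to import mxnet once the children have been started
    import mxnet as mx
    b = mx.nd.ones((4, 4), ctx=mx.gpu(0))

    for p in procs:
        p.join()
    print('done')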
I don't have any problems executing the above code with a current version of mxnet. @piiswrong do you have any insight into why it works now compared to earlier this year? @tornadomeet do you still experience this issue? Perhaps it is related to a different CUDA version/system configuration. https://github.com/dmlc/mxnet/pull/4695 seems to contain the fix.
In general I believe using python multiprocessing and specifying the forkserver start method before importing mxnet should be a workaround for any cuda related multiprocessing issues. In particular it should still allow creating new processes after mxnet was imported, as the processes are forked from the forkserver which has no cuda context. This also seems to be what pytorch is doing.
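A rough sketch of that workaround, assuming a Linux system where the forkserver start method is available (the worker function here is hypothetical, not taken from the thread):

import multiprocessing as mp

def worker(i):
    # each forkserver child starts from a process without a CUDA context
    # and initializes its own when mxnet is imported here
    import mxnet as mx
    print(i, mx.nd.ones((2, 2), ctx=mx.gpu(0)).asnumpy())

if __name__ == '__main__':
    mp.set_start_method('forkserver')  # choose the start method before mxnet touches CUDA

    import mxnet as mx                 # the parent may still use the GPU afterwards
    parent = mx.nd.ones((2, 2), ctx=mx.gpu(0))

    procs = [mp.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()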
This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!
So, is there no solution to this problem?
import numpy as np
import mxnet as mx
from multiprocessing import Process, current_process

def test():
    print("process id is {:s}".format(current_process().name))
    a = mx.nd.array(np.zeros((100, 100, 100, 100)), mx.gpu(0))
    a.asnumpy()

if __name__ == '__main__':
    # worker_count = multiprocessing.cpu_count() - 2
    worker_count = 8
    runs = [Process(target=test) for i in range(worker_count)]  # 1 or 2 or N processes give the same error
    for p in runs:
        p.start()
    for p in runs:
        p.join()
    print("done!")
It is magical! I found it is OK when I set worker_count to less than 8, but it does not work when worker_count is 8 or more!
@mxnet-label-bot add [Python, Bug]
@szha Has this issue been resolved? I have not been able to reproduce the exact issue; it only starts to fail when the GPU runs out of memory. I have been able to spawn more than 10 workers with the example script. I see a related PR has been merged in the dmlc/gluon-nlp repo.
@leezu might still have some issue with it, so let's wait for his comment too.
Here is an updated test case
import numpy as np
import mxnet as mx
from multiprocessing import Process, current_process

def test():
    # a CPU-only looking call: mx.random.seed still touches CUDA internally
    mx.random.seed(1)

if __name__ == '__main__':
    # initialize CUDA in the parent before the children are started
    a = mx.nd.random_normal(shape=(10, 10), ctx=mx.gpu(0))
    runs = [Process(target=test) for i in range(1)]
    for p in runs:
        p.start()
    for p in runs:
        p.join()
Here CUDA is initialized in the parent process before the child processes are started. You may argue that GPU operations in the child processes should not be supported, but then the situation must be handled gracefully, i.e. an error should be thrown on the Python side rather than the C++ side. But let's accept the current C++ exception. Even then, if we only want to do CPU work in the child process, the above example will crash because mx.random.seed calls some CUDA-related code internally. So there is currently no way to execute code deterministically in the child processes, and code may crash at unexpected times (such as when calling random.seed).
@leezu
Here is something a bit more complex that works; I thought someone else might come here looking for a solution. It does not work unless you force mp.set_start_method('forkserver', force=True).
import random
import numpy as np
import mxnet as mx
import multiprocessing as mp

def test():
    mx.random.seed(random.randint(10, 200))
    a = mx.nd.random_normal(shape=(2, 2), ctx=mx.gpu(0))
    print('child no. ', mp.current_process().name, ':', a)

if __name__ == '__main__':
    mp.set_start_method('forkserver', force=True)
    ab = mx.nd.random_normal(shape=(2, 2), ctx=mx.gpu(0))
    print('main proc.: ', ab)
    runs = [mp.Process(target=test) for i in range(3)]
    for p in runs:
        p.start()
    for p in runs:
        p.join()
    print('done')
Hope it helps.
Still facing this issue; it makes things unworkable, and I now have to change the entire architecture of the application because of it.
@mxnet-label-bot add [Backend]
Related: https://github.com/apache/incubator-mxnet/issues/14979
Forking the library is not supported as of now.
I also can't reproduce this with the latest master
In [2]: import numpy as np
   ...: import mxnet as mx
   ...: from multiprocessing import Process, current_process
   ...:
   ...: def test():
   ...:     print("process id is {:s}".format(current_process().name))
   ...:     a = mx.nd.array(np.zeros((100, 100, 100, 100)), mx.gpu(0))
   ...:     a.asnumpy()
   ...:
   ...: if __name__ == '__main__':
   ...:     runs = [Process(target=test) for i in range(2)]  # 1 or 2 or N process is the same error
   ...:     for p in runs:
   ...:         p.start()
   ...:     for p in runs:
   ...:         p.join()
   ...:     print("done!")
   ...:
process id is Process-2
process id is Process-3
done!

In [1]: import numpy as np
   ...: import mxnet as mx
   ...: from multiprocessing import Process, current_process
   ...:
   ...: def test():
   ...:     print("process id is {:s}".format(current_process().name))
   ...:     a = mx.nd.array(np.zeros((100, 100, 100, 100)), mx.gpu(0))
   ...:     a.asnumpy()
   ...:
   ...: if __name__ == '__main__':
   ...:     runs = [Process(target=test) for i in range(1)]  # 1 or 2 or N process is the same error
   ...:     for p in runs:
   ...:         p.start()
   ...:     for p in runs:
   ...:         p.join()
   ...:     print("done!")
   ...:
process id is Process-1
done!
@PascalIversen provided a new reproducer: https://github.com/apache/incubator-mxnet/issues/19291
Reproduce code:
OS: Linux, CentOS 7 + CUDA 7.5 + cuDNN 5.1
Log: