Open kohillyang opened 6 years ago
The latest version of mxnet also has this bug.
@mxnet-label-bot add [Gluon, Thread Safety, Bug]
@kohillyang
Thanks for submitting the issue. I have added the labels so that community members can provide help.
I'm facing the same problem. Has this issue been fixed?
mxnet in general is not thread safe. You can accomplish the above using multiprocessing.
```python
import multiprocessing as mp
import gluoncv
import mxnet as mx

net = gluoncv.model_zoo.resnet18_v1b(pretrained=True)
net.hybridize()

def worker(module, input, outputs):
    outnd = module(input)  # type: mx.nd.NDArray
    outnd.wait_to_read()
    outputs.put(outnd)

ps = []
outputs = mp.Queue(5)
for i in range(3):
    input1 = mx.random.randn(1, 3, 368, 368)
    p = mp.Process(target=worker, args=(net, input1, outputs))
    ps.append(p)
for p in ps:
    p.start()
for p in ps:
    p.join()
while not outputs.empty():
    print(outputs.get().shape)
```
But unlike PyTorch, it is not possible to optimize the network when using Process instead. An inconvenient workaround I found is to run inference once before pushing the network into sub-threads.
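The warm-up workaround above can be sketched as follows. This is a minimal illustration with a stand-in class, not the real Gluon API: the idea is that the first forward pass builds and caches the computation graph, which is the thread-unsafe step, so doing it once in the main thread before spawning workers avoids the race.

```python
import threading

class FakeNet:
    """Stand-in for a hybridized Gluon network (illustrative only)."""
    def __init__(self):
        self.compiled = False

    def __call__(self, x):
        if not self.compiled:      # stands in for graph construction,
            self.compiled = True   # the step that is not thread-safe
        return x * 2

net = FakeNet()
net(1)  # warm-up forward pass in the main thread

results = []
lock = threading.Lock()

def worker(x):
    y = net(x)  # workers now only execute the already-built graph
    with lock:
        results.append(y)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# sorted(results) == [0, 2, 4]
```

With the real network, the warm-up call would be `net(dummy_input)` with an input of the expected shape and context.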
@kohillyang
In my opinion, supporting multi-threading in Python would hurt performance, because we would need to add locks to keep things thread-safe.
I think it is better to use multi-processing in Python, since Python has the GIL, which makes multi-threading effectively fake. We could pass NDArray objects through a Pipe, as the Gluon DataLoader does.
Could you please point to some projects that use multi-threading to optimize a network? We may support multi-threading in Python if it is necessary. Thank you!
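The multi-process pattern suggested here (passing data through a Pipe) looks roughly like the sketch below. It uses plain Python lists as the payload for illustration; in the real case the payload would be an NDArray. Note this relies on the default `fork` start method on Linux; on platforms that spawn, the top-level code would need an `if __name__ == "__main__":` guard.

```python
import multiprocessing as mp

def worker(conn):
    # Receive a payload from the parent, double each element, send it back.
    data = conn.recv()
    conn.send([x * 2 for x in data])
    conn.close()

parent_conn, child_conn = mp.Pipe()
p = mp.Process(target=worker, args=(child_conn,))
p.start()
parent_conn.send([1, 2, 3])   # in the real case: an NDArray
result = parent_conn.recv()   # [2, 4, 6]
p.join()
```

This sidesteps the GIL entirely, at the cost of serializing data across the process boundary, which is what makes optimizing a shared network from multiple processes awkward.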
I submitted a PR just now, which may support a multi-threading environment for Gluon.
https://github.com/apache/incubator-mxnet/pull/14344
BTW, in the test case in this issue, `outputs = [None for _ in range(len(ctx_list))]`
@wkcn, glad to see this issue is going to be resolved. One case where threading is needed is when some operators are written in numpy, especially when the batch size is small, which is common in object detection. Using multi-threading gives a speed improvement of about 15% according to my test on 8x P40.
Anyway, at least according to my tests, mxnet already supports multi-threaded training: for example, https://github.com/dmlc/gluon-cv/blob/master/gluoncv/utils/parallel.py, and https://github.com/dmlc/gluon-cv/blob/master/scripts/segmentation/train.py uses parallel.py to speed up training. There may be no extra work needed.
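The thread-per-GPU pattern used by gluoncv's parallel.py can be sketched as below. The function and names here are illustrative placeholders, not gluoncv's actual API: each thread handles one device's slice of the batch and writes its result into a shared, pre-sized list, so no locking is needed.

```python
import threading

def forward_on_device(device_id, batch_slice, outputs):
    # Placeholder per-device compute; in parallel.py this would be a
    # forward (and backward) pass on the network copy on GPU device_id.
    outputs[device_id] = [x + device_id for x in batch_slice]

num_devices = 2
batch_slices = [[1, 2], [3, 4]]   # one slice of the batch per device
outputs = [None] * num_devices

threads = [
    threading.Thread(target=forward_on_device, args=(i, batch_slices[i], outputs))
    for i in range(num_devices)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# outputs == [[1, 2], [4, 5]]
```

Because each thread writes to a distinct index of `outputs`, the threads never contend on shared state; the GIL is released while the heavy work runs in the backend, which is why this helps despite Python threading.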
@kohillyang Thank you! In object detection, do you mean that the proposal and proposal_target layers are custom operators written in numpy, and that using multi-threading to execute these numpy operators in parallel can accelerate them?
Yes. Another case is code written with Block rather than HybridBlock, when the code can hardly be packaged into a single operator and asnumpy is called (sometimes because dynamic shape inference is almost impossible). In this case, if more than one GPU is used and multi-threading is not, the network cannot easily be parallelized.
Since dynamic networks are becoming more and more popular, I think supporting multi-threading is needed.
@kohillyang Hi! Could you please provide example code showing how to run an operator written in numpy in parallel? Thanks!
I see. There is only one thread executing the Custom Operator.
Did you modify src/operator/custom/custom-inl.h to support multi-threading?
I didn't modify src/operator/custom/custom-inl.h, but there can be more than one thread executing the custom operator. I mean, considering there is only one network, it has an individual copy on each GPU, so the copies can be treated as several independent networks when forwarding. If we have n GPUs, we execute n threads, one thread per GPU, to run inference and back-propagation on these networks. Then there should be n threads executing the custom operator. Since the GIL is released while C++ code is executing, and to the best of my knowledge there is no lock in mxnet in this case that forces the custom operator to run in only one thread, using multi-threading can speed these operators up.
But I'm not sure whether mxnet forces only one thread to execute the custom operator.
@kohillyang I submitted a PR to support multi-threading for Custom Operators: https://github.com/apache/incubator-mxnet/pull/14363, but I do not know how you accelerated it without modifying custom-inl.h. Could you upload your code? Did you use Python multi-threading to implement it?
@wkcn I have to admit that I was wrong. I tried to write a small test case and found it is absolutely impossible to run the same CustomOp in different threads. Worse, I found my program sometimes crashes when multi-threading is used.
Here is my test code:
```python
import os
# os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"
import mxnet as mx
import time
import threading
import numpy as np
import cv2

cv2.setNumThreads(1)  # Sometimes needed to avoid deadlock, especially in multi-processing environments.

class TestOP(mx.operator.CustomOp):
    def __init__(self, *args, **kwargs):
        super(TestOP, self).__init__(*args, **kwargs)
        print("init")

    def forward(self, is_train, req, in_data, out_data, aux):
        try:
            x = in_data[0].asnumpy()
            print("ss")
            x = np.ones(shape=(1024, 1024, 300))
            x_resized = cv2.resize(x, (0, 0), fx=0.5, fy=0.5)
            x_resized_sum = x_resized.sum()
            print('ee', x_resized_sum)
        except Exception as e:
            print(e)

@mx.operator.register("test_op")
class TestOPProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(TestOPProp, self).__init__()

    def list_arguments(self):
        return ['x']

    def list_outputs(self):
        return ['y']

    def infer_shape(self, in_shape):
        return in_shape, in_shape

    def create_operator(self, ctx, shapes, dtypes):
        return TestOP()

ctx_list = [mx.gpu(x) for x in [0, 1, 2, 3]]
x_list = [mx.nd.ones(shape=(1, 2), ctx=c) for c in ctx_list]
data = mx.sym.var(name="data")
y = mx.sym.Custom(data, op_type="test_op")
y = mx.sym.identity(y, name="identity")
sym_block = mx.gluon.SymbolBlock(outputs=y, inputs=data)
sym_block.collect_params().reset_ctx(ctx_list)

def forward(x, ctx):
    # print("enter", x)
    re = sym_block(x)
    re.wait_to_read()
    # print("exit")
    return re

# for x, c in zip(x_list, ctx_list):
#     forward(x, c)
# mx.nd.waitall()

threads = []
for x, c in zip(x_list, ctx_list):
    t = threading.Thread(target=forward, args=(x, c))
    t.daemon = True
    t.start()
    threads.append(t)  # note: without this the join loop below is a no-op

for t in threads:
    t.join()
mx.nd.waitall()
```
It crashes without any exception or output.
If the line `print("enter", x)` is uncommented, it does not crash, but CPU usage is less than 100% and the outputs are in order, so I am sure there is only one thread executing the CustomOp.
Thanks for your report! I will check it. I have closed the previous PR, since I found it is too complex to support multi-threading. The issue is still under consideration. There are some bugs when running MXNet with multi-threading or multi-processing, e.g. https://github.com/apache/incubator-mxnet/issues/14396
Description
Environment info (Required)
Package used (Python/R/Scala/Julia): (I'm using Python)
Error Message:
Minimum reproducible example
Steps to reproduce
What have you tried to solve it?
Forward once before starting the thread.