Hi, I will try to repro that and get back to you soon with more details.
Thanks
Hi, did you manage to repro my problem? @JanuszL
Hi, a few comments:

```python
pipe_out = pipes[worker_id].run()
data_queue[worker_id].put(pipe_out)
```

is not going to work. When you call `run` again, the previous outputs are invalidated, so you need to copy them out first, to a NumPy array or a framework tensor. Please take a look at how it is done in the PyTorch iterator.
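For illustration, a minimal sketch of copying an output out before the next `run()`, roughly along the lines of what the PyTorch iterator does (assumptions: `pipe` is some already-defined pipeline, its first output is a uniformly shaped `TensorListGPU` of `uint8` images, and `feed_ndarray` from the PyTorch plugin is used as the copy helper):

```python
import torch
from nvidia.dali.plugin.pytorch import feed_ndarray

# `pipe` is assumed to be an already-constructed DALI pipeline
pipe.build()
pipe_out = pipe.run()
images = pipe_out[0]                 # TensorListGPU; only valid until the next run()
dali_tensor = images.as_tensor()     # dense view; requires uniform sample shapes
buf = torch.empty(tuple(dali_tensor.shape()), dtype=torch.uint8, device="cuda")
feed_ndarray(dali_tensor, buf)       # copy into our own buffer before calling run() again
```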
Also, you are using Python multithreading, which won't let you scale with the number of GPUs. `run()` doesn't release the Python GIL, so no other `run` can execute in parallel. DALI's `run` is simple: it launches execution and waits for the results. What you can do instead is use `schedule_run`, `share_outputs`, and `release_outputs`. `schedule_run` lets you launch work on all DALI pipelines (it is a short operation), `share_outputs` blocks until the output is ready, then you copy the data to side storage and release the DALI buffer by calling `release_outputs`. Like:
```python
pipeline.schedule_run()
while 1:
    out = pipeline.share_outputs()
    copy(Torch.Tensor, out)
    pipeline.release_outputs()
    pipeline.schedule_run()
    data_queue.put(Torch.Tensor)
```
But even with such an approach, with 4 or more GPUs it is very likely that one DALI instance will spend most of its time waiting for results, preventing the other Python threads from executing and scheduling work on the other GPUs.
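Fleshing that pseudocode out, a minimal runnable sketch might look like the following (assumptions: `pipe` is an already-built pipeline whose outputs all live on the GPU, `batch_size` is known, `data_queue` is a `queue.Queue`, and the copy goes through the CPU via `as_cpu()` for simplicity; the framework iterators instead copy device-to-device):

```python
import numpy as np

def consume(pipe, data_queue, batch_size, iterations):
    pipe.schedule_run()                          # launch the first batch, returns quickly
    for _ in range(iterations):
        outputs = pipe.share_outputs()           # blocks until the scheduled run finishes
        # copy every sample out of DALI's buffers before releasing them
        batch = [[np.array(out.as_cpu().at(i)) for i in range(batch_size)]
                 for out in outputs]
        pipe.release_outputs()                   # hand the buffers back to DALI
        pipe.schedule_run()                      # overlap the next batch with downstream work
        data_queue.put(batch)
```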
Thanks a lot for the detailed explanation. @JanuszL In my `DALILoader`, `pipe_out` is already copied out to an `MXNDArray`. The `schedule_run` example is useful for me. Does `schedule_run()` release the GIL? I'll also try Distributed Data Parallel to avoid the Python GIL and CPU bottleneck.
Hi,

> Does schedule_run() release GIL

It doesn't, but it only launches work that is done in the native DALI thread and returns to Python without doing any substantial work. So even though it doesn't release the GIL, it doesn't wait for the processing result, and you can consider it a rather fast operation.
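Since the thread settles on distributed data parallel (one process per GPU) as the fix, here is a hypothetical sketch of that layout; `make_pipeline` and `train_loop` are placeholders for your own pipeline factory and training loop:

```python
import multiprocessing as mp

def worker(device_id, num_gpus):
    # each process owns one GPU and builds its own pipeline, so no two
    # workers ever contend for the same Python GIL
    pipe = make_pipeline(device_id=device_id,
                         shard_id=device_id, num_shards=num_gpus)
    pipe.build()
    train_loop(pipe)

if __name__ == "__main__":
    num_gpus = 8
    procs = [mp.Process(target=worker, args=(i, num_gpus)) for i in range(num_gpus)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```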
Thanks a lot @JanuszL
@JanuszL Hi, I tried `COCOPipeline` with multi-process; its performance is great and scales linearly with the GPU count. I also tried `DALIGenericIterator` from `nvidia.dali.plugin.mxnet`; it is blocked by the Python GIL too. The performance with different GPU counts is as below:
| gpu count | speed (samples/s) | speed repeat (samples/s) |
|---|---|---|
| 1 | 259.91 | 354.45 |
| 2 | 261.29 | 344.21 |
| 4 | 209.75 | 317.83 |
| 8 | 228.6 | 200.32 |
There are two questions about `DALIGenericIterator`:

1. Does the Python GIL cause worse performance as the GPU count grows? I'm not sure.
2. I also tried `HybridTrainPipe` with `DALIGenericIterator`, whose input is an MXNet rec file, for a classification task. Its performance is amazing. Why does `HybridTrainPipe` have no Python GIL problem?

The test script is as below:
```python
import time

import mxnet as mx
from nvidia.dali.plugin.mxnet import DALIGenericIterator

# COCOPipeline is the user-defined pipeline from the issue description


def test_dali_mxloader():
    file_root = "data/coco/images/val2014"
    annotations_file = "data/instances_val2014_horz.json"
    num_gpus = 8
    single_batch_size = 2
    sizes = {"horizontal": 29221, "vertical": 11283}
    direction = "horizontal"
    data_size = sizes[direction]
    pipes = [COCOPipeline(batch_size=single_batch_size, num_threads=2, device_id=device_id,
                          file_root=file_root, annotations_file=annotations_file,
                          short=800, long=1200, num_gpus=num_gpus, direction=direction)
             for device_id in range(num_gpus)]
    # output_map must be an ordered list of (name, tag) pairs; a set literal
    # would make the output order non-deterministic
    output_map = [("image", "data"),
                  ("bbox", "data"),
                  ("label", "data"),
                  ("src_shape", "data"),
                  ("resized_shape", "data")]
    # size expects the number of samples as an int, not a string
    loader = DALIGenericIterator(pipes, output_map, size=data_size)
    data = next(loader)
    print(data)
    mx.nd.waitall()
    size = 0
    start_time = time.time()
    for i in range(100):
        tic = time.time()
        data = next(loader)
        mx.nd.waitall()
        interval = time.time() - tic
        size = len(data) * data[0].data[0].shape[0]
        if i > 0 and i % 5 == 0:
            print("{} size:{} speed: {} samples/s".format(i, size, size / interval))
    mx.nd.waitall()
    total_time = time.time() - start_time
    print("avg speed {} samples/s".format(100 * size / total_time))


if __name__ == "__main__":
    test_dali_mxloader()
```
`HybridTrainPipe` definition:
```python
import os

from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types


class HybridTrainPipe(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, num_gpus, db_folder):
        super(HybridTrainPipe, self).__init__(batch_size, num_threads, device_id, seed=12 + device_id)
        # each GPU reads its own shard of the RecordIO dataset
        self.input = ops.MXNetReader(path=[os.path.join(db_folder, "train.rec")],
                                     index_path=[os.path.join(db_folder, "train.idx")],
                                     random_shuffle=True, shard_id=device_id, num_shards=num_gpus)
        self.decode = ops.nvJPEGDecoder(device="mixed", output_type=types.RGB)
        self.rrc = ops.RandomResizedCrop(device="gpu", size=(224, 224))
        self.cmnp = ops.CropMirrorNormalize(device="gpu",
                                            output_dtype=types.FLOAT,
                                            output_layout=types.NCHW,
                                            crop=(224, 224),
                                            image_type=types.RGB,
                                            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                            std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
        self.coin = ops.CoinFlip(probability=0.5)

    def define_graph(self):
        rng = self.coin()
        self.jpegs, self.labels = self.input(name="Reader")
        images = self.decode(self.jpegs)
        images = self.rrc(images)
        output = self.cmnp(images, mirror=rng)
        return [output, self.labels]
```
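For reference, a hypothetical way to drive this pipeline through the MXNet plugin (`db_folder` and the batch size here are made up; `DALIClassificationIterator` is the classification convenience wrapper around `DALIGenericIterator`):

```python
from nvidia.dali.plugin.mxnet import DALIClassificationIterator

num_gpus = 8
# one pipeline per GPU, each reading its own shard (db_folder is a placeholder)
pipes = [HybridTrainPipe(batch_size=128, num_threads=2, device_id=i,
                         num_gpus=num_gpus, db_folder="data/imagenet_rec")
         for i in range(num_gpus)]
pipes[0].build()
train_iter = DALIClassificationIterator(pipes, pipes[0].epoch_size("Reader"))
for batch in train_iter:
    pass  # batch[i].data[0] / batch[i].label[0] hold the i-th GPU's tensors
```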
> Does the Python GIL cause worse performance as the GPU count grows? I'm not sure.

We didn't investigate that very deeply, as distributed data parallel is the way we recommend. But this is most likely what is happening there - Python is just blocked in one pipeline while the rest are waiting for their work to be scheduled.

> I also tried HybridTrainPipe with DALIGenericIterator, whose input is an MXNet rec file, for a classification task. Its performance is amazing. Why does HybridTrainPipe have no Python GIL problem?

Do you use raw ImageNet or a resized one? Again, this is mostly my guess, but the processing of every batch could be short enough that whenever Python asks a pipeline for its outputs they are already there and there is no waiting. If any pipeline does have to wait, it may happen that the work on the other pipelines is already done and their outputs just wait for this slowest pipeline.
About the ImageNet rec file, you are right: the rec file is resized. Thanks for your helpful reply.
I'm trying `COCOPipeline` to speed up training for detection tasks, and I benchmarked its performance with different GPU counts. Two problems confuse me.

envs

hardwares:

softwares:

problems

1. The performance doesn't scale linearly with the GPU count when using multiple threads (each GPU launches one thread), especially when the GPU count is >= 4. I tested the performance with the `DALI_extra` data and the `coco2014_val` dataset respectively, with the performance as below. Notice: `repeat 2`, `repeat 3`, `repeat 4` mean repeated runs of the same test. As the table shows, when the dataset is `DALI_extra` the performance improves noticeably going from 1 GPU to 2, but with 4 or 8 GPUs it is nearly the same as with 2 GPUs. Can `COCOPipeline`'s performance scale linearly? What can I do to improve multi-GPU performance?
2. `COCOPipeline`'s performance is unstable on the `coco2014_val` dataset. There are two problems with it: (1) as mentioned above, multi-GPU performance is almost the same as single-GPU, or even worse; (2) the performance is unstable, with wide fluctuations. Can you give some advice on the possible reasons?

COCOPipeline definition and benchmark script

COCOPipeline definition as below

benchmark scripts