VIS-VAR / LGSC-for-FAS

Learning Generalized Spoof Cues for Face Anti-spoofing

multi gpu parallel computing #16

Open ysm022 opened 4 years ago

ysm022 commented 4 years ago

Hello, I can run train.py with a very small dataset: 6 images as training input (3 real and 3 fake) and 2 images for validation. But when I train with a larger dataset (train: total number of data: 11071 | pos: 5705, neg: 5366; val: total number of data: 1231 | pos: 634, neg: 597), I get the following error:


Error Message Summary:

ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 1.158715GB memory on GPU 0, available memory is only 199.500000MB.

Please check whether there is any other process using GPU 0.

  1. If yes, please stop them, or start PaddlePaddle on another GPU.
  2. If no, please decrease the batch size of your model.

    at (/paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:69)

I train on an NVIDIA 1080 Ti, which has about 11 GB of memory, and the error is a ResourceExhaustedError. I have four 1080 Ti cards, so how can I do multi-GPU parallel computing? Thank you!

ZGSLZL commented 4 years ago

Hi @ysm022, first you can reduce your batch size and train on a single GPU. Second, if you want to use multi-GPU training, you need to modify the following configuration (a minimal sketch follows after this list):

  1. Set multi_gpus=True in Runner().
  2. Modify the code in train.py so that the place is taken from the trainer's device id: place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id), and wrap the training code in with fluid.dygraph.guard(place):.
  3. Launch the script with python -m paddle.distributed.launch train.py. Please refer to the Paddle dygraph documentation for more details.
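
For reference, here is a minimal sketch of the PaddlePaddle 1.x dygraph multi-GPU pattern that these steps describe. It is not the repo's actual code: build_model, build_optimizer and train_loader are placeholders, and the repo's Runner may already do the DataParallel wrapping once multi_gpus=True is set.

```python
import paddle.fluid as fluid

def train():
    # paddle.distributed.launch starts one process per GPU and assigns each a dev_id.
    place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id)
    with fluid.dygraph.guard(place):
        strategy = fluid.dygraph.parallel.prepare_context()  # init the NCCL context
        model = build_model()                                # placeholder model
        optimizer = build_optimizer(model)                   # placeholder optimizer
        model = fluid.dygraph.parallel.DataParallel(model, strategy)

        for images, labels in train_loader():                # placeholder data loader
            loss = model(images, labels)
            loss = model.scale_loss(loss)       # scale the loss by the trainer count
            loss.backward()
            model.apply_collective_grads()      # all-reduce gradients across GPUs
            optimizer.minimize(loss)
            model.clear_gradients()
```

Each launched process then trains on its own shard of the data; the script is started with python -m paddle.distributed.launch train.py as in step 3.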
CoinCheung commented 4 years ago

Hi, is the gradient of the weights computed on each GPU averaged or summed up in the reduce operation? If I use the multi-GPU training mode, do I need to scale up the learning rate according to the linear rule? By the way, could you please tell us the ratio of positive to negative samples in the original dataset? Will it be OK if our own dataset is significantly imbalanced (positive:negative = 1:5)?
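
(For what it's worth, the linear scaling rule mentioned above just means multiplying the single-GPU learning rate by the factor by which the effective batch size grows; the numbers below are only an illustration, not values from this repo.)

```python
# Linear LR scaling rule: if the effective batch size grows k times
# (e.g. k GPUs with the same per-GPU batch size), multiply the base LR by k.
base_lr = 1e-3        # LR tuned for a single GPU (illustrative value)
batch_per_gpu = 16    # illustrative per-GPU batch size
num_gpus = 4
effective_batch = batch_per_gpu * num_gpus      # 64
scaled_lr = base_lr * num_gpus                  # 4e-3
print(effective_batch, scaled_lr)
```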

silvercherry commented 4 years ago

> Hi, is the gradient of the weights computed on each GPU averaged or summed up in the reduce operation? If I use the multi-GPU training mode, do I need to scale up the learning rate according to the linear rule? By the way, could you please tell us the ratio of positive to negative samples in the original dataset? Will it be OK if our own dataset is significantly imbalanced (positive:negative = 1:5)?

Hi, have you solved this problem? I also want to use multi-GPU training. I set multi_gpus=True and ran python -m paddle.distributed.launch train.py, but it does not work, and I get:

Error Message Summary:

Error: Place CUDAPlace(0) is not supported, Please check that your paddle compiles with WITH_GPU option or check that your train process hold the correct gpu_id if you use Executor at (/paddle/paddle/fluid/platform/device_context.cc:67)

W0622 13:17:57.689676 13321 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 9.0, Runtime API Version: 9.0
W0622 13:17:57.693465 13321 device_context.cc:245] device: 0, cuDNN Version: 7.6.
2020-06-22 13:17:58,544-INFO: Loading pretrained model from ./pretrained/resnet18-torch
2020-06-22 13:17:58,728-ERROR: ABORT!!! Out of all 4 trainers, the trainer process with rank=[1, 2, 3] was aborted. Please check its log.
ERROR 2020-06-22 13:17:58,728 launch.py:284] ABORT!!! Out of all 4 trainers, the trainer process with rank=[1, 2, 3] was aborted. Please check its log.
W0622 13:17:58.734313 13321 init.cc:209] Warning: PaddlePaddle catches a failure signal, it may not work properly
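
For anyone hitting the same "Place CUDAPlace(0) is not supported" error: it usually means either that the installed paddle wheel was built without GPU support, or that the process does not see the GPU id it was assigned. A quick sanity check with standard paddle.fluid calls (not repo code) is:

```python
import os
import paddle.fluid as fluid

# False here means the installed wheel has no CUDA support and must be
# replaced with a GPU build before CUDAPlace can be used.
print("compiled with CUDA:", fluid.is_compiled_with_cuda())

# GPUs visible to this process (honours CUDA_VISIBLE_DEVICES).
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible CUDA places:", fluid.cuda_places())

# Per-process device id assigned by paddle.distributed.launch.
print("this trainer's dev_id:", fluid.dygraph.parallel.Env().dev_id)
```

As the message says, the aborted trainers (rank 1, 2, 3) each have their own log, which usually shows the real cause.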

Lucien7786 commented 4 years ago

I went to the Paddle GitHub looking for a multi-GPU sample. I tested the MNIST code linked here: https://github.com/PaddlePaddle/Paddle/issues/18205#issuecomment-508660371 with export CUDA_VISIBLE_DEVICES=0,1,2,3; python test.py and it works with 4 GPUs. Here is the log.

W0623 22:59:04.695741 9690 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.0, Runtime API Version: 9.0
W0623 22:59:04.766938 9690 device_context.cc:245] device: 0, cuDNN Version: 7.6.
I0623 22:59:07.364761 9690 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 4 cards are used, so 4 programs are executed in parallel.
W0623 22:59:16.499653 9690 fuse_all_reduce_op_pass.cc:74] Find all_reduce operators: 8. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 4.
I0623 22:59:16.500174 9690 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
I0623 22:59:16.506099 9690 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0623 22:59:16.509552 9690 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0

Pass 0, Epoch 0, Cost [array([4.4588094, 5.480756 , 5.1428175, 5.9708376], dtype=float32), array([0.0625, 0.0625, 0.0625, 0. ], dtype=float32)]
Pass 100, Epoch 0, Cost [array([0.02101828, 0.1083214 , 0.23341596, 0.61708015], dtype=float32), array([1. , 1. , 0.9375, 0.8125], dtype=float32)]
Pass 200, Epoch 0, Cost [array([0.35945463, 0.30678317, 0.19145483, 0.24440879], dtype=float32), array([0.875 , 0.9375, 0.875 , 0.9375], dtype=float32)]
Pass 300, Epoch 0, Cost [array([0.01953058, 0.14692771, 0.09672394, 0.47347093], dtype=float32), array([1. , 0.9375, 0.9375, 0.9375], dtype=float32)]
Pass 400, Epoch 0, Cost [array([0.00153226, 0.05326104, 0.04533184, 0.03504385], dtype=float32), array([1., 1., 1., 1.], dtype=float32)]
Pass 500, Epoch 0, Cost [array([0.28838432, 0.21555214, 0.03272529, 0.32204518], dtype=float32), array([0.9375, 0.9375, 1. , 0.875 ], dtype=float32)]
Pass 600, Epoch 0, Cost [array([0.12778899, 0.01811488, 0.01242642, 0.16692397], dtype=float32), array([0.9375, 1. , 1. , 0.9375], dtype=float32)]
Pass 700, Epoch 0, Cost [array([0.10428553, 0.05949949, 0.02604522, 0.00989265], dtype=float32), array([0.9375, 1. , 1. , 1. ], dtype=float32)]
Pass 800, Epoch 0, Cost [array([0.4574466 , 0.0150936 , 0.00482975, 0.04338158], dtype=float32), array([0.875, 1. , 1. , 1. ], dtype=float32)]
Pass 900, Epoch 0, Cost [array([0.01204062, 0.01237518, 0.02565481, 0.47837988], dtype=float32), array([1. , 1. , 1. , 0.9375], dtype=float32)]

Test with Epoch 0, avg_cost: 0.08308441504291267, acc: 0.9728304140127388

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2407      C   ./darknet                                   3203MiB |
|    0      9690      C   ...iniconda3/envs/paddle-gpu/bin/python3.7   365MiB |
|    1      2407      C   ./darknet                                   3203MiB |
|    1      9690      C   ...iniconda3/envs/paddle-gpu/bin/python3.7   359MiB |
|    2      2407      C   ./darknet                                   3203MiB |
|    2      9690      C   ...iniconda3/envs/paddle-gpu/bin/python3.7   359MiB |
|    3      2407      C   ./darknet                                   3203MiB |
|    3      9690      C   ...iniconda3/envs/paddle-gpu/bin/python3.7    33MiB |
+-----------------------------------------------------------------------------+

So I guess something is set up wrong in this repo's multi-GPU mode. Could you please check and give us some tips?