Open ysm022 opened 4 years ago
Hi @ysm022, first, you can reduce your batch size and train on a single GPU. Second, if you want to use multi-GPU training, you need to modify the following configuration:
place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id)
with fluid.dygraph.guard(place):
Hi, is the gradient of the weights computed on each GPU averaged or summed up in the reduce operation? If I use multi-GPU training, do I need to scale up the learning rate according to the linear scaling rule? By the way, could you please tell us the positive/negative ratio of the original dataset? Will it be OK if our own dataset is significantly imbalanced (positive:negative = 1:5)?
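A quick way to sanity-check these two rules outside of Paddle: the sketch below simulates sum vs. mean all-reduce on fake gradients, applies the linear scaling rule, and derives inverse-frequency class weights for a 1:5 imbalance. This is pure Python for illustration, not Paddle code, and all numbers are made up:

```python
# Pure-Python illustration (not Paddle code); all values are made up.
n_workers = 4

# One fake 3-element gradient per worker: [1,1,1], [2,2,2], ...
grads = [[w + 1.0] * 3 for w in range(n_workers)]

# All-reduce as a SUM vs. as a MEAN over workers.
summed = [sum(g[i] for g in grads) for i in range(3)]   # [10.0, 10.0, 10.0]
averaged = [s / n_workers for s in summed]              # [2.5, 2.5, 2.5]

# If the framework averages gradients, each step behaves like single-GPU
# training on one worker's batch, but the EFFECTIVE batch is n_workers
# times larger -- the linear scaling rule then says: scale the LR by
# the same factor.
base_lr = 0.01                      # LR tuned for one GPU (assumed value)
scaled_lr = base_lr * n_workers     # 0.04

# One common fix for a 1:5 positive:negative imbalance is inverse-frequency
# class weighting in the loss.
n_pos, n_neg = 1000, 5000           # assumed counts with a 1:5 ratio
total = n_pos + n_neg
w_pos = total / (2 * n_pos)         # 3.0 -- positives weighted up
w_neg = total / (2 * n_neg)         # 0.6 -- negatives weighted down

print(summed, averaged, scaled_lr, w_pos, w_neg)
```

If the reduce is a plain sum instead of a mean, the effective learning rate is already multiplied by the worker count, so dividing the loss (or the LR) by `n_workers` recovers single-GPU behavior.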
Error: Place CUDAPlace(0) is not supported, Please check that your paddle compiles with WITH_GPU option or check that your train process hold the correct gpu_id if you use Executor at (/paddle/paddle/fluid/platform/device_context.cc:67)
W0622 13:17:57.689676 13321 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 9.0, Runtime API Version: 9.0
W0622 13:17:57.693465 13321 device_context.cc:245] device: 0, cuDNN Version: 7.6.
2020-06-22 13:17:58,544-INFO: Loading pretrained model from ./pretrained/resnet18-torch
2020-06-22 13:17:58,728-ERROR: ABORT!!! Out of all 4 trainers, the trainer process with rank=[1, 2, 3] was aborted. Please check its log.
ERROR 2020-06-22 13:17:58,728 launch.py:284] ABORT!!! Out of all 4 trainers, the trainer process with rank=[1, 2, 3] was aborted. Please check its log.
W0622 13:17:58.734313 13321 init.cc:209] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0623 22:59:04.695741 9690 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.0, Runtime API Version: 9.0
W0623 22:59:04.766938 9690 device_context.cc:245] device: 0, cuDNN Version: 7.6.
I0623 22:59:07.364761 9690 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 4 cards are used, so 4 programs are executed in parallel.
W0623 22:59:16.499653 9690 fuse_all_reduce_op_pass.cc:74] Find all_reduce operators: 8. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 4.
I0623 22:59:16.500174 9690 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
I0623 22:59:16.506099 9690 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0623 22:59:16.509552 9690 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2407      C   ./darknet                                   3203MiB |
|    0      9690      C   ...iniconda3/envs/paddle-gpu/bin/python3.7   365MiB |
|    1      2407      C   ./darknet                                   3203MiB |
|    1      9690      C   ...iniconda3/envs/paddle-gpu/bin/python3.7   359MiB |
|    2      2407      C   ./darknet                                   3203MiB |
|    2      9690      C   ...iniconda3/envs/paddle-gpu/bin/python3.7   359MiB |
|    3      2407      C   ./darknet                                   3203MiB |
|    3      9690      C   ...iniconda3/envs/paddle-gpu/bin/python3.7    33MiB |
+-----------------------------------------------------------------------------+
So I guess something is set up wrong in multi-GPU mode. Could you please check and give us some tips?
Hello, I can run train.py with a very small dataset: 6 pics as training input (3 real and 3 fake) and 2 pics for validation. But when I train on a big dataset (training: total number of data: 11071 | pos: 5705, neg: 5366; validation: total number of data: 1231 | pos: 634, neg: 597), I get the following error:
Error Message Summary:
ResourceExhaustedError:
Out of memory error on GPU 0. Cannot allocate 1.158715GB memory on GPU 0, available memory is only 199.500000MB.
Please check whether there is any other process using GPU 0.
If no, please decrease the batch size of your model.
at (/paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:69)
I use an NVIDIA 1080 Ti to train, which has about 11 GB of memory, and the error is ResourceExhaustedError. I have four 1080 Ti cards in total, so how can I do multi-GPU parallel computing? Thank you!
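Independently of going multi-GPU, the OOM message above can be turned into a concrete smaller batch size: activation memory grows roughly linearly with batch size, so scaling the batch by the ratio of free memory to the failed allocation is a reasonable first guess. A pure-Python sketch (the current batch size of 32 is an assumed value, not taken from this thread):

```python
import math

def suggest_batch_size(current_bs, requested_gb, free_mb):
    """Scale the batch size down by the free/requested memory ratio.

    Activation memory grows roughly linearly with batch size, so if only
    `free_mb` MiB is available for a `requested_gb` GB allocation, shrink
    the batch proportionally, keeping 10% headroom and at least 1 sample.
    """
    requested_mb = requested_gb * 1024.0
    ratio = free_mb / requested_mb
    return max(1, math.floor(current_bs * ratio * 0.9))

# Values from the error message: needs 1.158715 GB, only 199.5 MB free.
print(suggest_batch_size(32, 1.158715, 199.5))  # -> 4
```

Note that with the darknet process holding ~3.2 GB per card, freeing that memory first would help far more than shrinking the batch.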