Error occurred in training session when using multi-gpu

NJU-LHRS / official-CMID

The official implementation of paper "Unified Self-Supervised Learning Framework for Remote Sensing Images".


Error occurred in training session when using multi-gpu #11

Closed mrk1992 closed 11 months ago

mrk1992 commented 1 year ago

First of all, thank you for sharing your work.

When I train the model on multiple GPUs, the following error occurs:

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank:

I am using 4 × V100 32GB GPUs.

CUDA environment: CUDA 11.3, cuDNN 8.2.0

Python environment: torch 1.11.0

Some people say this problem is caused by a large dataset. Is there any way to solve it?

Thank you in advance.

pUmpKin-Co commented 1 year ago

Hi, thanks for your interest in our work. The line you quoted is not the actual error: in PyTorch DDP training, the specific error is printed before that line. Please post the full error output so I can help you better. There is no problem with your versions. If there really is no more detailed error message, you can try reducing num_workers in the DataLoader.
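As a rough illustration of the num_workers suggestion, here is a minimal sketch of one way to pick a conservative value per DDP process. The helper name `workers_per_process` and the cap of 4 are hypothetical and not part of this repository; the only outside assumption is that the job is launched with torchrun, which sets the `LOCAL_WORLD_SIZE` environment variable.

```python
import os


def workers_per_process(world_size, cpu_count=None, cap=4):
    """Return a conservative DataLoader num_workers value for one DDP process.

    Hypothetical helper: each process on the node gets an equal share of
    the CPU cores, limited by `cap` (an arbitrary illustrative default).
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if cpu_count is None:
        # os.cpu_count() can return None; fall back to a single core.
        cpu_count = os.cpu_count() or 1
    # Split the cores evenly across processes, then clamp to [0, cap].
    share = cpu_count // world_size
    return max(0, min(cap, share))


if __name__ == "__main__":
    # torchrun sets LOCAL_WORLD_SIZE (processes on this node);
    # default to 1 for a plain single-process run.
    world_size = int(os.environ.get("LOCAL_WORLD_SIZE", "1"))
    print("num_workers per process:", workers_per_process(world_size))
```

The result would be passed as `num_workers=...` when constructing `torch.utils.data.DataLoader`. A value of 0 loads data in the main process, which is slower but can make a failure inside the data pipeline show up directly in the traceback.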

pUmpKin-Co commented 11 months ago

Closing due to a long period of inactivity. Feel free to reopen if the problem persists.