yang326922943 opened this issue 3 days ago
You may refer to this link https://github.com/beichenzbc/Long-CLIP/issues/36#issuecomment-2182119130.
Sorry, I probably didn't express myself clearly. I meant multi-GPU training, but without srun, using only torch.distributed.launch. When I run training, it gets interrupted; it looks like an OOM, and one of the processes is killed.
Sorry, we only have a cluster managed with srun, so we can't provide a complete non-srun version. If the problem is OOM, you can lower the batch size.
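For reference, a typical single-node multi-GPU launch with torch.distributed.launch looks roughly like the sketch below. The script name (`train.py`) and its arguments (`--batch-size`) are placeholders, not the actual entry point or flags of this repo; adjust them, the GPU count, and the port to your setup.

```shell
# Hypothetical sketch: launch one process per GPU on a single node.
# --nproc_per_node = number of GPUs; --master_port avoids port clashes.
# train.py and --batch-size are placeholders for your own script/flags.
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --master_port=29500 \
    train.py --batch-size 32
```

Lowering `--batch-size` (or the equivalent flag in your script) is the usual first step when a worker process is OOM-killed. Note that newer PyTorch versions recommend `torchrun` as a drop-in replacement for `torch.distributed.launch`.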
Can you give me a command for multi-GPU training without srun? I'm having trouble. Thanks.