BRIO-lab / LitJTML

Using PyTorch Lightning and WandbLogger for our JTML neural network segmentation code

HPG hangs after a few epochs with multiple GPUs #1

Open sasank-desaraju opened 1 year ago

sasank-desaraju commented 1 year ago

Running 2 GPUs on HPG results in training that runs for a few epochs but then never starts the next epoch. This persists until the job times out, at which point Slurm writes to the log: `slurmstepd: error: Detected 1 oom-kill event(s) in StepId=47928841.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.`

Running this with 31 GB of memory instead of 8 GB completes more epochs before the program hangs. This makes me think that memory is not being deallocated/reallocated properly at the end of each epoch. Slurm says only 1 second of GPU time was used, so the work may not actually be running on the GPU.
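One way to test the "memory isn't freed between epochs" hypothesis would be to print resident host memory at every epoch boundary. A minimal sketch using a Lightning callback and `psutil` (the callback name is just illustrative, and the Trainer arguments stand in for however we actually construct it):

```python
import psutil
import pytorch_lightning as pl

class HostMemoryMonitor(pl.Callback):
    """Prints resident host memory at every epoch boundary so the Slurm log shows whether RAM keeps growing."""

    def on_train_epoch_end(self, trainer, pl_module):
        rss_gb = psutil.Process().memory_info().rss / 1e9
        print(f"epoch {trainer.current_epoch}: host RSS = {rss_gb:.2f} GB")

# trainer = pl.Trainer(accelerator="gpu", devices=2, callbacks=[HostMemoryMonitor()])
```

If the printed RSS climbs every epoch until it hits the job's memory request, that would match the cgroup oom-kill we're seeing.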

sasank-desaraju commented 1 year ago

We should probably start by profiling the code. Maybe with the NVIDIA Nsight profiler?
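A minimal sketch of two ways to get profiling numbers without changing much code (the `train.py` entry point and the helper names here are placeholders for however we actually build and launch the Trainer): Lightning's built-in profiler for per-hook timings, and NVTX ranges that Nsight Systems can pick up if the run is launched under `nsys profile python train.py`.

```python
import torch
import pytorch_lightning as pl

def build_trainer():
    # Lightning's built-in profiler prints per-hook timings at the end of the run,
    # which should show whether time goes to the dataloaders, the forward/backward pass, or logging.
    return pl.Trainer(
        accelerator="gpu",
        devices=2,
        max_epochs=5,          # keep the profiled run short
        profiler="simple",     # "advanced" gives cProfile-style output
    )

def profiled_fit(trainer, model, datamodule):
    # Wrapping fit() in emit_nvtx() adds NVTX ranges around autograd ops, so a run launched
    # under `nsys profile python train.py` shows named regions in the Nsight Systems timeline.
    with torch.autograd.profiler.emit_nvtx():
        trainer.fit(model, datamodule=datamodule)
```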

sasank-desaraju commented 1 year ago

I ran it with 2 A100s and 31 GB of memory, shooting for 200 epochs. It got to epoch 49 before it hit `OSError: [Errno 28] No space left on device`, followed by two `BrokenPipeError: [Errno 32] Broken pipe` errors. Additionally, the email I got from Slurm said that the CPU was busy almost the entire time (96% efficiency) and that 27.2/31 GB of memory was used.

The fact that there were two broken pipes right after the "No space left on device" error makes me think that the CPU ran out of memory, which caused it to stop and break the pipes leading to the GPU processes, which then failed with BrokenPipeError. I would like to know what the CPU is being used for and what it is holding in RAM. Maybe we need some torch calls (in the DataModule file?) to control where data is stored during the run. Maybe we can start by writing to disk and clearing RAM after every epoch.
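As a starting point, a minimal sketch of what "clearing RAM after every epoch" could look like. The hooks (`train_dataloader`, `on_train_epoch_end`) are real Lightning hooks, but the DataModule name, dataset argument, and dataloader settings are assumptions, not what's currently in the repo:

```python
import gc
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader

class JTMLDataModule(pl.LightningDataModule):
    """Illustrative DataModule; the dataset and batch settings stand in for our real ones."""

    def __init__(self, dataset, batch_size=8):
        super().__init__()
        self.dataset = dataset
        self.batch_size = batch_size

    def train_dataloader(self):
        return DataLoader(
            self.dataset,
            batch_size=self.batch_size,
            num_workers=4,             # each worker process holds its own copy of the dataset object in RAM
            persistent_workers=False,  # tear workers down between epochs instead of keeping them alive
            pin_memory=True,
        )

class FreeMemoryEachEpoch(pl.Callback):
    """Releases cached host and GPU memory at every epoch boundary."""

    def on_train_epoch_end(self, trainer, pl_module):
        gc.collect()
        torch.cuda.empty_cache()

# trainer = pl.Trainer(accelerator="gpu", devices=2, callbacks=[FreeMemoryEachEpoch()])
```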

sasank-desaraju commented 1 year ago

Tools for fixing this:

sasank-desaraju commented 1 year ago

It appears that `self.pose_hrnet`, created at the beginning of pose_hrnet_module.py, was living on the CPU and not on the GPU. I tried to remedy this by adding a line right after it is instantiated: `self.pose_hrnet.to(device='cuda', dtype=torch.float32)`. I am currently trying to run a training loop.
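Worth noting that Lightning's Trainer is supposed to move the LightningModule, including any submodule assigned to `self`, onto the GPU itself, so a hard-coded `.to('cuda')` in `__init__` shouldn't be necessary and can fight with multi-device strategies. A minimal sketch of the pattern, with a small Conv2d standing in for whatever pose_hrnet_module.py actually constructs:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class PoseHRNetModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Assigning the network to an attribute registers it as a submodule, so
        # Trainer(accelerator="gpu", devices=2) moves it to the right device(s) for us.
        self.pose_hrnet = torch.nn.Conv2d(1, 2, kernel_size=3, padding=1)  # stand-in for the real HRNet constructor

    def training_step(self, batch, batch_idx):
        image, target = batch                # Lightning has already moved the batch to self.device
        pred = self.pose_hrnet(image)
        loss = F.cross_entropy(pred, target)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```

If a tensor does need to be created by hand inside the module, `self.device` gives the device Lightning placed the module on, which avoids hard-coding `'cuda'`.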