sasank-desaraju opened 1 year ago
We should probably start by profiling the code. Maybe with the NVIDIA Nsight profiler?
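Before setting up Nsight, a quick pass with the built-in torch.profiler can already show where CPU time is going. A minimal sketch; the model and input here are placeholders, not our actual HRNet:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model/input standing in for the real training step.
model = torch.nn.Linear(16, 16)
x = torch.randn(4, 16)

# Profile CPU activity for one forward pass; add ProfilerActivity.CUDA
# on a GPU node to see device kernels as well.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(x)

# Print the top ops by total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

Wrapping a full training step (data loading included) instead of just the forward pass would tell us whether the 96% CPU utilization is compute or data handling.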
I ran it on 2 A100s with 31 GB of memory, aiming for 200 epochs. It got to epoch 49 before it failed with OSError: [Errno 28] No space left on device, followed by two BrokenPipeError: [Errno 32] Broken pipe errors. Additionally, the email I got from SLURM said the CPU was busy almost the entire time (96% efficiency) and that 27.2/31 GB of memory was used.
The fact that the two broken pipes appeared right after the "No space left on device" error makes me think the CPU ran out of memory, which killed the main process and broke the pipes to the GPU worker processes, causing them to fail with BrokenPipeError. I would like to know what the CPU is being used for and what it is holding in RAM. Maybe we need some torch calls (in the DataModule file?) to control where data is stored during the run. Maybe we can start by writing to disk and clearing RAM after every epoch.
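One cheap way to test the "RAM grows each epoch" theory is to log peak resident memory at every epoch boundary while clearing any per-epoch caches. A stdlib-only sketch; the cleanup hook and the cached_batches list are hypothetical names, not identifiers from our code:

```python
import gc
import resource


def log_peak_rss(epoch: int) -> None:
    """Print peak resident set size so we can see whether RAM grows per epoch."""
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KB on Linux
    print(f"epoch {epoch}: peak RSS ~ {peak_kb / 1024:.1f} MB")


def end_of_epoch_cleanup(epoch: int, cached_batches: list) -> None:
    """Hypothetical end-of-epoch hook: drop caches, force GC, log memory."""
    cached_batches.clear()  # release any batches we held onto this epoch
    gc.collect()            # force a collection so the drop shows up in RSS
    log_peak_rss(epoch)
```

If peak RSS keeps climbing even with the caches cleared, something else (e.g. DataLoader workers or logged tensors that still hold the computation graph) is accumulating.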
Tools for fixing this:
It appears that self.pose_hrnet, created at the beginning of pose_hrnet_module.py, was living on the CPU and not on the GPU. I tried to remedy this by adding a line right after it is instantiated: self.pose_hrnet.to(device='cuda', dtype=torch.float32). Am currently trying to run a training loop.
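For reference, a minimal sketch of that placement fix, with a tiny nn.Linear standing in for the real HRNet and a CPU fallback so it runs anywhere. One caveat: if the module is a LightningModule, the Trainer normally handles device placement itself, so needing a manual .to() here may indicate the model is being used outside the Trainer's control:

```python
import torch
import torch.nn as nn


class PoseHrnetModule(nn.Module):
    """Stand-in for the module defined in pose_hrnet_module.py."""

    def __init__(self):
        super().__init__()
        # Assumption: self.pose_hrnet is some nn.Module built in __init__;
        # a Linear layer stands in for the real HRNet here.
        self.pose_hrnet = nn.Linear(8, 8)
        # Move it right after construction; fall back to CPU when no GPU exists
        # so the same code runs on a login node or laptop.
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.pose_hrnet.to(device=device, dtype=torch.float32)
```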
Running 2 GPUs on HPG results in training that runs for a few epochs but then never starts the next epoch. This persists until the program times out, at which point SLURM writes to the log:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=47928841.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
Running this with 31 GB of memory instead of 8 GB yields more completed epochs before the program hangs. This makes me think that memory is not being deallocated/reallocated properly at the end of each epoch. SLURM says only 1 second of GPU time was used, so the work may all be happening on the CPU.
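If the hang happens between epochs, DataLoader worker processes are a usual suspect: each worker holds its own copy of the dataset, and persistent workers keep that memory alive across epochs. A sketch of settings worth trying (the tiny dataset here is a placeholder for ours):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the real one.
ds = TensorDataset(torch.zeros(32, 4))

# num_workers=0 rules out worker-process memory leaks entirely (everything
# loads in the main process); persistent_workers=False lets worker processes,
# and the RAM they hold, be torn down at the end of every epoch.
loader = DataLoader(ds, batch_size=8, num_workers=0, persistent_workers=False)

for (batch,) in loader:
    pass  # one "epoch" over the data
```

If training stops hanging with num_workers=0, the leak is in the worker processes and we can tune num_workers back up carefully; it would also explain why more RAM simply buys more epochs before the oom-kill.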