Closed: yxchng closed this issue 3 weeks ago
Can you please describe the problem in more detail? Do you mean CUDA OOM?
Not CUDA memory. RAM usage is more than 500 GB; it's host memory, not GPU memory.
That's interesting; the entire EmbodiedScan dataset only takes up about 300 GB. Are you sure no other programs are taking up RAM?
@mxh1999 The model itself has to use some RAM too. I'm quite sure your machine has more than 500 GB of memory, so you may not have noticed the usage. Can you check the peak memory consumption? It shoots up to more than 500 GB during the evaluation phase.
We have seldom encountered such cases recently, but we will take a closer look at this issue. In the meantime, we welcome more clues/information about this problem to help us locate it more quickly.
I hit this RAM issue as well when training the 3D detector with DDP on 8 GPUs: it consumed more than 500 GB of RAM, which is my machine's limit, so the DDP processes failed. When training on 4 GPUs with a 4x4 batch size, it used nearly 400 GB of RAM.
I am not sure what happens when launching with Slurm.
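A rough back-of-the-envelope fit of the two data points above, assuming peak RAM grows linearly with the number of ranks (the per-rank slope and fixed base below are inferred from this thread's numbers, not measured):

```python
# Two observations from this thread: 8 GPUs -> ~500 GB, 4 GPUs -> ~400 GB.
# Under a linear model ram = base + per_rank * n_gpus, solve for both terms.
obs = {8: 500, 4: 400}  # n_gpus -> peak RAM in GB

per_rank = (obs[8] - obs[4]) / (8 - 4)  # extra GB consumed per rank
base = obs[8] - per_rank * 8            # fixed cost independent of rank count

print(per_rank)  # 25.0 GB per rank
print(base)      # 300.0 GB fixed, roughly the dataset's size mentioned above
```

If the model fits, each additional rank costs on the order of a full copy of some per-rank data structure, which matches the per-rank `data_list` copies discussed below.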
I am looking forward to any progress on this issue.
The issue seems to be that the mmengine dataset saves a copy of `data_list` in RAM per GPU rank during dataset initialization. A quick patch is to keep this data info list in shared memory. However, data loading time may be affected. Looking forward to a better fix.
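The idea can be illustrated with the standard library's `multiprocessing.shared_memory` (a minimal sketch, not the actual patch; the names here are made up for illustration): instead of each rank holding its own pickled copy of the data list, one process publishes the bytes once and the others attach to the same buffer.

```python
import pickle
from multiprocessing import shared_memory

# Hypothetical data list that every rank would otherwise copy into its own RAM.
data_list = [{'scan': i, 'boxes': [0.0] * 4} for i in range(1000)]
payload = pickle.dumps(data_list)

# "Rank 0": create one shared buffer and write the serialized list into it.
shm = shared_memory.SharedMemory(create=True, size=len(payload),
                                 name='demo_data_list')
shm.buf[:len(payload)] = payload

# "Other ranks": attach to the same buffer by name; each process gets a view
# into the shared segment instead of allocating its own copy of the payload.
peer = shared_memory.SharedMemory(name='demo_data_list')
restored = pickle.loads(bytes(peer.buf[:len(payload)]))
print(restored[0] == data_list[0])  # True

peer.close()
shm.close()
shm.unlink()  # free the shared segment when the last user is done
```

Deserializing still materializes Python objects per process, so the saving comes from sharing the large serialized byte buffer, not from eliminating all per-rank memory.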
I have met the same problem. When I train the model on a server with 700 GB of memory, everything is fine. When I move to a server with 200 GB, the kernel OOM killer always triggers before epoch 1 finishes.
@henryzhengr Hi, could you share your solution code? That would help me a lot. Thank you.
For a quick solution, just replace the following lines with the code below: https://github.com/OpenRobotLab/EmbodiedScan/blob/67110231a8759009ca822ff3f2b3ed577674903b/embodiedscan/datasets/embodiedscan_dataset.py#L59-L64
Make sure to install SharedArray in your environment
```python
import os

import SharedArray
from mmengine import dist

# In __init__: disable mmengine's per-rank serialization, then share it.
super().__init__(ann_file=ann_file,
                 metainfo=metainfo,
                 data_root=data_root,
                 pipeline=pipeline,
                 test_mode=test_mode,
                 serialize_data=False,
                 **kwargs)
self.share_serialize_data()

def share_serialize_data(self):
    cur_rank, num_gpus = dist.get_rank(), dist.get_world_size()
    if cur_rank == 0:
        print('Rank 0 initialized the data')
        if os.path.exists('/dev/shm/embodiedscan_data_bytes'):
            # A previous run already published the arrays; just attach.
            self.data_bytes = SharedArray.attach('shm://embodiedscan_data_bytes')
            self.data_address = SharedArray.attach('shm://embodiedscan_data_address')
        else:
            self.data_bytes, self.data_address = self._serialize_data()
            print('Loading training data to shared memory (file limit not set)')
            data_bytes_shm_arr = SharedArray.create(
                'shm://embodiedscan_data_bytes',
                self.data_bytes.shape,
                dtype=self.data_bytes.dtype)
            data_bytes_shm_arr[...] = self.data_bytes[...]
            data_bytes_shm_arr.flags.writeable = False
            data_address_shm_arr = SharedArray.create(
                'shm://embodiedscan_data_address',
                self.data_address.shape,
                dtype=self.data_address.dtype)
            data_address_shm_arr[...] = self.data_address[...]
            data_address_shm_arr.flags.writeable = False
            print('Training data list has been saved to shared memory')
        dist.barrier()
    else:
        # Other ranks wait for rank 0 to publish, then attach read-only views.
        dist.barrier()
        print(f'Reading training data from shm. rank {cur_rank}')
        self.data_bytes = SharedArray.attach('shm://embodiedscan_data_bytes')
        self.data_address = SharedArray.attach('shm://embodiedscan_data_address')
        print(f'Done reading training data. rank {cur_rank}')
    self.serialize_data = True
```
In the original code, the more GPUs you use, the more RAM the program consumes. This patch saves RAM for distributed training on multiple GPUs, but it does not help for single-GPU training.
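One practical caveat with this workaround (my own observation, not part of the patch above): the files under `/dev/shm` persist after the job exits, so a crashed or re-run job may attach to stale data. A hedged cleanup helper, with the directory parameterized so it can be exercised outside `/dev/shm`:

```python
import os


def cleanup_stale_shm(names, shm_dir='/dev/shm'):
    """Remove leftover shared-memory files so the next run re-serializes.

    `names` are the bare file names SharedArray creates under /dev/shm,
    e.g. 'embodiedscan_data_bytes'. Returns the paths actually removed.
    """
    removed = []
    for name in names:
        path = os.path.join(shm_dir, name)
        if os.path.exists(path):
            os.remove(path)
            removed.append(path)
    return removed
```

Running something like `cleanup_stale_shm(['embodiedscan_data_bytes', 'embodiedscan_data_address'])` on rank 0 before training would guard against attaching to data left behind by a dead job.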
Thanks to @henryzhengr for providing a workaround. I'll close this issue for now and welcome further discussion if there are new observations.
Prerequisite
Task
I'm using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
main branch https://github.com/open-mmlab/mmdetection3d
Environment
Reproduces the problem - code sample
-
Reproduces the problem - command or script
python tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py
Reproduces the problem - error message
The job scheduler indicates:
TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Additional information
Evaluation shouldn't use that much memory; 500+ GB is excessive!