er-muyue / DeFRCN

MIT License

OutOfMemoryError with PrototypicalCalibrationBlock #73

Open gladdduck opened 5 months ago

gladdduck commented 5 months ago

Hello, when I train my dataset using DeFRCN, I encountered an issue. The base training process goes smoothly, but when I attempt K-shot finetuning, I keep getting an OutOfMemoryError.

I tried to solve it and found that when setting PCB_ENABLE to False, this issue doesn't occur.

However, when PCB_ENABLE is set to True, even if I adjust IMS_PER_BATCH to 1 on A100-40G, I still encounter the OutOfMemoryError.

Has anyone else experienced a similar issue? How was it resolved?

cnjhh commented 4 months ago

The solution is to locate the PCB module at `/path/defrcn/defrcn/evaluation/calibration_layer.py`. In the `build_prototypes` function, right after the line `all_features.append(features.cpu().data)`, add the line `features = None`.
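Why dropping the reference helps can be shown with a plain-Python sketch (hypothetical names; `weakref` stands in for verifying that the large per-image object is freed as soon as its last reference is dropped, which is what releases the corresponding CUDA memory in PyTorch):

```python
import weakref

class RoiFeatures:
    """Stand-in for a large per-image feature tensor."""

def build_prototypes_sketch(num_images):
    all_features = []  # small CPU-side copies, analogous to features.cpu().data
    liveness = []      # weak refs so we can verify each big object was freed
    for i in range(num_images):
        features = RoiFeatures()              # analogous to self.extract_roi_features(...)
        liveness.append(weakref.ref(features))
        all_features.append(("cpu_copy", i))  # keep only the small copy
        features = None  # drop the last strong reference: the big object is
                         # freed now, not only when the next iteration rebinds it
    return all_features, liveness
```

Without `features = None`, the name `features` keeps the previous image's tensor alive through the next forward pass, which raises peak GPU memory.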

gladdduck commented 4 months ago

> The solution is to locate the PCB module at `/path/defrcn/defrcn/evaluation/calibration_layer.py`. In the `build_prototypes` function, right after the line `all_features.append(features.cpu().data)`, add the line `features = None`.

Thanks for your reply! This works!

gladdduck commented 4 months ago

> The solution is to locate the PCB module at `/path/defrcn/defrcn/evaluation/calibration_layer.py`. In the `build_prototypes` function, right after the line `all_features.append(features.cpu().data)`, add the line `features = None`.

However, this error still occurs from time to time. The allocation happens in `build_prototypes` in `calibration_layer.py`, at `features = self.extract_roi_features(img, boxes)`, and inside `extract_roi_features` at `conv_feature = self.imagenet_model(images.tensor[:, [2, 1, 0]])`. I'm very confused about this, since I already call `gc.collect()` and `torch.cuda.empty_cache()`.

cnjhh commented 4 months ago

Set each intermediate reference to `None` right after it is consumed:

```python
features = self.extract_roi_features(img, boxes)
boxes = None
img = None
all_features.append(features.cpu().data)
features = None
```

The features are built from your custom dataset's novel classes, so you can reduce memory by reducing the number of novel classes, or by generating the features offline: instead of extracting features from the novel data every time the model is validated, save them once via the pickle module, then modify the code to load the offline features directly during validation.
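The offline-caching idea above can be sketched as follows (a minimal sketch, not the repo's code; `prototypes.pkl` and `compute_prototypes` are hypothetical names standing in for the expensive extraction over the novel data):

```python
import os
import pickle

CACHE_PATH = "prototypes.pkl"  # hypothetical cache location

def compute_prototypes():
    """Stand-in for the expensive feature extraction over the novel data."""
    return {"class_0": [0.1, 0.2], "class_1": [0.3, 0.4]}

def load_or_build_prototypes(path=CACHE_PATH):
    # Reuse cached prototypes if present; otherwise build once and save.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    prototypes = compute_prototypes()
    with open(path, "wb") as f:
        pickle.dump(prototypes, f)
    return prototypes
```

With this pattern, the heavy extraction runs once; every later validation pass only deserializes the cached file and never touches the GPU for prototype building.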

cnjhh commented 4 months ago

The device I use is an A800 80G, and my novel setting is 10-shot with 13 classes. When the model is loaded with the PCB module, GPU memory usage reaches 53 GB; before the modification, even 80 GB was not enough.