NVlabs / Deep_Object_Pose

Deep Object Pose Estimation (DOPE) – ROS inference (CoRL 2018)

train with CPU no CUDA -->RuntimeError: not enough memory #146

Closed mbajlo closed 3 years ago

mbajlo commented 3 years ago

Hello all,

I am trying to train the network on a custom object dataset made with NDDS. I have an ATI Radeon graphics card, so I cannot use CUDA parallel processing. I have installed torch and torchvision (both the cpu, cp38 builds) and I have tried to run train.py like this:

python train.py --data C:\Users\USER\Desktop\ML\object_dataset\object\ --batchsize 200 --pretrained False --datatest C:\Users\USER\Desktop\ML\object_dataset\object_test\ --object thorHammer --workers 1 --outf outfObject

After the script is running for some time, I got this:

    start: 21:19:38.212774
    load data
    training data: 1 batches
    testing data: 1 batches
    load models
    Training network without imagenet weights.
    Traceback (most recent call last):
      File "train.py", line 1403, in <module>
        _runnetwork(epoch,trainingdata)
      File "train.py", line 1345, in _runnetwork
        output_belief, output_affinities = net(data)
      File "C:\Users\USER\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\USER\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\parallel\data_parallel.py", line 149, in forward
        return self.module(*inputs, **kwargs)
      File "C:\Users\USER\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "train.py", line 188, in forward
        out5_1 = self.m5_1(out5)
      File "C:\Users\USER\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\USER\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\modules\container.py", line 117, in forward
        input = module(input)
      File "C:\Users\USER\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\USER\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\modules\conv.py", line 423, in forward
        return self._conv_forward(input, self.weight)
      File "C:\Users\USER\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\nn\modules\conv.py", line 419, in _conv_forward
        return F.conv2d(input, weight, self.bias, self.stride,
    RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:73] data. DefaultCPUAllocator: not enough memory: you tried to allocate 256000000 bytes. Buy new RAM!
    PS C:\Users\USER\Desktop\ML\dope_train\scripts>

Is there anything I can do besides getting more RAM? I am on Windows 10 with 24 GB of RAM.

Any help is appreciated. Thanks.

mbajlo commented 3 years ago

I made a new dataset with fewer frames, with a training-to-testing ratio of 5:1. After that I was able to increase the number of workers to 4. More might work too, but I kept it at 4. I am closing this issue.
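
For anyone landing here with the same error, a rough sanity check on the number in the traceback may help. The sketch below assumes float32 data (4 bytes per element) and a per-sample activation of 128 channels at 50 x 50. That shape is only a guess that happens to match the reported size; it is not confirmed anywhere in this thread. Under that assumption, the failed 256000000-byte allocation is exactly one activation tensor for `--batchsize 200`, and its size scales linearly with the batch size. `tensor_bytes` is a hypothetical helper, not part of train.py.

```python
def tensor_bytes(batch, channels, height, width, bytes_per_element=4):
    """Bytes needed for one dense tensor of shape (batch, channels, height, width).

    bytes_per_element=4 assumes float32.
    """
    return batch * channels * height * width * bytes_per_element


REPORTED_BYTES = 256_000_000  # from the RuntimeError in the traceback
BATCH_SIZE = 200              # from --batchsize in the original command

# Assumed per-sample activation shape (a guess that matches the numbers).
CHANNELS, HEIGHT, WIDTH = 128, 50, 50

# The assumed shape reproduces the reported allocation exactly.
assert tensor_bytes(BATCH_SIZE, CHANNELS, HEIGHT, WIDTH) == REPORTED_BYTES

# How the same single tensor shrinks with smaller batch sizes.
for batch in (200, 64, 32, 16, 8):
    megabytes = tensor_bytes(batch, CHANNELS, HEIGHT, WIDTH) / 1e6
    print(f"batchsize {batch:>3}: {megabytes:6.1f} MB for one activation tensor")
```

Training keeps many such tensors alive at once for the backward pass, so the total footprint is much larger than this single allocation. If the assumption holds, lowering `--batchsize` should shrink each of these allocations proportionally. That was not tried in this thread.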