NVlabs / Deep_Object_Pose

Deep Object Pose Estimation (DOPE) – ROS inference (CoRL 2018)

RuntimeError: CUDA out of memory (Training phase) #220

Closed erikkockar closed 2 years ago

erikkockar commented 2 years ago

Hi guys,

I am trying to train the DOPE network using your training script. After running the command

python3 train.py --data ~/ndds/Dataset_Synthesizer/Source/NVCapturedData/TestCapturer/ --object neuron --outf neuron --gpuids 0

I get this output:

start: 16:25:23.770094
load data
training data: 64 batches
load models
Training network pretrained on imagenet.
Traceback (most recent call last):
  File "train.py", line 1392, in <module>
    _runnetwork(epoch,trainingdata)
  File "train.py", line 1334, in _runnetwork
    output_belief, output_affinities = net(data)
  File "/home/erikkockar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/erikkockar/.local/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/erikkockar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "train.py", line 153, in forward
    out1 = self.vgg(x)
  File "/home/erikkockar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/erikkockar/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/erikkockar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/erikkockar/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 443, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/erikkockar/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 439, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 626.00 MiB (GPU 0; 5.78 GiB total capacity; 3.60 GiB already allocated; 283.88 MiB free; 3.62 GiB reserved in total by PyTorch)

Output of nvidia-smi (my GPU is an NVIDIA GeForce RTX 3060):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| N/A   41C    P8    17W /  N/A |    492MiB /  6144MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1082      G   /usr/lib/xorg/Xorg                 59MiB |
|    0   N/A  N/A      1627      G   /usr/lib/xorg/Xorg                189MiB |
|    0   N/A  N/A      1755      G   /usr/bin/gnome-shell               48MiB |
|    0   N/A  N/A      2445      G   /usr/lib/firefox/firefox          183MiB |
+-----------------------------------------------------------------------------+

I also set the CUDA visible devices to 0, since nvidia-smi lists my GPU as device 0: os.environ["CUDA_VISIBLE_DEVICES"] = "0"
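For reference, a minimal sketch (not DOPE-specific) of how the visible device can be verified before training:

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before CUDA is initialized

import torch
print(torch.cuda.device_count())      # expect 1 visible device
print(torch.cuda.get_device_name(0))  # expect the RTX 3060
```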

After searching for the problem on the internet, all I could find was advice to reboot or to try different torch versions. I am now using 1.9.0, but I also tried 1.9.1 and 1.10.2.

Do you know where the problem may be? I think my GPU has enough computational power for training, or am I wrong? Thanks for any answers.

HiHiAllen commented 2 years ago

set --batchsize 16

erikkockar commented 2 years ago

> set --batchsize 16

With a batch size of 16:

RuntimeError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 5.78 GiB total capacity; 3.73 GiB already allocated; 19.06 MiB free; 3.77 GiB reserved in total by PyTorch)

With a batch size of 8:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 5.78 GiB total capacity; 3.26 GiB already allocated; 31.38 MiB free; 3.32 GiB reserved in total by PyTorch)

But with a newer version of torch I get an extended error message: "If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF."
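As I understand the hint, it amounts to setting an allocator option before torch initializes CUDA. A minimal sketch, assuming a newer PyTorch (1.10+); the 128 MiB split size is only an example value:

```python
import os

# Must be set before torch initializes CUDA; caps the size of cached blocks
# the allocator is allowed to split, which can reduce fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import only after the variable is set
```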

HiHiAllen commented 2 years ago

Sorry for not solving your problem. I haven't seen this error before. I am using CUDA 11 with PyTorch 1.7.

erikkockar commented 2 years ago

> Sorry for not solving your problem. I haven't seen this error before. I am using CUDA 11 with PyTorch 1.7.

A little update: a batch size of four works for me. But I am still wondering why I can't go higher, since I believe I have the computational power for that. It seems to me to be a torch problem with wrong allocation.

python3 train.py --batchsize 4 --data ~/ndds/Dataset_Synthesizer/Source/NVCapturedData/TestCapturer/ --object neuron --outf neuron --gpuids 0
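For anyone else stuck at a small batch size: gradient accumulation can simulate a larger effective batch. A self-contained sketch (this is not part of train.py; the tiny conv layer only stands in for the real network):

```python
import torch
import torch.nn as nn

net = nn.Conv2d(3, 9, kernel_size=3, padding=1)        # stand-in for the DOPE net
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)
criterion = nn.MSELoss()
accum_steps = 4  # effective batch = per-step batch size * accum_steps

optimizer.zero_grad()
for step in range(16):
    data = torch.randn(4, 3, 64, 64)                   # stand-in mini-batch
    target = torch.randn(4, 9, 64, 64)                 # stand-in belief-map target
    loss = criterion(net(data), target) / accum_steps  # scale so gradients average out
    loss.backward()                                    # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```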

Another thing I am now wondering about: I am training on my dataset with the ImageNet-pretrained network, and the loss is very small from the start:


start: 09:33:31.601213
load data
training data: 506 batches
load models
Training network pretrained on imagenet.
Train Epoch: 1 [0/2022 (0%)]    Loss: 0.029323829337955
Train Epoch: 1 [400/2022 (20%)] Loss: 0.000023679231163
Train Epoch: 1 [800/2022 (40%)] Loss: 0.000008381257430
Train Epoch: 1 [1200/2022 (59%)]    Loss: 0.000014029804333
Train Epoch: 1 [1600/2022 (79%)]    Loss: 0.000005878804131
Train Epoch: 1 [2000/2022 (99%)]    Loss: 0.000002487727670
Train Epoch: 2 [0/2022 (0%)]    Loss: 0.000003322811153
Train Epoch: 2 [400/2022 (20%)] Loss: 0.000004343680757
Train Epoch: 2 [800/2022 (40%)] Loss: 0.000004866438303
Train Epoch: 2 [1200/2022 (59%)]    Loss: 0.000002030236828
Train Epoch: 2 [1600/2022 (79%)]    Loss: 0.000002176634098
Train Epoch: 2 [2000/2022 (99%)]    Loss: 0.000003391759947
Train Epoch: 3 [0/2022 (0%)]    Loss: 0.000001647578983
Train Epoch: 3 [400/2022 (20%)] Loss: 0.000001305555315
Train Epoch: 3 [800/2022 (40%)] Loss: 0.000002374987162
Train Epoch: 3 [1200/2022 (59%)]    Loss: 0.000001534210355
Train Epoch: 3 [1600/2022 (79%)]    Loss: 0.000004802184321
Train Epoch: 3 [2000/2022 (99%)]    Loss: 0.000004886873739
Train Epoch: 4 [0/2022 (0%)]    Loss: 0.000009389250408
Train Epoch: 4 [400/2022 (20%)] Loss: 0.000015918551071
Train Epoch: 4 [800/2022 (40%)] Loss: 0.000004416628599
Train Epoch: 4 [1200/2022 (59%)]    Loss: 0.000000648438970
Train Epoch: 4 [1600/2022 (79%)]    Loss: 0.000000662304274
Train Epoch: 4 [2000/2022 (99%)]    Loss: 0.000009007605513
Train Epoch: 5 [0/2022 (0%)]    Loss: 0.000003984898285
Train Epoch: 5 [400/2022 (20%)] Loss: 0.000001393861794
Train Epoch: 5 [800/2022 (40%)] Loss: 0.000004186821116
Train Epoch: 5 [1200/2022 (59%)]    Loss: 0.000000712396343
Train Epoch: 5 [1600/2022 (79%)]    Loss: 0.000000592076617
Train Epoch: 5 [2000/2022 (99%)]    Loss: 0.000000580793312
Train Epoch: 6 [0/2022 (0%)]    Loss: 0.000000495225947
Train Epoch: 6 [400/2022 (20%)] Loss: 0.000000627484724
Train Epoch: 6 [800/2022 (40%)] Loss: 0.000000910183417
Train Epoch: 6 [1200/2022 (59%)]    Loss: 0.000000620817559
Train Epoch: 6 [1600/2022 (79%)]    Loss: 0.000002660205155
Train Epoch: 6 [2000/2022 (99%)]    Loss: 0.000001788120585
Train Epoch: 7 [0/2022 (0%)]    Loss: 0.000002246752956
Train Epoch: 7 [400/2022 (20%)] Loss: 0.000000400816077
Train Epoch: 7 [800/2022 (40%)] Loss: 0.000000314132564
Train Epoch: 7 [1200/2022 (59%)]    Loss: 0.000005796841833
Train Epoch: 7 [1600/2022 (79%)]    Loss: 0.000000317658902
Train Epoch: 7 [2000/2022 (99%)]    Loss: 0.000000164305803
Train Epoch: 8 [0/2022 (0%)]    Loss: 0.000000189095587
Train Epoch: 8 [400/2022 (20%)] Loss: 0.000000214353605
Train Epoch: 8 [800/2022 (40%)] Loss: 0.000000203828307
Train Epoch: 8 [1200/2022 (59%)]    Loss: 0.000000283352165
Train Epoch: 8 [1600/2022 (79%)]    Loss: 0.000000136634853
Train Epoch: 8 [2000/2022 (99%)]    Loss: 0.000000144215477
Train Epoch: 9 [0/2022 (0%)]    Loss: 0.000000270883049
Train Epoch: 9 [400/2022 (20%)] Loss: 0.000000126488899
Train Epoch: 9 [800/2022 (40%)] Loss: 0.000000238181570
Train Epoch: 9 [1200/2022 (59%)]    Loss: 0.000000171188475
Train Epoch: 9 [1600/2022 (79%)]    Loss: 0.000000167808679
Train Epoch: 9 [2000/2022 (99%)]    Loss: 0.000000133901352
Train Epoch: 10 [0/2022 (0%)]   Loss: 0.000000186704639
Train Epoch: 10 [400/2022 (20%)]    Loss: 0.000000156324205
Train Epoch: 10 [800/2022 (40%)]    Loss: 0.000000202804600
Train Epoch: 10 [1200/2022 (59%)]   Loss: 0.000000140662607
erikkockar commented 2 years ago

Hi,

Thanks a lot for that. Do you maybe have a link for this script?

Thanks


From: HiHiAllen
Sent: Saturday, February 19, 2022 12:44 PM
Subject: Re: [NVlabs/Deep_Object_Pose] RuntimeError: CUDA out of memory (Training phase) (Issue #220)
The loss looks really bad; I'm afraid the training doesn't really work. I have met this before, when my dataset was generated carelessly, and I suggest that the background pictures of your dataset should fit your work scene. You can look at the belief maps to figure out whether the training is working. I run this script to see the belief maps:

```python
import cv2
import matplotlib.pyplot as plt
import torch
import os
import numpy as np
from inference.cuboid import Cuboid3d
from inference.cuboid_pnp_solver import CuboidPNPSolver
from inference.detector import ModelData, ObjectDetector
import yaml
from PIL import Image
from PIL import ImageDraw
from torch.autograd import Variable
import torchvision.transforms as transforms
import time

os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'

# Settings
name = 'flange_plate'
# net_path = 'data/net/mustard_60.pth'
net_path = '/home/zxm/usb/ur5/src/DOPE-ROS-D435-dependabot-pip-dope-opencv-python-4.2.0.32/dope/weights/cracker_60.pth'
gpu_id = 0
img_path = '/home/zxm/usb/ur5/src/DOPE-ROS-D435-dependabot-pip-dope-opencv-python-4.2.0.32/dope/scripts/cola_image/023.png'
# img_path = 'data/images/cautery_real_1.jpg'

transform = transforms.Compose([
    # transforms.Scale(IMAGE_SIZE),
    # transforms.CenterCrop((imagesize, imagesize)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# Function for visualizing feature maps
def viz_layer(layer, n_filters=9):
    fig = plt.figure(figsize=(20, 20))
    for i in range(n_filters):
        ax = fig.add_subplot(4, 5, i + 1, xticks=[], yticks=[])
        # grab layer outputs
        ax.imshow(np.squeeze(layer[i].data.numpy()), cmap='gray')
        ax.set_title('Output %s' % str(i + 1))

# load color image
in_img = cv2.imread(img_path)
# in_img = cv2.resize(in_img, (640, 480))
in_img = cv2.cvtColor(in_img, cv2.COLOR_BGR2RGB)

# plot image
plt.imshow(in_img)

model = ModelData(name, net_path, gpu_id)
model.load_net_model()
net_model = model.net

# Run network inference: vertex beliefs and affinities
image_tensor = transform(in_img)
image_torch = Variable(image_tensor).cuda().unsqueeze(0)
out, seg = net_model(image_torch)
vertex2 = out[-1][0].cpu()
aff = seg[-1][0].cpu()

# View the vertex beliefs and affinities
viz_layer(vertex2)
viz_layer(aff, n_filters=16)
plt.show()

# Code to visualize the neural network output
def DrawLine(point1, point2, lineColor, lineWidth):
    '''Draws line on image'''
    global g_draw
    if point1 is not None and point2 is not None:
        g_draw.line([point1, point2], fill=lineColor, width=lineWidth)

def DrawDot(point, pointColor, pointRadius):
    '''Draws dot (filled circle) on image'''
    global g_draw
    if point is not None:
        xy = [
            point[0] - pointRadius, point[1] - pointRadius,
            point[0] + pointRadius, point[1] + pointRadius
        ]
        g_draw.ellipse(xy, fill=pointColor, outline=pointColor)

def DrawCube(points, color=(255, 0, 0)):
    '''
    Draws cube with a thick solid line across
    the front top edge and an X on the top face.
    '''
    lineWidthForDrawing = 2

    # draw front
    DrawLine(points[0], points[1], color, lineWidthForDrawing)
    DrawLine(points[1], points[2], color, lineWidthForDrawing)
    DrawLine(points[3], points[2], color, lineWidthForDrawing)
    DrawLine(points[3], points[0], color, lineWidthForDrawing)

    # draw back
    DrawLine(points[4], points[5], color, lineWidthForDrawing)
    DrawLine(points[6], points[5], color, lineWidthForDrawing)
    DrawLine(points[6], points[7], color, lineWidthForDrawing)
    DrawLine(points[4], points[7], color, lineWidthForDrawing)

    # draw sides
    DrawLine(points[0], points[4], color, lineWidthForDrawing)
    DrawLine(points[7], points[3], color, lineWidthForDrawing)
    DrawLine(points[5], points[1], color, lineWidthForDrawing)
    DrawLine(points[2], points[6], color, lineWidthForDrawing)

    # draw dots
    DrawDot(points[0], pointColor=color, pointRadius=4)
    DrawDot(points[1], pointColor=color, pointRadius=4)

    # draw x on the top
    DrawLine(points[0], points[5], color, lineWidthForDrawing)
    DrawLine(points[1], points[4], color, lineWidthForDrawing)

# Settings
config_name = "my_config_realsense.yaml"
exposure_val = 166

yaml_path = 'cfg/{}'.format(config_name)
with open(yaml_path, 'r') as stream:
    try:
        print("Loading DOPE parameters from '{}'...".format(yaml_path))
        params = yaml.load(stream)
        print('    Parameters loaded.')
    except yaml.YAMLError as exc:
        print(exc)

models = {}
pnp_solvers = {}
pub_dimension = {}
draw_colors = {}

# Initialize parameters
matrix_camera = np.zeros((3, 3))
matrix_camera[0, 0] = params["camera_settings"]['fx']
matrix_camera[1, 1] = params["camera_settings"]['fy']
matrix_camera[0, 2] = params["camera_settings"]['cx']
matrix_camera[1, 2] = params["camera_settings"]['cy']
matrix_camera[2, 2] = 1
dist_coeffs = np.zeros((4, 1))
if "dist_coeffs" in params["camera_settings"]:
    dist_coeffs = np.array(params["camera_settings"]['dist_coeffs'])

config_detect = lambda: None
config_detect.mask_edges = 1
config_detect.mask_faces = 1
config_detect.vertex = 1
config_detect.threshold = 0.5
config_detect.softmax = 1000
config_detect.thresh_angle = params['thresh_angle']
config_detect.thresh_map = params['thresh_map']
config_detect.sigma = params['sigma']
config_detect.thresh_points = params["thresh_points"]

# For each object to detect, load network model and create PNP solver
for model in params['weights']:
    models[model] = ModelData(
        model,
        "weights/" + params['weights'][model]
    )
    models[model].load_net_model()
    draw_colors[model] = tuple(params["draw_colors"][model])
    pnp_solvers[model] = CuboidPNPSolver(
        model,
        matrix_camera,
        Cuboid3d(params['dimensions'][model]),
        dist_coeffs=dist_coeffs
    )

# img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
# Copy and draw image
img_copy = in_img.copy()
im = Image.fromarray(img_copy)
g_draw = ImageDraw.Draw(im)

for m in models:
    # Detect object
    results = ObjectDetector.detect_object_in_image(
        models[m].net, pnp_solvers[m], in_img, config_detect
    )

    # Overlay cube on image
    for i_r, result in enumerate(results):
        if result["location"] is None:
            continue
        loc = result["location"]
        ori = result["quaternion"]

        # Draw the cube
        if None not in result['projected_points']:
            points2d = []
            for pair in result['projected_points']:
                points2d.append(tuple(pair))
            DrawCube(points2d, draw_colors[m])

open_cv_image = np.array(im)
open_cv_image = cv2.cvtColor(open_cv_image, cv2.COLOR_RGB2BGR)
cv2.imshow('Open_cv_image', open_cv_image)
cv2.waitKey(0)
plt.imshow(open_cv_image)
plt.show()
```


HiHiAllen commented 2 years ago

The loss looks really bad; I'm afraid the training isn't really working. I suggest the background pictures should fit the object's working scene. Also, DOPE has difficulty handling symmetric objects. You can look at the belief maps to figure out whether the training is working. I use the python file below to see the belief maps: inference_on_img.txt

erikkockar commented 2 years ago

I am trying to train on this object:

Screenshot from 2022-02-18 16-56-48

I could probably add more visual features. Do you think that would work?

By the way, the code from inference_on_img.txt is giving me an error:

erikkockar@erikkockar:~/catkin_ws/src/Deep_Object_Pose/scripts$ python3 belief.py 
Traceback (most recent call last):
  File "belief.py", line 6, in <module>
    from inference.cuboid import Cuboid3d
ModuleNotFoundError: No module named 'inference.cuboid'
HiHiAllen commented 2 years ago

There is a folder "inference" (dope/src/inference); copy it into your scripts path: ~/catkin_ws/src/Deep_Object_Pose/scripts
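Alternatively, a sketch of extending the import path instead of copying the folder (the path here is hypothetical; point it at the directory that contains the "inference" folder in your checkout):

```python
import sys

# Hypothetical location; adjust to wherever the "inference" folder lives
sys.path.append("/home/erikkockar/catkin_ws/src/Deep_Object_Pose/src")

from inference.cuboid import Cuboid3d  # should now resolve
```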

HiHiAllen commented 2 years ago

Sorry, I'm also new to this, so I can't offer more help. Let's wait for others.

mintar commented 2 years ago

Perhaps the resolution of the training images is too high. Try scaling down the shorter side to about 500 px.
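A small sketch of that resize (my own helper, not something from the repo), keeping the aspect ratio:

```python
import cv2

def scale_shorter_side(img, target=500):
    """Resize so the shorter side is `target` px, preserving aspect ratio."""
    h, w = img.shape[:2]
    s = target / min(h, w)
    return cv2.resize(img, (round(w * s), round(h * s)))  # dsize is (width, height)
```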

blaine141 commented 2 years ago

Hey. A batch size of 4 does not seem weird to me. The 64 was originally split across multiple GPUs and a single GPU system will handle much less.

For the low loss, I would make sure you have named your object correctly. There is a chance you are telling it to learn nothing. To verify, you can look at the belief maps.

TontonTremblay commented 2 years ago

I have been training DOPE on different DGX systems with V100s and P100s, so yes, a small batch size on a single GPU is normal. You could also look into using the SGD optimizer instead of Adam, which would use less memory. But I warn you that the optimization will be harder; I never succeeded in training DOPE with SGD.
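For context, the memory difference comes from optimizer state: Adam stores two extra tensors per parameter, while SGD stores at most one (the momentum buffer). A sketch of the swap, with a stand-in model:

```python
import torch
import torch.nn as nn

net = nn.Linear(10, 10)  # stand-in for the DOPE network

# Adam keeps exp_avg and exp_avg_sq per parameter (roughly 2x extra parameter memory)
opt_adam = torch.optim.Adam(net.parameters(), lr=1e-4)

# SGD keeps only a momentum buffer, or nothing at all if momentum=0
opt_sgd = torch.optim.SGD(net.parameters(), lr=1e-4, momentum=0.9)
```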


erikkockar commented 2 years ago

> Sorry, I'm also new to this, so I can't offer more help. Let's wait for others.

@HiHiAllen Thanks, it helped.

> Perhaps the resolution of the training images is too high. Try scaling down the shorter side to about 500 px.

@mintar Thanks, you were right about this. My output images from NDDS are now 640x480 (my RealSense is set to this resolution), and it improved the loss from e-5 to e-2:

1, 0,0.041358746588230
2, 0,0.011250677518547
3, 0,0.010450199246407
4, 0,0.007594730239362
...
7, 0,0.003804703243077

I still think it is too low, though. I am also thinking about going even lower, to around 320x240; what do you think about that?

> Hey. A batch size of 4 does not seem weird to me. The 64 was originally split across multiple GPUs and a single GPU system will handle much less.

@blaine141 Yes, now I get it, thanks. I am able to run a maximum batch size of 6, and 20 epochs take around 3 hours, so it is still okay-ish.

> For the low loss, I would make sure you have named your object correctly. There is a chance you are telling it to learn nothing. To verify, you can look at the belief maps.

Yes, you were right: I checked _object_settings.json and it was empty. I exported a new, correct dataset, and my belief maps after 7 epochs look like this:

Screenshot from 2022-02-21 10-51-07

But after applying those weights in DOPE, my RealSense camera still cannot detect the object at all at 640x480 and 10 fps.

blaine141 commented 2 years ago

I also hit this snag the first time I trained. Off the top of my head, I feel like I just needed to train longer; I normally train for about 40 hours on a similar GPU. I will have to look back at my issues to see if it was anything else.

erikkockar commented 2 years ago

A little update:

I downscaled the training images to 320x240 and created 5000 samples with NDDS. Thanks to the downscaling, I was able to train with a batch size of 8 (NVIDIA RTX 3060, laptop) using the command:

python3 train.py --batchsize 8 --data ~/ndds2/Dataset_Synthesizer/Source/NVCapturedData/TestCapturer/ --object neuron --outf neurontest --gpuids 0

After 35 epochs (a whole night of training, so approximately 8 hours), I am able to detect my robot in the environment. It still fails to detect from certain angles, but that is a matter of more training.

Thanks a lot to everyone. If there is nothing more from you, this issue can be closed.

mintar commented 2 years ago

Be careful: if you downscale your training images to 320x240, you should also later run inference at a similar resolution. You will get better results if you train at 640x480 and run inference at the same resolution. DOPE performs poorly if the "size in pixels" of an object (e.g., 100 pixels high) occurs in the test set, but not the training set. So if you have objects in the training set that are pretty close to the camera (so they are around 200 pixels high at 320x240 resolution), you will not be able to detect objects at a similar distance if you use a 640x480 resolution during testing (because then the object would be around 400 pixels high).

Of course you can also downscale the inference images, but that makes the objects harder to detect (especially if they are far away from the camera).
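As a sketch of that point (the file name is hypothetical): if the network was trained at 320x240, downscale each inference frame to the same resolution so object sizes in pixels match what the network saw during training:

```python
import cv2

frame = cv2.imread("frame_640x480.png")      # hypothetical 640x480 camera frame
frame_small = cv2.resize(frame, (320, 240))  # match the 320x240 training resolution
```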