NVlabs / Deep_Object_Pose

Deep Object Pose Estimation (DOPE) – ROS inference (CoRL 2018)

failed detection of the custom object #151

Open mbajlo opened 3 years ago

mbajlo commented 3 years ago

Hello all, I have read a lot of topics here and they got me pretty far, but now I am stuck. I am trying to train the network on my custom object, which was made in Blender, exported as an FBX file, and imported into NDDS. In NDDS I made a training dataset of 200 images and a test dataset of 50 images. The _camera_settings.json looks like this:

{
    "camera_settings": [
        {
            "name": "Viewpoint",
            "horizontal_fov": 90,
            "intrinsic_settings":
            {
                "resX": 400,
                "resY": 400,
                "fx": 200,
                "fy": 200,
                "cx": 200,
                "cy": 200,
                "s": 0
            },
            "captured_image_size":
            {
                "width": 400,
                "height": 400
            }
        }
    ]
}
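
As a quick sanity check (not part of the DOPE pipeline), these values are consistent with the exported horizontal_fov, assuming the usual pinhole relation fx = (resX / 2) / tan(hfov / 2):

import math

# Sanity check of the exported intrinsics, assuming the usual pinhole relation.
res_x, hfov_deg = 400, 90.0
fx = (res_x / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
print(fx)   # 200.0, matching the exported fx/fy of 200 (square pixels, same vertical FOV)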

The _object_settings.json looks like this:

{
    "exported_object_classes": [
        "thorHammer_387"
    ],
    "exported_objects": [
        {
            "class": "thorHammer_387",
            "segmentation_class_id": 255,
            "segmentation_instance_id": 16777214,
            "fixed_model_transform": [
                [ 0, 0, 1, 0 ],
                [ -1, 0, 0, 0 ],
                [ 0, -1, 0, 0 ],
                [ 0, -100, -420, 1 ]
            ],
            "cuboid_dimensions": [ 8.1920995712280273, 38.728199005126953, 13.551199913024902 ]
        }
    ]
}

The 000019.json (the others are quite similar), for example, looks like this:

{
    "camera_data":
    {
        "location_worldframe": [ -405.77529907226563, 9.6858997344970703, 100 ],
        "quaternion_xyzw_worldframe": [ 0, 0, -0.95569998025894165, 0.29449999332427979 ]
    },
    "objects": [
        {
            "class": "thorHammer_387",
            "instance_id": 16777214,
            "visibility": 1,
            "location": [ 0, 0, 17.209299087524414 ],
            "quaternion_xyzw": [ 0, 0.95569998025894165, 0, 0.29449999332427979 ],
            "pose_transform": [
                [ 0.56279999017715454, 0, -0.82660001516342163, 0 ],
                [ -0.82660001516342163, 0, -0.56279999017715454, 0 ],
                [ 0, -1, 0, 0 ],
                [ 0, 0, 17.209299087524414, 1 ]
            ],
            "cuboid_centroid": [ 0.2175000011920929, 8.7371997833251953, 17.251399993896484 ],
            "projected_cuboid_centroid": [ 202.52149963378906, 301.29278564453125 ],
            "bounding_box":
            {
                "top_left": [ -1.6863000392913818, 134.03680419921875 ],
                "bottom_right": [ 653.14111328125, 298.7296142578125 ]
            },
            "cuboid": [
                [ 0.64539998769760132, -10.626899719238281, 9.3454999923706055 ],
                [ 7.4166998863220215, -10.626899719238281, 13.956299781799316 ],
                [ 7.4166998863220215, 28.101299285888672, 13.956299781799316 ],
                [ 0.64539998769760132, 28.101299285888672, 9.3454999923706055 ],
                [ -6.9816999435424805, -10.626899719238281, 20.546600341796875 ],
                [ -0.21040000021457672, -10.626899719238281, 25.15730094909668 ],
                [ -0.21040000021457672, 28.101299285888672, 25.15730094909668 ],
                [ -6.9816999435424805, 28.101299285888672, 20.546600341796875 ]
            ],
            "projected_cuboid": [
                [ 213.81129455566406, -27.421699523925781 ],
                [ 306.28469848632813, 47.711898803710938 ],
                [ 306.28469848632813, 602.70489501953125 ],
                [ 213.81129455566406, 801.38519287109375 ],
                [ 132.04010009765625, 96.558197021484375 ],
                [ 198.32730102539063, 115.51670074462891 ],
                [ 198.32730102539063, 423.40451049804688 ],
                [ 132.04010009765625, 473.53750610351563 ]
            ]
        }
    ]
}
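
For what it is worth, the projected_cuboid values can be reproduced from the intrinsics above with a plain pinhole projection; a minimal check with only the first two corners copied from the json (the rest elided):

import numpy as np

# Re-project exported cuboid corners with the camera intrinsics (fx = fy = 200, cx = cy = 200).
fx = fy = cx = cy = 200.0
cuboid = np.array([
    [0.6454, -10.6269, 9.3455],    # corner 0
    [7.4167, -10.6269, 13.9563],   # corner 1
])
for x, y, z in cuboid:
    print(fx * x / z + cx, fy * y / z + cy)
# corner 0 -> (213.81, -27.42), corner 1 -> (306.28, 47.71), matching projected_cuboid;
# note that several projected corners fall outside the 400x400 image (e.g. -27.4 or 801.4).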

The 000019.png looks like this: [image: 000019]

I have run the nvdu_viz tool on the dataset and it looks just fine; for example, this is the image 000019.png: [image]

My header.txt file looks like this:

Namespace(batchsize=50, data='C:\\Users\\m\\Desktop\\ML\\ThorHammer_387mm_dataset\\train\\yaw\\thorHammer_YawAngle\\', datasize=None, datatest='C:\\Users\\m\\Desktop\\ML\\ThorHammer_387mm_dataset\\test\\yaw\\thorHammer_YawAngle\\', epochs=60, gpuids=[0], imagesize=400, loginterval=100, lr=0.0001, manualseed=3543, namefile='epoch', nbupdates=None, net='', noise=2.0, object='thorHammer_387', option='default', outf='train_ThorHammer_testBeliefExport', pretrained=False, save=False, sigma=4, workers=8)
seed: 3543

I have trained for 21 epochs. At epoch 1 the train loss was 0.032777167856693 and at epoch 21 it is 0.000019518713088; the test loss is similar, going from 0.009614795446396 at epoch 1 to 0.000018934417312 at epoch 21.

After the training I decided to test my network on the images I had used for training and testing. But this is where things began to look strange: I am not getting any detections at all.

Then I added this code to the ObjectDetector class and used it to see the belief maps:

    @staticmethod
    def detect_object_in_image(net_model, pnp_solver, in_img, config, grid_belief_debug=False, norm_belief=True, run_sampling=False, network='dope'):
        '''Detect objects in a image using a specific trained network model'''

        if in_img is None:
            return []

        if network == 'full':
            scale_factor = 1
            OFFSET_DUE_TO_UPSAMPLING = 0 
        else: # 'dope' and 'mobile'
            scale_factor = 8
            # OFFSET_DUE_TO_UPSAMPLING = 0.4395
            OFFSET_DUE_TO_UPSAMPLING = 0

        # Run network inference
        # print(in_img.shape)
        image_tensor = transform(in_img)
        # image_torch = Variable(image_tensor).cuda().unsqueeze(0)
        image_torch = Variable(image_tensor).cpu().unsqueeze(0)
        # print(image_torch.shape)
        out, seg = net_model(image_torch)  # run inference using the network (calls 'forward' method)
        vertex2 = out[-1][0]
        aff = seg[-1][0]

        # Find objects from network output
        detected_objects = ObjectDetector.find_object_poses(vertex2, aff, pnp_solver, config
            # run_sampling=run_sampling,
            # scale_factor = scale_factor,
            # OFFSET_DUE_TO_UPSAMPLING = OFFSET_DUE_TO_UPSAMPLING
            )

        if not grid_belief_debug: 
            return detected_objects, None
        else:
            # Run the belief maps debug display on the beliefmaps
            upsampling = nn.UpsamplingNearest2d(scale_factor=scale_factor)
            tensor = vertex2 # shape [9, 50, 50]
            belief_imgs = []
            in_img = (torch.tensor(in_img).float()/255.0)
            in_img *= 0.7            

            for j in range(tensor.size()[0]):
                belief = tensor[j].clone()
                if norm_belief:
                    belief -= float(torch.min(belief).item())
                    belief /= float(torch.max(belief).item())

                # print (image_torch.size())
                # raise()    
                # belief *= 0.5
                # print(in_img.size())
                belief = upsampling(belief.unsqueeze(0).unsqueeze(0)).squeeze().squeeze().data 
                belief = torch.clamp(belief, 0, 1).cpu()  
                belief = torch.cat([
                            # belief.unsqueeze(0) + in_img[:,:,0],
                            # belief.unsqueeze(0) + in_img[:,:,1],
                            # belief.unsqueeze(0) + in_img[:,:,2]
                            belief.unsqueeze(0),
                            belief.unsqueeze(0),
                            belief.unsqueeze(0)
                            ]).unsqueeze(0)
                belief = torch.clamp(belief, 0, 1) 

                # belief_imgs.append(belief.data.squeeze().cpu().numpy().transpose(1,2,0))
                belief_imgs.append(belief.data.squeeze().numpy())

            # Create the image grid
            belief_imgs = torch.tensor(np.array(belief_imgs))

            im_belief = ObjectDetector.get_image_grid(belief_imgs, None, mean=0, std=1)

            return detected_objects, im_belief

I have used it on the jpg image 000019.jpg in a separate script like this:

        # read the image(jpg) on which the network should be tested. 
        # example: 
        # C:\\Users\\m\\Desktop\\000044.jpg
        pathToImg = input("enter path to the image:")
        print("path to the image is: {}".format(pathToImg))
        img = cv2.imread(pathToImg)
        cv2.imshow('img', img)
        cv2.waitKey(1)

        for m in models:
            # try to detect object
            results, im_belief = ObjectDetector.detect_object_in_image(models[m].net, pnp_solvers[m], img, config_detect, grid_belief_debug=True, norm_belief=True, run_sampling=True)

            print("objects found: {}".format(results))
            cv_imageBelief = np.array(im_belief)
            imageToShow = cv2.resize(cv_imageBelief, dsize=(800, 800))
            cv2.imshow('beliefMaps', imageToShow)
            cv2.waitKey(0)      

        print("end")

if __name__ == '__main__':
    main()

And the result I am getting back doesn't look good at all: [image]

So I tried to figure out what went wrong, because the error looks quite low but the network cannot detect the object in the training/test images. I followed this topic, which says I need to generate belief maps during training.

So I tried to train the network again and save the Variable tensors like this:

target_belief = Variable(targets['beliefs'].cpu())
torch.save(target_belief, 'target_belief_{}.pt'.format(epoch))

target_affinity = Variable(targets['affinities'].cpu())
torch.save(target_affinity, 'target_affinity_{}.pt'.format(epoch))
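
An alternative, untested sketch: the same maps could be written straight to a PNG grid next to the torch.save calls, assuming targets['beliefs'] has shape [batch, 9, H, W]:

from torchvision.utils import save_image

beliefs = targets['beliefs'].cpu()               # assumed shape [B, 9, H, W]
grid = beliefs.view(-1, 1, *beliefs.shape[2:])   # one grayscale image per belief map
save_image(grid, 'target_belief_{}.png'.format(epoch), nrow=9, normalize=True)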

This way I have the tensors of the variables and can try to show them later, which I did like this (I am not sure if it is correct):

def targetBelif_show(beliefTensor, norm_belief=False, scale_factor=8):
    '''show target belief maps'''
    upsampling = nn.UpsamplingNearest2d(scale_factor=scale_factor)
    tensor = beliefTensor 
    belief_imgs = []          

    for j in range(tensor.size()[0]):
        belief = tensor[j].clone()
        if norm_belief:
            belief -= float(torch.min(belief).item())
            belief /= float(torch.max(belief).item())

        belief = upsampling(belief.unsqueeze(0).unsqueeze(0)).squeeze().squeeze().data 
        belief = torch.clamp(belief, 0, 1).cpu()  
        belief = torch.cat([belief.unsqueeze(0), belief.unsqueeze(0), belief.unsqueeze(0)]).unsqueeze(0)
        belief = torch.clamp(belief, 0, 1)  

        belief_imgs.append(belief.data.squeeze().numpy())

    belief_imgs = torch.tensor(np.array(belief_imgs))

    im_belief = ObjectDetector.get_image_grid(belief_imgs, None, mean=0, std=1)

    return im_belief

def main():
    '''main function'''

    # open one of the saved target belief tensors (saved for each epoch while training)
    with open("C:\\Users\\m\\Desktop\\debug\\target_belief_20.pt", 'rb') as f:
        buffer = io.BytesIO(f.read())
    LodedTensor = torch.load(buffer, map_location='cpu')

    # show the target belief for each batch
    count = 0
    for t in LodedTensor:
        print ("tensor_{}, {}".format(count, t.shape))
        count += 1
        im_belief = targetBelif_show(t, norm_belief=True)
        cv_imageBelief = np.array(im_belief)
        imageToShow = cv2.resize(cv_imageBelief, dsize=(800, 800))
        cv2.imshow('beliefMaps', imageToShow)
        cv2.waitKey(0)    

main()

But then I saw that the target belief maps are all the same no matter which epoch's tensor I read, and I would say they are not looking good.

[image]

I know I am missing something big, and I am not sure if I am showing the target belief maps from training correctly. Can you help or point me in the right direction?

Thanks, any help is appreciated.

TontonTremblay commented 3 years ago

Thank you for the detailed information. It sounds like things should work out of the box, but they are not. In the hammer example you shared, the hammer is too close to the camera; you can only see 2 keypoints. Does all your data look like this one? I have noticed that if the keypoints are not all visible in the image, the architecture is hard to train to produce valid detections. Maybe try with a smaller hammer appearance.

It also looks like the training data is empty, and I am not sure why. You could check in the dataloader whether the belief maps are correct at that level.
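
Something like the following rough, untested sketch could serve as that check. The names are assumptions: trainingdata stands for the DataLoader built in train.py, and the batch dict is assumed to contain 'beliefs' with shape [batch, 9, H, W], as in the training loop snippets above.

import cv2

def show_first_batch_beliefs(trainingdata):
    '''Display the target belief maps of the first batch coming out of the dataloader.'''
    targets = next(iter(trainingdata))
    beliefs = targets['beliefs']                                        # assumed shape [B, 9, H, W]
    for j in range(beliefs.shape[1]):
        bmap = beliefs[0, j].numpy()
        bmap = (bmap - bmap.min()) / (bmap.max() - bmap.min() + 1e-8)   # normalize for display
        bmap = cv2.resize(bmap, (400, 400), interpolation=cv2.INTER_NEAREST)
        cv2.imshow('belief_{}'.format(j), bmap)
    cv2.waitKey(0)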

mbajlo commented 3 years ago

Hello @TontonTremblay, thanks for the reply. The image I showed is just one of many in the sequence; for example, this is how image 000120 looks: [image] I am changing the distance and the angle.

I have also tried the Auto Start Capturing option in the SceneCapturer_Simple, and then it starts from the center of the object's coordinate system. I have also tried to create a dataset without the Auto Start Capturing option, where I have to start capturing myself, but the situation is the same. I am using the OrbitalMovement to change the yaw angle, and I was thinking of using the network trained on the yaw angle and retraining it later with the pitch angle, because I couldn't find a way to change both angles at the same time. If I want to change both values in NDDS, it looks like the yaw angle has priority.

When you say 2 keypoints, what do you mean?

In the video data_overview.mp4 it says (around 20 sec) that if the visibility tag is 0 the object is fully visible, and if it is 1 it is fully invisible. In all my .json files the value is 1 (fully invisible); how can I change it to be 0? Could this cause the empty training data?

I tried to debug train.py to check the training data, and I can see the path given in the --data variable is used and it recognizes the images and the corresponding .json files: [image]

Do you see something wrong here? It seems the trainingdata is populated just fine for me. The same is true for the testdata.

This is how the output looks when I start training:

start: 08:34:17.726571
load data
training data: 4 batches
testing data: 1 batches

So this printed the line training data: 4 batches, which means the trainingdata is not None at least (200 training images with a batch size of 50 gives 4 batches, and the 50 test images give 1 batch).

Is there an explanation of how to show belief maps at the dataloader level? Could you help me with that?

Thanks

TontonTremblay commented 3 years ago

Something wrong is going on, I am not sure what. How do you call the training script?

mbajlo commented 3 years ago

hello @TontonTremblay ,

Thanks for the reply, the training script name is the same, I kept it as train.py. Do you think the name could make a difference?

TontonTremblay commented 3 years ago

How do you call it? Like python train.py --data ...? (Sorry, phone answer.)

mbajlo commented 3 years ago

Hello @TontonTremblay

You mean how I execute the training script? I do it like this, from the folder where train.py is:

python train.py --data C:\Users\m\Desktop\ML\ThorHammer_387mm_dataset\train\yaw\thorHammer_YawAngle\ --batchsize 50 --pretrained False --datatest C:\Users\m\Desktop\ML\ThorHammer_387mm_dataset\test\yaw\thorHammer_YawAngle\ --imagesize 400 --object thorHammer_387 --workers 8 --outf markoThorHammer_testBeliefExport

Or, if I debug in VS Code, I just execute the launch configuration with the same args given before, but it is basically the same thing.

TontonTremblay commented 3 years ago

Do you only have the hammer in your dataset? If so, remove the --object flag from the python command.
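
For reference, the command from the previous comment with the --object flag removed would be:

python train.py --data C:\Users\m\Desktop\ML\ThorHammer_387mm_dataset\train\yaw\thorHammer_YawAngle\ --batchsize 50 --pretrained False --datatest C:\Users\m\Desktop\ML\ThorHammer_387mm_dataset\test\yaw\thorHammer_YawAngle\ --imagesize 400 --workers 8 --outf markoThorHammer_testBeliefExport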

mbajlo commented 3 years ago

Yes, that's right, I have only the hammer. I will remove the --object flag and try again. I will let you know what happens.

Thanks

mbajlo commented 3 years ago

Hello @TontonTremblay

I have retrained the network, and the target belief maps now look like this:

[image]

So I would say it looks pretty good.

But (there is always a but :D) when I test my network on the images from the train or test datasets, or on a new image of the hammer, the results list is always empty and the belief maps look something like this:

[image]

I tried to train the network with more than one object in the image, and with changing the yaw angle or the pitch angle. The target belief maps always come back looking like the first image, but I cannot detect the object in images from the train or test dataset, and the belief maps look strange like the ones above.

Do you know what could be wrong, what should be my next step?

TontonTremblay commented 3 years ago

On the phone again.

I think you need to train for longer. I normally train for 60 epochs on a dataset of 120k images and with a batch size of 128 on 4 p100.

mbajlo commented 3 years ago

Hello @TontonTremblay

I thought you were using NVidia GPUs, but I have a Radeon GPU, so I can only train on my CPU. Just for testing I made a training dataset of 100 images of the hammer (and a test dataset of 20 images), with all images taken from the same camera position (no translation, no movement, no angle change). When I train the network for 60 epochs, I get this response when I test it:

list of found objects: []

[image]

[image]

Do you think the problem could be the visibility value; should it be 0 for full visibility? The image size is 400x400; should I make it 800x800 or more?

{
    "camera_data":
    {
        "location_worldframe": [ -990, -80, 21.502099990844727 ],
        "quaternion_xyzw_worldframe": [ 0, 0, 0, 1 ]
    },
    "objects": [
        {
            "class": "thorHammer_387",
            "instance_id": 16777200,
            "visibility": 1,
            "location": [ 0, -4.4168000221252441, 49.195098876953125 ],
            "quaternion_xyzw": [ 0, -0.70709997415542603, 0, 0.70709997415542603 ],
            "pose_transform": [
                [ -1, 0, 0, 0 ],
                [ 0, 0, 1, 0 ],
                [ 0, -1, 0, 0 ],
                [ 0, -4.4168000221252441, 49.195098876953125, 1 ]
            ],
            "cuboid_centroid": [ -0.087600000202655792, 4.3204998970031738, 48.991600036621094 ],
            "projected_cuboid_centroid": [ 199.64230346679688, 217.63749694824219 ],
            "bounding_box":
            {
                "top_left": [ 134.22529602050781, 169.74690246582031 ],
                "bottom_right": [ 296.0574951171875, 229.5238037109375 ]
            },
            "cuboid": [
                [ -6.8632998466491699, -15.043600082397461, 53.087600708007813 ],
                [ -6.8632001876831055, -15.043600082397461, 44.895599365234375 ],
                [ -6.8632001876831055, 23.684499740600586, 44.895599365234375 ],
                [ -6.8632998466491699, 23.684499740600586, 53.087600708007813 ],
                [ 6.6880002021789551, -15.043600082397461, 53.087699890136719 ],
                [ 6.6880002021789551, -15.043600082397461, 44.895599365234375 ],
                [ 6.6880002021789551, 23.684499740600586, 44.895599365234375 ],
                [ 6.6880002021789551, 23.684499740600586, 53.087699890136719 ]
            ],
            "projected_cuboid": [
                [ 174.14370727539063, 143.32530212402344 ],
                [ 169.42579650878906, 132.98399353027344 ],
                [ 169.42579650878906, 305.50930786132813 ],
                [ 174.14370727539063, 289.22799682617188 ],
                [ 225.19599914550781, 143.32539367675781 ],
                [ 229.79359436035156, 132.98399353027344 ],
                [ 229.79359436035156, 305.50930786132813 ],
                [ 225.19599914550781, 289.2279052734375 ]
            ]
        }
    ]
}

Or should I just make a dataset with more than 1000 images and then train the network for 60 epochs?

Could it be that the losses at epoch 60 are still too high?

test loss 60, 0, 0.007053715176880
train loss 60, 0, 0.007159050088376

How low should the loss be?

TontonTremblay commented 3 years ago

Well, I am not sure what to say here. You need a fairly large dataset (5k to 60k images) and you need to train for 2-3 days. I am sorry, I am not sure CPU training will work...