NVlabs / Deep_Object_Pose

Deep Object Pose Estimation (DOPE) – ROS inference (CoRL 2018)

Training and evaluation on YCB-videos #187

Open 520xyxyzq opened 2 years ago

520xyxyzq commented 2 years ago

Hi, could you please release the code for training and evaluation (e.g., the accuracy-threshold curves) on the YCB-Video dataset as described in the CoRL 2018 paper? If we have a public dataset with images, object IDs, and poses, how should we re-format the data or change the training script to train DOPE?

TontonTremblay commented 2 years ago

We use the YCB toolkit to compute the metrics. DOPE uses the projected cuboid data, so you should look into the FAT dataset, which is compatible. You can look at this example for how to do it more procedurally: https://github.com/owl-project/NVISII/blob/master/examples/21.ndds_export.py
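For reference, a rough sketch of the per-frame annotation layout used by the FAT/NDDS export that the DOPE training code reads (field names are given as I recall them and should be verified against an actual FAT frame; the values are placeholders):

import json

frame = {
    "camera_data": {},                                # intrinsics / camera pose in FAT; omitted here
    "objects": [
        {
            "class": "010_potted_meat_can_16k",       # illustrative class name
            "location": [12.3, -4.5, 78.9],           # object position in the camera frame
            "quaternion_xyzw": [0.1, 0.2, 0.3, 0.9],  # object orientation in the camera frame
            "projected_cuboid_centroid": [190.0, 210.0],
            "projected_cuboid": [                     # the 8 cuboid corners projected into the image
                [100.0, 120.0], [110.0, 300.0], [250.0, 310.0], [240.0, 130.0],
                [130.0, 110.0], [140.0, 290.0], [280.0, 300.0], [270.0, 120.0],
            ],
        }
    ],
}

with open("000000.json", "w") as f:
    json.dump(frame, f, indent=4)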

520xyxyzq commented 2 years ago

Hi Tonton, thank you for your reply! I wonder if I can ask for the synthetic data you used to train the YCB object pose estimators (like the sugar box, etc.). I'm interested in reproducing the training process and seeing how the parameters influence the results. Thank you!

TontonTremblay commented 2 years ago

I would love to, but for reasons outside my control I lost the data. I have since been able to generate data with NViSII that gives similar performance.

520xyxyzq commented 2 years ago

Thank you! I will check it out. BTW, have you tried retraining the ycb object pose estimators using nvisii?

TontonTremblay commented 2 years ago

Yes I have used nvisii to retrain the spam box. I could probably share that data if you want?


520xyxyzq commented 2 years ago

That would be great!

TontonTremblay commented 2 years ago

https://drive.google.com/file/d/1Q5VLnlt1gu2pKIAcUo9uzSyWw1nGlSF8/view?usp=sharing You might need to update the dataloader, but this is the data I used to train DOPE for this paper: https://arxiv.org/abs/2105.13962

520xyxyzq commented 2 years ago

Hi Tonton, thank you so much for sharing the data. It appears that the data is not compatible with the DOPE training script and NVDU. Taking a closer look, I think there are some naming and unit changes in the json files, etc. I wonder if you're going to (or may have already) released the new training script for NViSII-generated data (e.g., the one used for the NViSII paper)? This would be very helpful for reproducing the results. Many thanks!

TontonTremblay commented 2 years ago

I have quite a few changes to the DOPE training script for the NViSII paper. I will upload the updated version later today. If you want to make the data from the script I just uploaded work with NVDU, it would not be that hard; I just do not have the time to really dig into it. I am happy to take a PR from you if you end up doing it, though.

TontonTremblay commented 2 years ago

OK, here is the training code and the inference code I used: https://github.com/NVlabs/Deep_Object_Pose/tree/master/scripts/train2 It is also compatible with the data I shared.

520xyxyzq commented 2 years ago

Thank you very much for the prompt reply and for sharing the code! Really appreciate it! I'll try it out and report back. Just to confirm, did you use the default parameters in the script as your training hyperparameters, e.g., learning rate, noise, etc.?

TontonTremblay commented 2 years ago

Yes, I believe so. I would need to double-check on the servers, but I am pretty certain I would not have played with these for the paper. Please let me know how it goes.


520xyxyzq commented 2 years ago

No problem, thank you!

520xyxyzq commented 2 years ago

Another quick question to confirm: in README.md, the example training command is "... --epochs 2 --batchsize 10 ...". Is this usually the case for training on NViSII data, or do you generally use more epochs (60) and a larger batch size (32)? Thank you!

TontonTremblay commented 2 years ago

more something like this:

python -m torch.distributed.launch --nproc_per_node=8 train.py --network dope --epochs 10 --batchsize 64 --outf dope_cracker/ --data /cracker/ --objects 003

That is on 8 V100s. The command I put in the README is meant to run out of the box when you clone the repo; I intended it more as an example than as a way to reproduce the results. Sorry about that.

520xyxyzq commented 2 years ago

Hi, Tonton, I really appreciate your prompt responses. Can I ask another quick question regarding the coordinate frame definitions in nvisii generated data?

I'm not sure, but I think the coordinate frames are different from those in DOPE. I have been following this document for DOPE's object coordinate system, i.e., with the origin shifted to the object center, etc. I wonder if the NViSII-generated data (like the spam can data) follow the same convention?

Also, the "location" data in the json files (I believe this is the object position relative to the camera, right?) seems to have negative "z" values. I wonder how the camera coordinate frame is defined. Is it following the (x right, y down, z out) convention? Thank you!

TontonTremblay commented 2 years ago

You are correct, I think. NViSII uses Blender coordinate frames. I am not sure what we did for NDDS; I know UE4 uses a left-handed coordinate frame, but we might have converted it to the OpenCV coordinate frame.

https://www.dummies.com/article/technology/software/animation-software/blender/coordinate-systems-in-blender-142885 You have to be careful: there are multiple different frames. The important ones are world and camera, and they are slightly different. Whether z goes in or out comes from the clipping/view conventions at rasterization time.

I am not sure, but I also might have screwed up the keypoint order in the NViSII script compared to NDDS. That is why I posted the inference code as well; it might help you see what I did.

Also, I am sorry to introduce you to the pleasure of wasting days on silly coordinate frames. I for sure wasted a lot of time translating coordinate frames, so feel free to post more detailed examples here; happy to help. One way to debug everything is to visualize the coordinate frames, like you would do in ROS, or you could use Meshcat as well; we wrote some debugging tools (sorry, internal for now) to help with that. Good luck, or good patience.

520xyxyzq commented 2 years ago

Thank you so much for the reply! From a quick check, I think the camera coordinate convention here is (x right, y up, z in). And the vertex order is also a little different, but the object origin is still at the center. I'll get back to you after I dig deeper into this. Thank you!

520xyxyzq commented 2 years ago

I think the new vertex order is this:

      3 +-----------------+ 7
       /     TOP         /|
      /                 / |
   0 +-----------------+ 4|
     |      FRONT      |  |
     |                 |  |
     |  x <--+         |  |
     |       |         |  |
     |       v         |  + 6
     |        y        | /
     |                 |/
   1 +-----------------+ 5

which is different from the old NDDS order.
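For illustration, here is a minimal sketch of how the cuboid edges could be drawn from eight projected keypoints in the order shown above (the edge list is inferred from the diagram, so double-check it against the data):

import cv2
import numpy as np

# Edges of the cuboid as index pairs into the 8 projected keypoints,
# using the order shown in the diagram above.
EDGES = [(0, 4), (4, 5), (5, 1), (1, 0),   # front face
         (3, 7), (7, 6), (6, 2), (2, 3),   # back face
         (0, 3), (4, 7), (5, 6), (1, 2)]   # connecting edges

def draw_cuboid(image, keypoints_2d, color=(0, 255, 0)):
    # Overlay the projected cuboid on an image; keypoints_2d is an (8, 2) array.
    pts = np.asarray(keypoints_2d).astype(int)
    for i, j in EDGES:
        cv2.line(image, tuple(pts[i]), tuple(pts[j]), color, 2)
    return image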

520xyxyzq commented 2 years ago

And I also opened a PR to add testing epochs to training. The only weird thing is that tensorboardX sometimes throws Warning:root:NaN or Inf found in input tensor, but I don't see any NaN or Inf in the final weights. Have you ever encountered this before? Thank you!
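(As an illustrative aside: one hypothetical way to find which logged value triggers that warning, assuming the standard tensorboardX SummaryWriter is used, is to wrap the logging call; the wrapper and names below are made up for the example.)

import math
from tensorboardX import SummaryWriter

writer = SummaryWriter("runs/debug_nan")   # hypothetical log directory

def log_scalar(tag, value, step):
    # Report the offending tag before tensorboardX emits its generic warning.
    v = float(value)
    if not math.isfinite(v):
        print("non-finite value for %s at step %d: %r" % (tag, step, v))
    writer.add_scalar(tag, v, step)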

TontonTremblay commented 2 years ago

The coordinate frame you put looks a little wrong. It looks like the coordinate frame for the camera. But if you are using the world pose, y is up.

Also, I was never able to figure out what causes that warning.

520xyxyzq commented 2 years ago

Hi Tonton, thank you for the reply. I'm still confused about the coordinate systems. I made a simple script to plot the vertices and object axes on a spam can image (001/00080.png). This is what I got (00080_out.png), and I also attached the source code with the source data files (debug.zip). My assumption is that the camera frame follows the Blender (view) convention (x right, y up, z in). I'm not 100% sure that's true, but I think the z axis must point inward since the z values of "location" are negative. Can you please take a look when you get a chance? Many thanks!! (Attachments: 00080_out.png, debug.zip)

TontonTremblay commented 2 years ago

The pose in camera space uses a different coordinate system than the world coordinate system. If you use the camera pose, you are correct that z goes out of the view. If you want the OpenCV convention, you add the following transform.

import numpy as np
import transforms3d

def visii_camera_frame_to_rdf(T_world_Cv):
    """Rotate the visii camera frame (right-up-back) to right-down-forward (OpenCV).

    Cv = visii camera frame (right-up-back); C = OpenCV camera frame.
    """
    T_Cv_C = np.eye(4)
    T_Cv_C[:3, :3] = transforms3d.euler.euler2mat(np.pi, 0, 0)  # 180-degree rotation about x

    T_world_C = T_world_Cv @ T_Cv_C
    return T_world_C

thanks to @manuelli for this code.

If you do that, you will have your pose in the OpenCV coordinate frame. Let me know if this helps; sorry, I did not have time to look into your code. I will once work restarts, if this does not help you. For debugging, I have been using Meshcat to debug poses; it is pretty easy to add and remove transforms. You can also use Open3D, as in this example: https://github.com/owl-project/NVISII/blob/master/examples/19.depth_map_to_point_cloud.py
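As a usage sketch (the "location" value below is made up): a 180-degree rotation about x maps the right-up-back camera frame to OpenCV's right-down-forward, so objects in front of the camera end up with positive z.

import numpy as np

# Illustrative "location" in the right-up-back camera frame: z is negative in front of the camera.
t_Cv_obj = np.array([0.10, 0.05, -0.80])

# 180-degree rotation about x (its own inverse).
R_Cv_C = np.diag([1.0, -1.0, -1.0])

t_C_obj = R_Cv_C @ t_Cv_obj
print(t_C_obj)   # [0.10, -0.05, 0.80] -> z now positive (OpenCV convention)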

520xyxyzq commented 2 years ago

Thank you Tonton for the explanation! I think I understand the coordinate frame conventions now.

520xyxyzq commented 2 years ago

Hi Tonton, thank you for merging the PR. I have tried training on the spam can data (on 2 GPUs for 120 epochs) and tested its performance on the YCB-V dataset. Here is an example video. The detections look very noisy. Is that consistent with what you observed? (I noticed that the spam can detector also had worse performance than other objects in the CoRL 2018 paper.) (I also tried fine-tuning the model with YCB-V ground truth, and it greatly improved the model's performance on the test sets.)

I'm wondering if you have other YCB objects' NVISII data at hand? I'm also trying to reproduce the data generation for YCB objects using this script. I wonder if you have any tips or suggestions on that? Can you share what parameters were used for the spam can data? Thank you very much!

TontonTremblay commented 2 years ago

So I have been looking around my machine for like 10 minutes to find a video, and sadly I do not have one. I also asked Yunzhi (I do not have his GitHub handle).

Something I remember, though, and I am not sure I fixed it when uploading the code: there was a difference in the image processing between training and inference. I believe one was possibly clamped and the other was not. Also, the values used for normalization might be different. Make sure the inference image processing matches your training image processing. Sorry, this was a mistake on my end that I should have paid attention to, but I have not touched this code in almost 2 years.
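To illustrate the kind of mismatch being described (the exact transform and values in the repo may differ; this is only an example of what to check):

from torchvision import transforms

# Whatever normalization is used at training time...
train_tf = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# ...must be reproduced exactly at inference time; a different mean/std
# (or a missing clamp/scale step) will quietly degrade the belief maps.
inference_tf = train_tf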

I have found the weights, though: https://drive.google.com/file/d/1ergPUxKDRoeIOvqWOpmWzoR-jmNk0ou1/view?usp=sharing (trained on the shiny data) and https://drive.google.com/file/d/14vidcerMyQy-cCjfd7n5c4HPZQFlzk0A/view?usp=sharing for the original model. You can probably run these to compare with your results.

TontonTremblay commented 2 years ago

Regarding the nvisii script, yes I can share more information about what was used to generate the data.

import cv2
import numpy as np
import visii

# Material parameter values (referenced below as data_json):
data_json = {
    'roughness_1': 0.5,
    'roughness_2': 0.3,
    'metallic_1': 0.4,
    'metallic_2': 0.99,
    'specular_1': 0.7,
    'specular_2': 0.9,
}

# Mask used to decide which texels get the *_1 vs *_2 material values.
mask = cv2.imread(opt.path_obj + "/google_16k/mask.png")
# mask = np.zeros((8000, 8000, 3))
mask = mask / 255.0
# Stack the green channel four times to get an RGBA-shaped float texture.
mask = np.concatenate(
    [
        mask[:, :, 1].reshape(mask.shape[0], mask.shape[1], 1),
        mask[:, :, 1].reshape(mask.shape[0], mask.shape[1], 1),
        mask[:, :, 1].reshape(mask.shape[0], mask.shape[1], 1),
        mask[:, :, 1].reshape(mask.shape[0], mask.shape[1], 1),
    ],
    2,
)
mask = np.flipud(mask).astype(np.float32)

########

# Build per-texel roughness / metallic / specular textures from the mask:
# dark mask texels get the *_1 values, bright texels get the *_2 values.
# (Thresholding against the original mask avoids overwriting values that
# themselves exceed 0.5, e.g. specular_1 = 0.7.)

roughness_texture = mask.copy()
roughness_texture[mask < 0.5] = data_json['roughness_1']
roughness_texture[mask > 0.5] = data_json['roughness_2']
roughness_texture = visii.texture.create_from_data(
    'roughness',
    roughness_texture.shape[0],
    roughness_texture.shape[1],
    roughness_texture,
    # linear=True,
)
material_shiny.set_roughness_texture(roughness_texture)

metallic_texture = mask.copy()
metallic_texture[mask < 0.5] = data_json['metallic_1']
metallic_texture[mask > 0.5] = data_json['metallic_2']
metallic_texture = visii.texture.create_from_data(
    'metallic',
    metallic_texture.shape[0],
    metallic_texture.shape[1],
    metallic_texture,
    # linear=True,
)
material_shiny.set_metallic_texture(metallic_texture)

specular_texture = mask.copy()
specular_texture[mask < 0.5] = data_json['specular_1']
specular_texture[mask > 0.5] = data_json['specular_2']
specular_texture = visii.texture.create_from_data(
    'specular',
    specular_texture.shape[0],
    specular_texture.shape[1],
    specular_texture,
    # linear=True,
)
material_shiny.set_specular_texture(specular_texture)

The code for the material looked something like this, where the values are taken from above.

I have also attached the textures I use: texture_map_flat and mask.

TontonTremblay commented 2 years ago

Also, I am sorry your results are not that great. Full disclaimer: I do not think our results were much better. The point of the experiment was to show that modeling light and metallic reflection allows us to do better detection / pose estimation.

520xyxyzq commented 2 years ago

Thank you for the prompt reply! I will try that out and report back.

TontonTremblay commented 2 years ago

OK, I messed up: the normalization is not the same between training and inference, so I updated it to be the same and pushed to the repo. Let me know if this helps.

TontonTremblay commented 2 years ago

Thanks to Yunzhi, we have a comparison. https://drive.google.com/file/d/1OG8JFusj2sLKrJMVN2_y9d3gw3v1uFjw/view?usp=sharing https://drive.google.com/file/d/1DUfDtwZD56jcbG5gkBxIh_VaZt3PaQTA/view?usp=sharing

One is the comparison between NViSII_shiny and NDDS, and the other is the comparison between NViSII_original and NDDS. Red is NViSII_original, green is NViSII_shiny, and blue is NDDS.

You can see that it is not as stable as NDDS for frame-to-frame detection, similar to yours. I did not have enough time to dig into it, but the data diversity was not as wide with the NViSII data, since the objects are only flying around. Also, maybe post-processing could solve most of these problems. Not sure, but I hope this helps.

520xyxyzq commented 2 years ago

Thank you @TontonTremblay for the update and @Uio96 Yunzhi for the videos!

I tried the new normalization. The results seem to be slightly improved but are still very noisy. Here is the video. (YCB-V seq 8 is a challenging one.)

I also tried using the weights you shared for inference on the same seq. Here is the video. I think the two results are qualitatively similar.

I will try generating some new YCB object data and get back to you. Thank you!

TontonTremblay commented 2 years ago

I am not a huge fan of the YCB-videos tbh. The rgb quality is bad.

I would say start by debugging the current weights more. What are the belief maps doing: are we getting a lot of false positives, or are we just missing the object? What are the peak values? What causes the detections to wiggle in size: is it the peak detection or something else? Is there a way to generate synthetic data that covers these cases?

Also, there is a lot going on here. The data is different, but so is the training script; I might have made changes that hurt the training. Something that might be interesting would be to use the old training script. I think how the belief maps are generated is different: the old version generates the belief map at the input resolution and then rescales it to the output size, e.g., 400x400 to 50x50, whereas in the second version I generate it directly at the output size.
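For concreteness, a rough sketch of the two strategies being contrasted (the sizes and sigma here are made up for illustration):

import cv2
import numpy as np

def gaussian_map(h, w, cx, cy, sigma):
    # Single-keypoint belief map: a 2D Gaussian centred at (cx, cy).
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

# Old version: draw the belief map at the input resolution, then rescale to the output size.
belief_old = cv2.resize(gaussian_map(400, 400, 200, 120, sigma=16), (50, 50))

# New version: draw it directly at the output resolution (keypoint coordinates scaled by 1/8).
belief_new = gaussian_map(50, 50, 200 / 8.0, 120 / 8.0, sigma=2)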

Also, the size of the belief maps might be interesting too. I did some experiments to find a better size, I believe, but maybe it does not train well.

And for the synthetic data: adding scenes where objects fall onto a plane, like FAT, or the data you get from BlenderProc, and mixing that with the dome-DR data I shared, would also improve the results.

Anyway, all in all, there are a lot of things that can influence the differences. Probably the major thing is that I spent 2 years working on the original training script/inference and data. It is messy because there is a lot of exploration in there, but it works because of that. The second version was probably a few months of work, and I am not sure I did not break something in the process. I also did not test it as extensively as the first version. Sorry about that.

520xyxyzq commented 2 years ago

Thank you @TontonTremblay for the great suggestions! I will try playing with them.

Regarding data generation, I'm a little confused about the snippets you shared. Are they snippets from this script? If so, where should I insert them? Sorry that I'm not very familiar with pybullet and NViSII. I'd really appreciate it if you could provide more details.

And may I also ask some quick questions? (1) Are the distractors sampled from the "google scanned object dataset" as suggested in readme? They look very different. I think a very large nb_distractors is used, right?

(2) No HDRI maps are used for the spam can data, right?

(3) Are you using the model from the YCB website, or the one that is processed to have origin at center?

(4) How can we control the object motion so it doesn't go beyond the image frame?

Thank you!

520xyxyzq commented 2 years ago

Hi @TontonTremblay, I have a quick question regarding the evaluation on the YCB videos. I wonder what thresholds were used for inference on the test set. Are they the same as those listed in config_pose.yaml? And do we assume that a missed pose detection (with fewer than 4 "valid points") does not pass any ADD test? Many thanks!

TontonTremblay commented 2 years ago

Sorry for the late reply. The code I shared is related to how you load the 3D model and its texture. Look at the NViSII examples about procedural textures; it will make more sense. Try to load the spam model instead of the textured floor.

1) The objects are just randomly generated from meshes we create in NViSII.
2) That is correct, only point lights.
3) It is the processed one: https://drive.google.com/file/d/1UoKklb33EU54wH6EX-V8rA05768JJ6XJ/view?usp=sharing They are all there.
4) Use pybullet and create a frustum that makes the object bounce. This is what the NViSII script does. Try to run it with --interactive.

The thresholds are the same as in config_pose, I think; not 100% sure, though. You should play with them. If you try to run PnP on fewer than 4 points, it won't work well, but you can try.
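For reference, a minimal PnP sketch (the cuboid dimensions, intrinsics, and pose below are placeholders, not values from the repo); with fewer than 4 correspondences cv2.solvePnP cannot recover a unique pose:

import cv2
import numpy as np

# 3D cuboid corners in the object frame (placeholder half-extents, metres);
# the order must match the detected keypoint order.
cuboid_3d = np.array([[x, y, z] for x in (-0.05, 0.05)
                      for y in (-0.04, 0.04)
                      for z in (-0.03, 0.03)], dtype=np.float32)

# Placeholder intrinsics (fx, fy, cx, cy) -- in practice, use the camera values from config_pose.yaml.
K = np.array([[615.0, 0.0, 320.0],
              [0.0, 615.0, 240.0],
              [0.0, 0.0, 1.0]])

# Fake a detection by projecting the cuboid with a known pose, then solve it back.
rvec_gt, tvec_gt = np.zeros(3), np.array([0.0, 0.0, 0.8])
cuboid_2d, _ = cv2.projectPoints(cuboid_3d, rvec_gt, tvec_gt, K, None)

ok, rvec, tvec = cv2.solvePnP(cuboid_3d, cuboid_2d, K, None)
print(ok, tvec.ravel())   # recovers roughly [0, 0, 0.8]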

520xyxyzq commented 2 years ago

Thank you for the reply!!