NVlabs / Deep_Object_Pose

Deep Object Pose Estimation (DOPE) – ROS inference (CoRL 2018)

Questions regarding NViSII data generation #215

Closed ghost closed 2 years ago

ghost commented 2 years ago

My goal is to localize a single pallet that is not occluded and is always laying on the ground. I used the provided nvisii data generation script as a starting point and modified it so that only one instance of the pallet is loaded and only rotated around the vertical axis without any distractors. Here are some example images:

[example renders: 00050, 00058, 00057, 00088]

I created the camera entity from intrinsics and included the height and width of the images that the camera records (640x480). Does the new training script require the images in the data set to be square, e.g. 400x400? If so, should I just set the height and width inside the create_from_intrinsics command to the same value, or would the created images then no longer resemble the actual camera properties?

Or is there a better way to render only a square section of the camera image?

TontonTremblay commented 2 years ago

Your images look pretty good. If you look at the training code (https://github.com/NVlabs/Deep_Object_Pose/blob/master/scripts/train2/utils_dope.py#L260), it uses random crops of the image. Since DOPE training is purely image based, the resolution you render at does not matter, so you can render your images at 512x512; the random crop is a data augmentation, so it will help the training process. I would say you should be careful with the symmetry in your object. @mintar Do you think we could add a little paragraph to train2/readme.md on using your script for symmetries? I think it might help people who are training on symmetrical objects.
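For illustration, here is a minimal sketch of that kind of random-crop augmentation (a hypothetical example using torchvision; the 400x400 crop size is an assumption, not necessarily the value used in utils_dope.py, and in the real training code the keypoint labels are of course cropped consistently with the image):

from PIL import Image
import torchvision.transforms as T

# A non-square render (e.g. 640x480) is reduced to a fixed square crop,
# so the render resolution itself is not critical for training.
random_crop = T.RandomCrop(400)        # 400x400 is an assumed example size

img = Image.open("00050.png")          # one of the rendered training images
img_cropped = random_crop(img)         # a random 400x400 region of the 640x480 render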

mintar commented 2 years ago

@markusiscoding: I agree with @TontonTremblay that your 640x480 training images should be fine. During the training stage, the camera intrinsics don't matter (only during inference). And yes, you'll have to take care of the symmetries. Since you're limiting the rotation of the object anyway, an easy way to do this would probably be to make sure that the x axis of the mesh never points towards the camera. Specifically: if x_axis_mesh is one of the two horizontal axes of the mesh, z_axis_mesh is the vertical axis of the mesh, and translation is the position of the object in the camera coordinate system (where the z axis of the camera points along the view direction of the camera), then whenever x_axis_mesh.dot(translation) < 0, flip the object around its z axis. This is similar to what I did in my flip_symmetrical_objects.py script.
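A minimal numpy sketch of that check, assuming the object pose is stored as a rotation matrix R_cam_obj and a translation t_cam_obj in the camera frame (the function name and the right-multiplied flip are illustrative, not the exact code from flip_symmetrical_objects.py):

import numpy as np

def flip_if_x_faces_camera(R_cam_obj, t_cam_obj):
    # R_cam_obj: 3x3 rotation of the mesh in the camera frame
    # t_cam_obj: 3-vector position of the mesh in the camera frame
    x_axis_mesh = R_cam_obj[:, 0]            # mesh x axis expressed in camera coordinates
    if np.dot(x_axis_mesh, t_cam_obj) < 0:   # x axis points back towards the camera
        # 180 degree rotation about the mesh z axis (applied in the mesh frame)
        R_flip_z = np.diag([-1.0, -1.0, 1.0])
        R_cam_obj = R_cam_obj @ R_flip_z
    return R_cam_obj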

In general I believe your training data set is too restricted - you'll probably get better results if you add more distractors, more view angles of the object, different distances to the object and more (or less) than one training object per image. The point of synthetic training data set generation isn't to show the object in "ideal" conditions, but to have a superset of what is expected to happen in the real application. It's okay if there are effects in the training data set that don't happen in real life, but it will hurt performance if there are effects in real life that didn't occur in the training data. That's why you want all kinds of weird effects in the training data set - occlusions, shadows, extreme lighting conditions, extreme close-ups, touching objects, weird angles, and so on. Only exclude conditions that you can 100% guarantee will never happen (such as the object being flipped upside down, if that really can never happen), and err on the side of including them anyways.

mintar commented 2 years ago

@TontonTremblay wrote:

@mintar Do you think we could add a little paragraph to train2/readme.md on using your script for symmetries? I think it might help people who are training on symmetrical objects.

Yes, that would be a really good improvement - many people struggle with this. However, there are a couple of problems with the current flip_symmetrical_objects.py script:

So before we can really add it, somebody would have to address these issues. Unfortunately, I don't have time for that right now, but if someone else could do that and submit a pull request here, that would be awesome.

ghost commented 2 years ago

Thank you for your fast and descriptive replies! I trained DOPE on a data set generated with NDDS before and also encountered issues with symmetry (180 degree jumps in orientation around the vertical axis during testing, while position estimation was still fine). I was wondering how I could get rid of that.

Since you're limiting the rotation of the object anyway, an easy way to do this would probably be to make sure that the x axis of the mesh never points towards the camera. Specifically: if x_axis_mesh is one of the two horizontal axes of the mesh, z_axis_mesh is the vertical axis of the mesh, and translation is the position of the object in the camera coordinate system (where the z axis of the camera points along the view direction of the camera), then whenever x_axis_mesh.dot(translation) < 0, flip the object around its z axis. This is similar to what I did in my flip_symmetrical_objects.py script.

@mintar If I understand your explanation correctly, I want to limit the pallet's rotation around the vertical axis so that the x axis always stays within the 180° range facing away from the camera. In the modified script that I am using, I randomize the rotation around the z axis, so my approach would be to limit the rotation like this:

import math
import random
import visii

# Limit the yaw so the mesh's x axis never points towards the camera
# (assuming the x axis faces the camera at theta = 0).
theta = math.radians(random.uniform(90, 270))
pallet.set_rotation(
    visii.quat(            # quaternion (w, x, y, z): rotation by theta about the vertical z axis
        math.cos(theta / 2),
        0,
        0,
        math.sin(theta / 2),
    )
)

Does that make sense or did I misunderstand?

Regarding the other comments: I guess I misunderstood the idea of generating photorealistic synthetic data. I thought data that looks as close to real images as possible would be ideal, so I only used indoor HDRIs with environments similar to the test hall at my university and limited the position of the pallet to the distance at which I wanted to be able to localize it. I guess I will add some outdoor HDRIs and some distractors as well. Would this still count as domain randomization even though the textures of the pallet and the background images are not unrealistic colors/patterns?

mintar commented 2 years ago

Does that make sense or did I misunderstand?

Yes, that's exactly what I meant. What you want to avoid is two images where the object looks identical, but one is labeled with pose "x, y, z, roll, pitch, yaw" and the other one is labeled with "x, y, z, roll, pitch, (yaw + pi)" (which is what will happen if the object has a 180° symmetry around z).
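Equivalently, you can think of it as making every pose label use one canonical yaw out of the two indistinguishable ones. A hedged sketch of that canonicalization (assuming yaw is measured about the vertical z axis of the mesh; restricting the rendered rotation range, as discussed above, achieves the same thing):

import math

def canonical_yaw(yaw):
    # Map yaw and yaw + pi to the same label for an object with a 180 degree symmetry about z.
    yaw = math.atan2(math.sin(yaw), math.cos(yaw))   # wrap to (-pi, pi]
    if yaw <= -math.pi / 2 or yaw > math.pi / 2:
        # shift the label into the canonical half-range (-pi/2, pi/2]
        yaw = math.atan2(math.sin(yaw + math.pi), math.cos(yaw + math.pi))
    return yaw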

Regarding the other comments: I guess I misunderstood the idea of generating photorealistic synthetic data. I thought data that looks as close to real images as possible would be ideal, so I only used indoor HDRIs with environments similar to the test hall at my university and limited the position of the pallet to the distance at which I wanted to be able to localize it.

In an ideal world, the (synthetic) training data would look exactly the same as the (real) testing data. In that sense, your training data looks really good. But in practice it's really hard to correctly model 100% of the effects that you see in your testing data. For example, you'll probably have some occlusions, some different object-camera angles (e.g., if the camera is rotated slightly) and so on. That's why I believe it's better to include too many effects (even unrealistic ones) than too few.

I guess I will add some outdoor HDRIs and some distractors as well.

If you have the time, you could do both: train model A with your current dataset and model B with the extended dataset, then compare the inference results (on real images!). Also, when testing, try to identify classes of situations where the recognition fails, and include them in the next iteration of the dataset. For example, I've had cases where objects very close to the camera or very far away were not recognized, or where strong reflections were a problem; I added those cases to my training data and retrained a much better model.

ghost commented 2 years ago

Okay, thank you for your tips. They are very helpful!

wetoo-cando commented 6 months ago

@mintar "During the training stage, the camera intrinsics don't matter (only during inference)."

However, it's important to specify the "correct" intrinsics to NVISII to generate "good" synthetic training images, right?

And "correct" here would mean that the intrinsics specified to NVISII should ideally be the same as the ones used at inference time?

mintar commented 6 months ago

Yes, using the same intrinsics for the synthetic training images and the real camera during inference is ideal. What I was trying to say is that DOPE is very robust to moderate differences in camera intrinsics between the synthetic training images and the real camera, because DOPE uses a two-stage approach (first predicting keypoints, then solving Perspective-n-Point). This is in contrast to approaches that do direct pose regression; with those, the camera intrinsics used during training get "baked into" the trained net.
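For context, here is a minimal sketch of that second stage, where the intrinsics only enter at inference time (a hypothetical example using OpenCV's solvePnP with made-up cuboid dimensions, pose, and a 640x480 pinhole camera; not DOPE's actual inference code):

import numpy as np
import cv2

# Stage 1 (the network) predicts 2D keypoints, e.g. the 8 cuboid corners of the object.
# Stage 2 below recovers the 6-DoF pose from those keypoints using the camera intrinsics.

# Cuboid corners of a 1.2 x 0.8 x 0.15 m pallet in the object frame (example values, meters)
hx, hy, hz = 0.6, 0.4, 0.075
object_points = np.array(
    [[sx * hx, sy * hy, sz * hz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
    dtype=np.float64)

# Intrinsics of the real camera used at inference (example 640x480 pinhole model)
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])
dist = np.zeros(5)  # assuming an undistorted image

# Synthesize plausible "predicted" keypoints from a made-up ground-truth pose
rvec_gt = np.array([0.1, -0.4, 0.05])
tvec_gt = np.array([0.1,  0.0,  2.5])
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, dist)

# Perspective-n-Point: the only place the intrinsics are needed
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
print(ok, tvec.ravel())  # recovers roughly (0.1, 0.0, 2.5)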