ethnhe / PVN3D

Code for "PVN3D: A Deep Point-wise 3D Keypoints Hough Voting Network for 6DoF Pose Estimation", CVPR 2020
MIT License

CUDA out of memory for my own trained checkpoint files #19

Closed: saadehmd closed this issue 4 years ago

saadehmd commented 4 years ago

This is strange, but I have been able to run all the checkpoint files provided by you on my NVIDIA Quadro T1000 (4 GiB) at a reasonable frame rate (2 FPS). Then I trained on my own data using your training scripts, and now, while loading the checkpoint files I generated, the same GPU gives an OUT OF MEMORY error. How is this possible if the model is the same? I also tried running this same checkpoint on the more powerful SLURM GPUs (2x 32 GiB NVIDIA Tesla V100) that I used for training, and the inference time was as bad as 11 s/it for a test minibatch size of 1.

I also checked my own data; my images are almost equal in size (in KBs) to the LINEMOD images. I suspect I might be doing something wrong while training.

saadehmd commented 4 years ago

Also, I noticed the following in the load_Checkpoint() function:

    try:
        # The checkpoint I trained always gets loaded here
        checkpoint = torch.load(filename)  # , map_location='cpu')
    except:
        # Your checkpoints are always loaded here; if I don't catch the
        # exception, torch.load(filename) raises
        # RuntimeError: Invalid magic number; corrupt file?
        checkpoint = pkl.load(open(filename, "rb"))

Also, if I load my checkpoint in the except branch, it's not loaded properly; it just comes out as a random integer.

I suspect there's a difference in the way the two checkpoints were saved.
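For reference, a minimal sketch of the two save/load paths being compared (the filenames and the placeholder state dict are just for illustration, not from the repo):

    import pickle as pkl
    import torch

    # Checkpoints written by the training scripts go through torch.save(),
    # which uses PyTorch's own serialization format, so torch.load() works:
    state = {"model_state": {}, "optim_state": {}, "epoch": 0}  # placeholder contents
    torch.save(state, "my_checkpoint.pth.tar")
    ckpt = torch.load("my_checkpoint.pth.tar", map_location="cpu")

    # The released checkpoints, as noted above, only open as plain pickles;
    # a plain pickle.dump() file fed to torch.load() is what raises
    # "RuntimeError: Invalid magic number; corrupt file?"
    with open("released_checkpoint.pkl", "wb") as f:
        pkl.dump(state, f)
    with open("released_checkpoint.pkl", "rb") as f:
        released_ckpt = pkl.load(f)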

saadehmd commented 4 years ago

I didn't notice it before, but it's actually coming from the eval_pose_parallel() function: the fit() method in meanshift_pytorch.py is allocating some tensors that are unexpectedly huge. Can you tell what might be going wrong in the assignment 'Cr = C.view(N, 1, c).repeat(1, N, 1)'?

    File "/home/ahmad3/PVN3D/pvn3d/lib/utils/pvn3d_eval_utils.py", line 189, in eval_one_frame_pose_lm
        ctr, ctr_labels = ms.fit(pred_ctr[cls_msk, :])
    File "/home/ahmad3/PVN3D/pvn3d/lib/utils/meanshift_pytorch.py", line 34, in fit
        Cr = C.view(N, 1, c).repeat(1, N, 1)
    RuntimeError: CUDA out of memory. Tried to allocate 1.69 GiB (GPU 0; 3.81 GiB total capacity; 1.85 GiB already allocated; 1022.19 MiB free; 2.77 MiB cached)

ethnhe commented 4 years ago
Michael187-ctrl commented 4 years ago

Hey saadehmd, can you please explain to me how you created the training data for your own dataset?

saadehmd commented 4 years ago

I collected all my data in Gazebo using a Kinect. Since I had accurate CAD models of all the target objects I was working with, I simply converted them to .obj files and placed them in Gazebo with a bunch of different background environments. Any other lightweight rendering tool like Open3D or PyRender could be used in a similar manner; I just used Gazebo because there are readily available ROS-based drivers and configuration files for the Kinect in Gazebo's online tutorials, so I wouldn't have to set up a custom perspective camera in another renderer. Also, Gazebo is a simulation engine with easy user interaction, where you can simply pick and place a variety of background and foreground clutter objects and surrounding environments without writing any extra lines of code.

As for the data collection protocol, I simply followed the original YCB object dataset method: sample RGB and depth images uniformly with camera poses on the upper hemisphere around the object. The object-label images can simply be obtained by projecting the XYZ points from the .pcd (point cloud) or points.txt of your target objects to their xy pixel coordinates, using the ground-truth pose of the object; a rough sketch of that step is below. I am still cleaning up the code a bit; I might push it to my branch by the end of this month. You can refer to that if it might be useful.
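Roughly, that projection step looks like this; the function name, the (3, 4) pose layout, and the intrinsic matrix K are my own illustration, not code from this repo:

    import numpy as np

    def render_label_image(model_xyz, RT, K, h, w, cls_id):
        # model_xyz: (N, 3) object model points from points.txt / the .pcd file
        # RT:        (3, 4) ground-truth object pose [R | t] in the camera frame
        # K:         (3, 3) camera intrinsic matrix
        # cls_id:    integer class id written into the label image
        pts_cam = model_xyz @ RT[:, :3].T + RT[:, 3]      # object -> camera frame
        uv = pts_cam @ K.T                                # pinhole projection
        u = np.round(uv[:, 0] / uv[:, 2]).astype(np.int64)
        v = np.round(uv[:, 1] / uv[:, 2]).astype(np.int64)
        keep = (pts_cam[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        label = np.zeros((h, w), dtype=np.uint8)         # background stays 0
        label[v[keep], u[keep]] = cls_id
        return label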

saadehmd commented 4 years ago

@ethnhe Thanks for your help; this issue is solved. I was actually sampling the points wrong. For some reason my model predicted all 12280 points to be of the object class, and that was too huge for the tensors allocated in the mean-shift algorithm.

pyni commented 4 years ago

@saadehmd Hi~ I have met the same issue as you. I cannot understand why the prediction process is so fast for the LINEMOD dataset while it is so slow for my own dataset. Would you please share with us which parameters you changed? self.n_sample_points? Or some other tricks? Thanks.

saadehmd commented 4 years ago

@pyni For me, the issue was with the way I collected the training data; there was a small thing I missed. The ground-truth pose in my gt.yaml file is expected to have R and t in the camera optical frame. After multiplying the inverse(cam_in_world) matrix with obj_in_world, I forgot to apply the mirroring transform, which is to just rotate obj_in_cam by (1.57, 0, 1.57) in the 'zyx' convention (see the sketch below).

Without this step, I unintentionally trained the network on wrong object poses, which meant that it considered almost anything in the background as the target object. That made sense, because my segmentation loss converged while the pose-estimation loss remained huge. While testing, it then predicted all 12280 points to be of the object, so the tensors allocated for pose calculation were huge and CUDA always RAN OUT OF MEMORY. I just added this mirroring transform in the dataloader file, retrained, and everything went back to normal.
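To make that concrete, here is roughly what I mean (SciPy is only used for the Euler-angle conversion; the multiplication order and the (1.57, 0, 1.57) angles are what worked for my Gazebo/Kinect setup, so double-check them against your own frame conventions):

    import numpy as np
    from scipy.spatial.transform import Rotation

    def obj_pose_in_optical_frame(obj_in_world, cam_in_world):
        # obj_in_world, cam_in_world: 4x4 homogeneous poses recorded from the simulator.
        obj_in_cam = np.linalg.inv(cam_in_world) @ obj_in_world
        # "Mirroring" transform: rotate from the simulator camera frame
        # into the camera optical frame before writing gt.yaml.
        mirror = np.eye(4)
        mirror[:3, :3] = Rotation.from_euler('zyx', [1.57, 0.0, 1.57]).as_matrix()
        return mirror @ obj_in_cam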