MegEngine / OMNet

OMNet: Learning Overlapping Mask for Partial-to-Partial Point Cloud Registration, ICCV 2021, MegEngine implementation.
MIT License

It takes longer and longer to train an epoch #5

Closed. XiangLiu0731 closed this issue 2 years ago.

XiangLiu0731 commented 3 years ago

Thank you for sharing your code! When I tried to retrain the network, I found that each epoch takes longer and longer to train, and I'm not sure whether that's normal.

JikC commented 3 years ago

It looks like you've already run the code successfully. Did you run into a problem with `import nori2 as nori`? I can't install it, and it isn't available via pip. Looking forward to your reply.

XiangLiu0731 commented 3 years ago

Yes, I trained the network on the ModelNet40 dataset, which does not need the nori2 package, so I just commented out `import nori2 as nori` (I also failed to install nori2).

JikC commented 3 years ago

Ok, thanks for your reply!

35p32 commented 3 years ago

> Ok, thanks for your reply!

Sir, have you dealt with the zip problem?

35p32 commented 3 years ago

> Ok, thanks for your reply!

Can you please share your solution?

hxwork commented 2 years ago

> Yes, I trained the network on the ModelNet40 dataset, which does not need the nori2 package, so I just commented out `import nori2 as nori` (I also failed to install nori2).

I am sorry, I forgot to delete that line. nori2 is not required in this project, and you can simply remove the import. I will fix this bug.
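For reference, a minimal sketch of guarding the import instead of deleting it (my own suggestion, not part of the repo; it assumes the import sits at the top of the dataset module):

try:
    import nori2 as nori  # not actually required in this project
except ImportError:
    nori = None  # safe fallback on machines where nori2 cannot be installed

With this guard, the code still runs even where nori2 is unavailable.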

hxwork commented 2 years ago

> Thank you for sharing your code! When I tried to retrain the network, I found that each epoch takes longer and longer to train, and I'm not sure whether that's normal.

I am not sure what batch size you are using. In our paper, we use batch_size=64, and it takes only about 1 minute to train an epoch. [training log screenshot]

JikC commented 2 years ago

> Sir, have you dealt with the zip problem?

Sorry, I haven't dealt with it.

XiangLiu0731 commented 2 years ago

> I am not sure what batch size you are using. In our paper, we use batch_size=64, and it takes only about 1 minute to train an epoch. [training log screenshot]

I trained the network on an NVIDIA GeForce RTX 2080 SUPER GPU with batch_size=8. The first epoch takes about 9 minutes, the second about 15 minutes, and each subsequent epoch takes longer still. I suspect there may be a memory-leak bug in the code, but I am not familiar with MegEngine and could not find it. Have you encountered similar problems during training? [training log screenshot]

hxwork commented 2 years ago

> I trained the network on an NVIDIA GeForce RTX 2080 SUPER GPU with batch_size=8. The first epoch takes about 9 minutes, the second about 15 minutes, and each subsequent epoch takes longer still. I suspect there may be a memory-leak bug in the code, but I am not familiar with MegEngine and could not find it. Have you encountered similar problems during training? [training log screenshot]

Hi,

I have tried to reproduce the problem you encountered, but the training process runs normally. [training log screenshot] Each epoch takes approximately 3 minutes. Besides, I notice that your dataset is not the same as the one I open-sourced, since with batch_size=8 the number of iterations per epoch is 524. Therefore, I suspect there may be a bug in your dataloader.

XiangLiu0731 commented 2 years ago

> Hi,
>
> I have tried to reproduce the problem you encountered, but the training process runs normally. [training log screenshot] Each epoch takes approximately 3 minutes. Besides, I notice that your dataset is not the same as the one I open-sourced, since with batch_size=8 the number of iterations per epoch is 524. Therefore, I suspect there may be a bug in your dataloader.

Yes, I define my own dataloader. I checked it and there seems to be no bug in it. I would appreciate it if you could confirm this for me. My dataloader code is as follows:

class MyModelNetloader(Dataset):
    def __init__(self, num_points=1024, partition='train', gaussian_noise=False, alpha=0.75, factor=4, FSP=False):
        super(MyModelNetloader, self).__init__()
        self.num_points = num_points
        self.partition = partition
        self.gaussian_noise = gaussian_noise
        self.rot_factor = factor
        self.FSP = FSP
        ## load data for training/validation/test
        if partition == 'val':
            self.data, self.label = load_data('train')  # returns ModelNet40 points [N, 2048, 3] and labels [N, 1]
            np.random.seed(1)
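            # NOTE: np.random.permutation returns a shuffled copy and does not shuffle
            # in place; the return value is discarded here, so self.data keeps its
            # original order.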
            np.random.permutation(self.data)
            num_val = int(len(self.data) * 0.2)
            self.data, self.label = self.data[:num_val, ::], self.label[:num_val, ::]
        elif partition == 'train':
            self.data, self.label = load_data('train')
            np.random.seed(1)
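            # NOTE: as above, the permutation result is discarded (no in-place shuffle).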
            np.random.permutation(self.data)
            num_val = int(len(self.data) * 0.2)
            self.data, self.label = self.data[num_val:, ::], self.label[num_val:, ::]
        else:
            self.data, self.label = load_data(partition)
        self.num_subsampled_points = int(self.num_points * alpha)  

    def __getitem__(self, item):
        ## sample twice to generate the source and target point cloud
        points = self.data[item]
        points = np.random.permutation(points)
        pointcloud1 = points[:self.num_points, :].T
        points = np.random.permutation(points)
        pointcloud2 = points[:self.num_points, :].T

        ## generate random transformation
        R_ab, translation_ab, euler_ab = random_Rt(np.pi / self.rot_factor, max_trans=0.5)
        pointcloud2 = np.matmul(R_ab, pointcloud2) + translation_ab[:, np.newaxis]  # (3,) -> (3, 1)
        pointcloud1 = np.random.permutation(pointcloud1.T).T
        pointcloud2 = np.random.permutation(pointcloud2.T).T

        ## jitter point cloud
        if self.gaussian_noise:
            pointcloud1 = jitter_pointcloud(pointcloud1)
            pointcloud2 = jitter_pointcloud(pointcloud2)

        ## DCP partial manner
        if self.FSP:
            pointcloud1, pointcloud2 = farthest_subsample_points(pointcloud1, pointcloud2,
                                                                num_subsampled_points=self.num_subsampled_points)
        pointcloud1 = pointcloud1.T
        pointcloud2 = pointcloud2.T  # [N, 3]

        ## return the data in the OMNet format
        rand_SE3 = np.concatenate((R_ab, translation_ab[:, None]), axis=1).astype(np.float32)  # [3,4]
        sample = {"points_src": pointcloud1, "points_ref": pointcloud2, "points_src_raw": points, "points_ref_raw": points, "transform_gt": rand_SE3, "pose_gt": se3.np_mat2quat(rand_SE3)}
        return sample

    def __len__(self):
        return len(self.data)

Thank you again for your patient reply!

hxwork commented 2 years ago

> Yes, I define my own dataloader. I checked it and there seems to be no bug in it. I would appreciate it if you could confirm this for me. (dataloader code quoted above)

Hi,

I am sorry, but I cannot find where the bug is located in your code. Maybe you can check by removing each part of your code, or each function it uses, such as farthest_subsample_points, one at a time. In addition, using FPS in __getitem__ may lead to slow training when the number of input points is large.
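Following that suggestion, here is a minimal sketch (my own, with hypothetical names; adapt it to your training loop) that times data loading and the training step separately, so a growing epoch time can be attributed to either the dataset/augmentation or the update step:

import time

def timed_epoch(dataloader, train_step):
    # Accumulate time spent producing batches vs. time spent in the train step.
    data_time = step_time = 0.0
    t0 = time.perf_counter()
    for batch in dataloader:
        t1 = time.perf_counter()
        data_time += t1 - t0  # __getitem__/collate cost (FPS, augmentation, ...)
        train_step(batch)     # forward/backward/parameter update
        t0 = time.perf_counter()
        step_time += t0 - t1
    print(f"data: {data_time:.1f}s, step: {step_time:.1f}s")

If data_time grows epoch over epoch, look at the dataset code; if step_time grows, look for state accumulating across iterations, e.g. tensors or losses kept alive in a Python list.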