Closed: XiangLiu0731 closed this issue 2 years ago.
It looks like you've already run the code successfully. But did you have a problem with "import nori2 as nori"? I can't install it, and it isn't available on pip. Looking forward to your reply.
Yes, I trained the net on the ModelNet40 dataset, and it does not need the nori2 package, so I just commented out "import nori2 as nori" (I also failed to install nori2).
Ok, thanks for your reply!
Sir, have you dealt with the zip problem?
Can you please share your solutions?
I am sorry, I forgot to delete this line. nori2 is not required in this project, so you can just remove it. I will fix this bug.
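For anyone who hits the same error before the fix lands, a minimal workaround is to make the import optional rather than deleting it. This is just a sketch, assuming (as stated above) that no code path actually uses nori:

    try:
        import nori2 as nori  # not available on pip; not required for ModelNet40 training
    except ImportError:
        nori = None  # fine as long as nothing actually calls into nori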
Thank you for sharing your code! When I retrain the network, I find that each epoch takes longer and longer to train, and I'm not sure whether that's normal.
I am not sure what batch size you are using. In our paper we use batch_size=64, and it takes only about 1 minute to train an epoch.
Ok, thanks for your reply!
Sorry, I haven't dealt with the zip problem.
I trained the net on an NVIDIA GeForce RTX 2080 SUPER GPU with batch_size=8. The first epoch takes about 9 minutes, the second takes about 15 minutes, and each later epoch takes longer still. I suspect there is a memory-leak bug in the code, but I am not familiar with MegEngine and could not find the bug myself. Have you encountered similar problems during training?
Hi,
I have tried to reproduce the problem you encountered, but the training process runs fine for me; each epoch takes approximately 3 minutes. Besides, I notice that your dataset is not the same as the one I open-sourced, since with batch_size=8 an epoch consists of 524 iterations. Therefore, I suspect there may be some bug in your dataloader.
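To narrow down where the extra time goes, one option is to time data loading and the training step separately within an epoch. This is only a sketch; profile_epoch, loader, and step_fn are hypothetical names, not part of this repository:

    import time

    def profile_epoch(loader, step_fn):
        # Accumulate time spent waiting on the dataloader vs. time spent in the training step.
        data_time, step_time = 0.0, 0.0
        end = time.perf_counter()
        for batch in loader:
            data_time += time.perf_counter() - end   # waiting for the next batch
            start = time.perf_counter()
            step_fn(batch)                           # forward/backward/parameter update
            step_time += time.perf_counter() - start
            end = time.perf_counter()
        print(f"data: {data_time:.1f}s  step: {step_time:.1f}s")

If data_time grows from epoch to epoch while step_time stays flat, the slowdown is in the dataloader; the reverse points at the training loop itself.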
Yes, I defined my own dataloader. I checked it and it seems there is no bug in it, but I would appreciate it if you could confirm that for me. The code of my dataloader is as follows:
import numpy as np
# Dataset, load_data, random_Rt, jitter_pointcloud, farthest_subsample_points,
# and se3 are imported/defined elsewhere in my project.

class MyModelNetloader(Dataset):
    def __init__(self, num_points=1024, partition='train', gaussian_noise=False, alpha=0.75, factor=4, FSP=False):
        super(MyModelNetloader, self).__init__()
        self.num_points = num_points
        self.partition = partition
        self.gaussian_noise = gaussian_noise
        self.rot_factor = factor
        self.FSP = FSP
        ## load data for training/validation/test
        if partition == 'val':
            self.data, self.label = load_data('train')  # returns ModelNet data [..., 2048, 3] and labels [..., 1]
            np.random.seed(1)
            np.random.permutation(self.data)  # note: np.random.permutation returns a shuffled copy; the result is not kept here
            num_val = int(len(self.data) * 0.2)
            self.data, self.label = self.data[:num_val, ::], self.label[:num_val, ::]
        elif partition == 'train':
            self.data, self.label = load_data('train')
            np.random.seed(1)
            np.random.permutation(self.data)  # note: same as above, the shuffled copy is discarded
            num_val = int(len(self.data) * 0.2)
            self.data, self.label = self.data[num_val:, ::], self.label[num_val:, ::]
        else:
            self.data, self.label = load_data(partition)
        self.num_subsampled_points = int(self.num_points * alpha)

    def __getitem__(self, item):
        ## sample twice to generate the source and target point clouds
        points = self.data[item]
        points = np.random.permutation(points)
        pointcloud1 = points[:self.num_points, :].T
        points = np.random.permutation(points)
        pointcloud2 = points[:self.num_points, :].T
        ## generate a random transformation
        R_ab, translation_ab, euler_ab = random_Rt(np.pi / self.rot_factor, max_trans=0.5)
        pointcloud2 = np.matmul(R_ab, pointcloud2) + translation_ab[:, np.newaxis]  # (3,) -> (3, 1)
        pointcloud1 = np.random.permutation(pointcloud1.T).T
        pointcloud2 = np.random.permutation(pointcloud2.T).T
        ## jitter the point clouds
        if self.gaussian_noise:
            pointcloud1 = jitter_pointcloud(pointcloud1)
            pointcloud2 = jitter_pointcloud(pointcloud2)
        ## DCP-style partial overlap
        if self.FSP:
            pointcloud1, pointcloud2 = farthest_subsample_points(
                pointcloud1, pointcloud2, num_subsampled_points=self.num_subsampled_points)
        pointcloud1 = pointcloud1.T
        pointcloud2 = pointcloud2.T  # [N, 3]
        ## return the data in the OMNet format
        rand_SE3 = np.concatenate((R_ab, translation_ab[:, None]), axis=1).astype(np.float32)  # [3, 4]
        sample = {"points_src": pointcloud1, "points_ref": pointcloud2,
                  "points_src_raw": points, "points_ref_raw": points,
                  "transform_gt": rand_SE3, "pose_gt": se3.np_mat2quat(rand_SE3)}
        return sample

    def __len__(self):
        return len(self.data)
Thank you again for your patient answer!
Hi,
I am sorry that I cannot find where the bug is in your code. Maybe you can check by removing each part of your code, or each function it uses (such as farthest_subsample_points), one at a time. In addition, using FPS in __getitem__ may lead to low training speed when the number of input points is large.
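Following that suggestion, a quick way to measure the cost of __getitem__ in isolation, using the dataloader posted above (the constructor arguments are just an assumed configuration, and the helper functions from that snippet must be available):

    import time

    ds = MyModelNetloader(num_points=1024, partition='train', FSP=True)
    t0 = time.perf_counter()
    for i in range(100):
        _ = ds[i]  # exercises the random sampling, random transform, and FPS subsampling
    avg_ms = (time.perf_counter() - t0) / 100 * 1e3
    print(f"average __getitem__ time: {avg_ms:.2f} ms")

Running the same loop with FSP=False shows how much of the per-sample time farthest_subsample_points accounts for.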