Zhang-VISLab / Learning-to-Segment-3D-Point-Clouds-in-2D-Image-Space

MIT License

about training time and batch size #3

Closed Botranz closed 4 years ago

Botranz commented 4 years ago

Hi~ I found that the training time is quite long since the code only uses batch_size=1 in training. What is the approximate total training time with batch_size=1? And can we use a larger batch_size for training? If so, will the performance drop? Thanks a lot!

Zhang-VISLab commented 4 years ago

Hi Yecheng,

Can you reply to this message? Thx

Best wishes,

Ziming

On May 9, 2020, at 6:21 AM, Botranz notifications@github.com wrote:

Hi~ I found that the training time is quite long since the code only uses batch_size=1 in training. What is the approximate total training time with batch_size=1? And can we use a larger batch_size for training? If so, will the performance drop? Thanks a lot!


WangZhouTao commented 4 years ago

I have the same question and hope to get your reply.

YechengLyu commented 4 years ago

Hello Botranz and ZhouTao,

It took 15-20 hours for a complete training run on my machine (RTX 2080 Ti). Batched training is possible by setting batch_size to another integer at line 29. However, in my experiments batched training did not save training time; on the contrary, it took more time to prepare each batch in DataGenerator. As for performance, I got similar results on my customized metric "weighted_acc", but I did not verify the IoUs. Hope this information helps.

Yecheng
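
The batching change discussed above could be sketched roughly as follows. This is a hypothetical, numpy-only stand-in for the repository's actual DataGenerator, just to illustrate grouping point-cloud images into batches of a configurable size; the function name, shapes, and label format are illustrative assumptions, not the repo's API.

```python
import numpy as np

def make_batches(samples, labels, batch_size=1):
    """Yield (x, y) batches of the given size; the last batch may be smaller.

    Hypothetical sketch, not the repo's DataGenerator: it only groups
    already-projected point-cloud "images" into batched arrays.
    """
    n = len(samples)
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        yield np.stack(samples[start:end]), np.stack(labels[start:end])

# Toy usage: 10 fake 32x32x3 point-cloud images with per-pixel labels.
xs = [np.random.rand(32, 32, 3).astype(np.float32) for _ in range(10)]
ys = [np.random.randint(0, 4, size=(32, 32)) for _ in range(10)]
batches = list(make_batches(xs, ys, batch_size=4))
```

With batch_size=4 over 10 samples this yields three batches (4, 4, and 2 samples), matching the behavior one would expect from setting a larger batch_size as described.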

WangZhouTao commented 4 years ago

Thank you for your reply.

WangZhouTao commented 4 years ago

Hi~ My machine has a 7700K CPU and a 1080 Ti GPU, and one training epoch took 2.5 hours, so a complete training run would take more than 200 hours. During training, I found that GPU utilization and GPU memory usage were very low. Are there any special training tricks? Looking forward to your early reply.

YechengLyu commented 4 years ago

It is possible to speed up the training process. The bottleneck of the training scheme is the point-cloud-image generator: in the current scheme, a point-cloud image is generated for each sample in each epoch. However, if we pre-generate the images and save them as a dataset (.hdf5 or .npy), we can save that time during training. I tried this before by running the generator 5x and saving one very big .hdf5 file (200+ GB). The dataset generation took me 10 hours, but the training part then took no more than 30 hours. The accuracy dropped a bit, and I guess that was because we had less data augmentation.
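
The pre-generation idea above might look roughly like this. It is a hedged sketch only: `project_to_image` is a fake stand-in for the repository's real point-cloud-image generator, and all names, shapes, the augmentation count, and the use of .npy files instead of one .hdf5 file are illustrative assumptions.

```python
import os
import tempfile
import numpy as np

def project_to_image(points, size=32):
    # Stand-in for the real point-cloud-to-image projection: scatter each
    # point's z value onto a size x size grid indexed by its (x, y) coords.
    img = np.zeros((size, size), dtype=np.float32)
    ij = (points[:, :2] * (size - 1)).astype(int)  # assumes coords in [0, 1)
    img[ij[:, 0], ij[:, 1]] = points[:, 2]
    return img

out_dir = tempfile.mkdtemp()
clouds = [np.random.rand(128, 3).astype(np.float32) for _ in range(5)]

# Pre-generate N_AUG copies of the dataset up front (5x in the comment
# above) and save each pass as one .npy file.
N_AUG = 2
for a in range(N_AUG):
    # Real data augmentation (rotation, jitter, ...) would go here; omitted.
    imgs = np.stack([project_to_image(c) for c in clouds])
    np.save(os.path.join(out_dir, f"images_aug{a}.npy"), imgs)

# At training time, memory-map the saved arrays instead of re-running the
# projection for every sample in every epoch.
loaded = np.load(os.path.join(out_dir, "images_aug0.npy"), mmap_mode="r")
```

The trade-off is exactly the one described above: projection cost is paid once at dataset-generation time, at the price of disk space and of freezing the augmentations that were baked into the saved copies.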

faultaddr commented 4 years ago

On an Nvidia V100 (32 GB), one training epoch also took me 2.5 hours. We set the batch size to 32, but the training accuracy was only 0.71 (just accuracy, not IoU; the IoU would be much lower at test time), which is quite different from the results in the paper.

I then changed the batch size to 1; it is still running now. I will let you know the result when it finishes, but the training time is too long...

WangZhouTao commented 4 years ago

Thank you for sharing. I will follow your updates.

YechengLyu commented 4 years ago

Thank you for your reply. I will release the data pre-preparation code and the corresponding training code as soon as I get them organized and tested.

YechengLyu commented 4 years ago

Hello, I have released the code for training from a pre-prepared dataset, together with my reproduced models and training logs.

Zhang-VISLab commented 4 years ago

Dear all,

Is there still any issue with the code? Please let us know if so; otherwise we will close this issue. Thanks.

Zhang-VISLab commented 4 years ago

Dear all,

Thanks for your great comments. We are terribly sorry that we lost our CVPR 2020 code after submission. This repository is a reproduced work, and we have released a pre-trained network model with 88.0% instance-mean IoU and 86.5% class-mean IoU. An updated arXiv preprint is available. Could you please check our code to see whether any issues remain? If so, please open a new issue and we will resolve it as soon as possible. Thanks.