PRBonn / Mask4D

Mask4D: End-to-End Mask-Based 4D Panoptic Segmentation for LiDAR Sequences, RA-L, 2023
MIT License

training issues #2

Closed · scw0819 closed this 7 months ago

scw0819 commented 9 months ago

Thank you for your open-source work. I have some questions about the training part of the code. When I use the training script train_model.py, the trainer is built in this part:

```python
trainer = Trainer(
    gpus=cfg.TRAIN.N_GPUS,
    accelerator="ddp",
    logger=tb_logger,
    max_epochs=cfg.TRAIN.MAX_EPOCH,
    callbacks=callbacks,
    log_every_n_steps=1,
    gradient_clip_val=0.5,
    accumulate_grad_batches=cfg.TRAIN.BATCH_ACC,
    resume_from_checkpoint=ckpt,
)
```

It seems that the accelerator="ddp" option does not exist in pytorch_lightning, only the strategy="ddp" option. I tried to make some code modifications, but ultimately did not solve the problem. Also, when I train on an RTX 4090, I seem to run out of GPU memory. By the way, what type of graphics card do you use and how much memory does it have?

rmarcuzzi commented 9 months ago

Hi! I think that depends on the PyTorch Lightning version that you have. In version 2.0.8 you can use the option accelerator="ddp", but in newer versions this option has been removed. You can try setting strategy="ddp" or finding a setup that works with your particular GPU.
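For reference, newer PyTorch Lightning (2.x) expresses the same setup with accelerator/devices/strategy, and the checkpoint path moves from the Trainer constructor to fit(). A minimal sketch, reusing the config values from the train_model.py snippet above (not a drop-in replacement, just the mapping of arguments):

```python
from pytorch_lightning import Trainer

# Sketch for PyTorch Lightning 2.x: `accelerator` now names the hardware
# backend, `devices` replaces `gpus`, `strategy` selects DDP, and
# `resume_from_checkpoint` was removed in favour of `fit(..., ckpt_path=...)`.
trainer = Trainer(
    accelerator="gpu",
    devices=cfg.TRAIN.N_GPUS,
    strategy="ddp",
    logger=tb_logger,
    max_epochs=cfg.TRAIN.MAX_EPOCH,
    callbacks=callbacks,
    log_every_n_steps=1,
    gradient_clip_val=0.5,
    accumulate_grad_batches=cfg.TRAIN.BATCH_ACC,
)
# `model` (and any datamodule) are whatever train_model.py already builds.
trainer.fit(model, ckpt_path=ckpt)
```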

The GPU I used is an NVIDIA RTX A6000 with 48 GB of memory. If you want to train on a smaller GPU, you can subsample the point cloud more aggressively here or decrease the number of scans used as input to 2 here.
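If it helps, these two knobs show up later in this thread as the N_SCANS and SUB_NUM_POINTS config entries. A hedged sketch of overriding them for a smaller GPU; where exactly they sit in the config is an assumption, so check the repo's YAML files:

```python
# Hypothetical override for a smaller GPU. The config section (here TRAIN)
# and the exact values are assumptions -- locate N_SCANS and SUB_NUM_POINTS
# in the repo's config and tune them to your memory budget.
cfg.TRAIN.N_SCANS = 2             # fewer consecutive scans per training sample
cfg.TRAIN.SUB_NUM_POINTS = 60000  # keep fewer points per subsampled cloud
```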

I hope this helps!

scw0819 commented 6 months ago

Sorry to bother you again, I've been too busy.

I carefully checked the documentation for each version of PyTorch Lightning. In version 2.0.8 here, accelerator='ddp' does not exist, but this option does exist in versions below 1.5 here. Could you help me clarify this? Am I wrong?

I tried two configurations. When I downgraded to a version where the accelerator='ddp' option exists, I also had to downgrade PyTorch accordingly, which caused bigger compatibility problems with the model. When I use accelerator='cuda', devices=4, strategy='ddp' in version 2.0.8, another error occurs: RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`. This also confuses me, and it seems that the modification will have an impact on the distributed training of the model.

One last problem: once I had fixed all the bugs and could train the model, since each of my four RTX 4090s only has 24 GB of memory, I had to decrease the number of scans used as input to 2 here and subsample the point cloud to fewer points here. This reduces the final accuracy by about 8% compared to the results in your paper. Is this normal? By the way, how much memory does training take on the RTX A6000? This will be crucial for me when choosing other graphics cards in the future.

Looking forward to your answer, and thank you! I apologize again for not replying to you sooner!

rmarcuzzi commented 6 months ago

Hi again and thanks for your interest in our work!

Depending on the PyTorch Lightning version you have, you might have to set a different value for accelerator, like accelerator='ddp' or accelerator='gpu'. The runtime error that you get shouldn't impact the performance of the model.
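For reference, the error message itself names the workaround; a minimal sketch for PyTorch Lightning 2.x (the remaining Trainer arguments stay as in train_model.py):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Option 1: the string shortcut.
trainer = Trainer(accelerator="gpu", devices=4,
                  strategy="ddp_find_unused_parameters_true")

# Option 2: the explicit strategy object, which also exposes other DDP options.
trainer = Trainer(accelerator="gpu", devices=4,
                  strategy=DDPStrategy(find_unused_parameters=True))
```

Enabling find_unused_parameters only changes how DDP handles parameters that receive no gradient in a given step; it adds some per-step overhead but does not alter the optimization itself.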

Regarding your training, keep in mind that you should use the weights of MaskPLS. Apart from that, decreasing the number of scans and subsampling more aggressively allows you to train, but leads to a decrease in performance. I think the main reason is that, at every training step, you're performing fewer tracking steps, and therefore the queries (which encode the appearance of the instances) only have to track the instances for a short time.

Because of the sequential nature of the training, it's sadly hard to fully leverage multiple GPUs.

This model is quite big, and running the training sequentially also makes it extra "expensive" to train. I remember we were using around 40 GB of memory during training.

I hope this clarifies your concerns and feel free to reach out if you have further doubts!

scw0819 commented 5 months ago

Thank you for your reply!

I set up the environment on an RTX A40 and successfully ran the code with N_SCANS=3, SUB_NUM_POINTS=80000, Batch_Size=8, hoping this would speed up my training. The training time for one epoch is less than an hour, and memory usage reaches more than 40 GB, which seems to be no problem. But after completing 50 epochs, the results were not even as good as with N_SCANS=2 and SUB_NUM_POINTS=70000, only about 40%. After several experiments, I found that this may be related to the batch size: when it is set to 1, the accuracy is relatively normal. What is the reason? If I increase the number of epochs, do you think the final result can reach the desired accuracy?

For some reasons, I changed the backbone network, which also means I cannot use maskpls.ckpt. Do I need to train for a full 50 epochs to get the best accuracy? I still have some problems with multi-GPU training; is there any way to shorten the training time?

Looking forward to your answers!

rmarcuzzi commented 5 months ago

Hi! In the dataloader, the batch size for training is fixed to 1 here; the batch size in the config is only used for validation. What you can change is the number of frames that are sampled, N_SCANS, but the forward passes will still run individually. To emulate a bigger batch size you can set BATCH_ACC to a larger value: the gradients are then accumulated over several batches and only after that does the optimization step take place.

If the weights of the checkpoint don't match the weights of the network, you'll have to retrain it from scratch. In my experiments, some models reached their best performance earlier than epoch 50, but I don't have specific numbers.

Because the validation is done sequentially, you can only use a single GPU for it. You could run only the training (disabling the validation step) on multiple GPUs, save a checkpoint after each epoch, and then evaluate the checkpoints separately, although this might be a bit annoying. Sadly the training time is quite long and I couldn't shorten it either.
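To make the last two suggestions concrete, here is a hedged sketch of a training-only multi-GPU run that accumulates gradients and saves a checkpoint every epoch for later evaluation. The ModelCheckpoint and Trainer arguments are standard PyTorch Lightning, but the cfg keys and the overall wiring are assumptions based on this thread:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep every per-epoch checkpoint so each one can be evaluated separately later.
ckpt_cb = ModelCheckpoint(every_n_epochs=1, save_top_k=-1,
                          filename="mask4d_{epoch:02d}")

trainer = Trainer(
    accelerator="gpu",
    devices=cfg.TRAIN.N_GPUS,                     # multiple GPUs for training only
    strategy="ddp_find_unused_parameters_true",   # see the unused-parameters error above
    max_epochs=cfg.TRAIN.MAX_EPOCH,
    accumulate_grad_batches=cfg.TRAIN.BATCH_ACC,  # emulate a larger batch size
    limit_val_batches=0,                          # skip the sequential validation loop
    num_sanity_val_steps=0,
    callbacks=[ckpt_cb],
)
trainer.fit(model)

# Afterwards, evaluate each saved checkpoint on a single GPU, e.g.:
# Trainer(accelerator="gpu", devices=1).validate(model, ckpt_path="<path to epoch checkpoint>")
```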

I hope my answer is useful!