MehmetAygun / 4D-PLS

4D Panoptic Lidar Segmentation

single-frame 4DPLS model #9

Closed · ZixuanChen613 closed this issue 2 years ago

ZixuanChen613 commented 2 years ago

I would suggest this schedule (sketched in code below):

  1. Set previous_training_path as empty, config.pre_train as True, and config.learning_rate = 1e-2, and train a model for about 200 epochs, checking the segmentation accuracy on validation.
  2. Then set previous_training_path to whatever was saved from the previous training, set config.learning_rate = 1e-3 and config.pre_train as False, and fine-tune for about 800 epochs, or validate once in a while with the saved models to decide when to stop training.
  3. After this is done, you can use the model for testing.

The model that I shared is for testing, not for training. It is fully trained. You can take the model that I shared and do validation/testing without any training.

Originally posted by @MehmetAygun in https://github.com/MehmetAygun/4D-PLS/issues/7#issuecomment-892703445
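
For concreteness, a minimal sketch of the two stages as config settings; `config` stands for the Config object in train_SemanticKitti.py, and the stage-1 log folder name is hypothetical:

```python
# Stage 1: pre-training from scratch (~200 epochs), per the schedule above.
previous_training_path = ''      # no checkpoint: start fresh
config.pre_train = True
config.learning_rate = 1e-2

# Stage 2: instance fine-tuning from the stage-1 checkpoint (~800 epochs).
previous_training_path = 'Log_2021-01-01_00-00-00'  # hypothetical stage-1 log folder
config.pre_train = False
config.learning_rate = 1e-3
```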

ZixuanChen613 commented 2 years ago

Hey, I have a question about the single-frame 4DPLS model. Is the model you shared just for multi-frame (4 scans)? If I want to get a model for single frames, or do validation/testing with single frames, do I need to retrain the model with this schedule?

MehmetAygun commented 2 years ago

Hey, yeah, the same schedule should work. I've never tried it, but you can take the model trained in the multi-frame setting and test it in the single-frame setting, and it should obtain good performance. But the best way would be to train with single frames and test with single frames.

ZixuanChen613 commented 2 years ago

Thank you for your quick reply. So you mean that you didn't train a single-frame 4DPLS model? You just used the model you shared to evaluate the single-frame, 2-scan, and 4-scan settings?

MehmetAygun commented 2 years ago

No, actually I've trained single-frame models, and the result in Table 3 of our paper is obtained with such a model (trained with a single frame and tested with a single frame).

Unfortunately, I don't have those pre-trained models to share, but with the training schedule that I shared, it should be possible to get a model with similar performance.

ZixuanChen613 commented 2 years ago

So that means, when I run test_models.py to test a single frame, I need to set config.n_frames = 1 and config.n_test_frames = 1, right?

MehmetAygun commented 2 years ago

Yes
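
For reference, the two flags from the question above, as confirmed (a sketch; attribute names as used in this thread):

```python
# Single-frame testing in test_models.py, per the answer above.
config.n_frames = 1        # the model sees one scan at a time
config.n_test_frames = 1   # evaluation also runs on single scans
```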

ZixuanChen613 commented 2 years ago

I have followed the schedule and obtained a trained single-frame model. But in the test phase, I found that I can't get any predicted instance IDs: the sorted values here (https://github.com/MehmetAygun/4D-PLS/blob/4d6985260deae6bb52e99af34111fca1089e4168/models/architectures.py#L606) are around 0.3, smaller than the 0.7 threshold, so no points get associated. Do you know what causes this problem? Maybe the trained single-frame model is not good? Or is there something I need to modify?
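
To illustrate the failure mode being described (a sketch, not the repo's exact code): association only keeps points whose predicted centerness clears the threshold, so if every score sits around 0.3 the candidate set is empty.

```python
import torch

def candidate_centers(center_scores: torch.Tensor, threshold: float = 0.7):
    """center_scores: (N,) per-point instance-center probabilities."""
    sorted_scores, order = torch.sort(center_scores, descending=True)
    keep = sorted_scores > threshold   # scores stuck near 0.3 leave nothing here
    return order[keep]                 # indices of usable center candidates
```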

MehmetAygun commented 2 years ago

Hi,

The sorted array contains centerness scores (instance-center probabilities); normally there should be a lot of centers with a probability greater than 0.5. My guess is that the pre-training stage might have been shorter, which leads to fewer centers.

What is the centerness loss (L_C) during pre-training/training, and is it optimizing correctly? https://github.com/MehmetAygun/4D-PLS/blob/ea349c0e0989bb4838544e397f821352a0770eeb/utils/trainer.py#L244

You can also visualize the centerness scores of training samples or test samples to check if there is a problem in the data.
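
One quick way to do that check might look like this (a sketch; `centers_output` is a stand-in name for the model's per-point center probabilities):

```python
import matplotlib.pyplot as plt

scores = centers_output.detach().cpu().numpy().ravel()  # flatten to (N,)
plt.hist(scores, bins=50, range=(0.0, 1.0))
plt.axvline(0.7, color='r', linestyle='--', label='association threshold')
plt.xlabel('centerness score')
plt.ylabel('#points')
plt.legend()
plt.show()
```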

ZixuanChen613 commented 2 years ago

Thanks. I will check the centerness scores during training.

ZixuanChen613 commented 2 years ago

I trained the single-frame model by setting config.n_frames = 1 and config.n_test_frames = 1 in train_SemanticKitti.py, but I still get no instances when testing.

This is what training showed:

```
e1272-i0291 => L=0.821 L_C=0.159 L_I=1.299 L_V=0.067 L_VL2=2.885 acc= 78% / t(ms): 895.0 52.6 89.6)
e1272-i0294 => L=0.384 L_C=0.031 L_I=0.442 L_V=0.010 L_VL2=3.348 acc= 89% / t(ms): 1910.5 51.7 88.9)
e1272-i0301 => L=1.191 L_C=0.527 L_I=2.030 L_V=0.130 L_VL2=2.884 acc= 83% / t(ms): 915.2 53.6 87.9)
e1272-i0302 => L=0.571 L_C=0.073 L_I=0.253 L_V=0.039 L_VL2=2.787 acc= 86% / t(ms): 975.3 52.7 87.1)
e1272-i0304 => L=0.532 L_C=0.039 L_I=0.322 L_V=0.023 L_VL2=3.324 acc= 85% / t(ms): 2150.5 51.1 86.9)
e1272-i0312 => L=0.775 L_C=0.174 L_I=0.362 L_V=0.046 L_VL2=2.394 acc= 80% / t(ms): 927.3 49.5 85.1)
e1272-i0314 => L=0.675 L_C=0.249 L_I=0.362 L_V=0.047 L_VL2=3.034 acc= 87% / t(ms): 1856.7 48.2 83.6)
e1272-i0322 => L=0.671 L_C=0.081 L_I=0.599 L_V=0.038 L_VL2=3.442 acc= 83% / t(ms): 800.8 49.5 83.3)
e1272-i0324 => L=0.932 L_C=0.539 L_I=0.837 L_V=0.057 L_VL2=2.697 acc= 87% / t(ms): 1995.4 52.0 84.5)
e1272-i0331 => L=0.794 L_C=0.415 L_I=0.815 L_V=0.070 L_VL2=2.809 acc= 87% / t(ms): 955.8 52.0 87.3)
e1272-i0334 => L=0.546 L_C=0.081 L_I=1.097 L_V=0.014 L_VL2=3.895 acc= 85% / t(ms): 1784.9 50.1 85.2)
e1272-i0337 => L=0.697 L_C=0.160 L_I=0.681 L_V=0.025 L_VL2=2.544 acc= 80% / t(ms): 1351.9 49.1 85.9)
```

And I found that the sorted centerness scores are between 0.2 and 0.4 when testing a single frame with the single-frame-trained model, all below the 0.7 threshold, with the values clustered tightly together; whereas the sorted centerness scores are between 0.04 and 0.95 when testing a single frame with the multi-frame-trained model you shared, with a much more spread-out distribution.

What does this mean? Do I need more pre-training epochs, or more training? It takes around one week in Colab. :(

MehmetAygun commented 2 years ago

Hi,

I don't think you need to do more pre-training; normally, centerness scores should be very easy to learn, and around 600-800 epochs should be fine for pre-training.

You should first set config.pre_train = True to do pre-training, then change it to False and set previous_training_path to the pre-trained model path. Did you do this during pre-training/post-training?

Also, what is the GPU memory size in Colab? As the input point cloud is large, KPConv does random sampling based on GPU size, and if you have a small GPU memory (<24GB), the input becomes too sparse and the model performs poorly.

ZixuanChen613 commented 2 years ago

Hi, yes, I followed the pre-training/training procedure. For pre-training, I set config.pre_train = True and config.learning_rate = 1e-2. After around 200 epochs, I set config.pre_train = False and config.learning_rate = 1e-3 and added previous_training_path.

The GPU memory size is greater than 24GB; I used Colab Pro+. But it can only run for 24h, then the session breaks and I need to set previous_training_path to continue training the model. The resources Colab assigns may be different each time; could that cause problems?

MehmetAygun commented 2 years ago

Can you try longer pre-training, around 400-600 epochs?

ZixuanChen613 commented 2 years ago

[screenshot: single-scan training]

I found lr = 0.0001 instead of 0.001 (1e-3) in the parameters.txt file of the folder "Log_2020-10-06_16-51-05", the fully trained model you shared. Is something wrong there?

After pre-training with lr=0.01, I trained the model with lr=0.0001, and it converges better (mIoU) than with lr=0.001. But the center head still didn't train well; I still get probabilities smaller than 0.5. Should I use lr=0.001 during pre-training instead, or not? I am confused that I can get a well-trained mIoU but not a well-trained center head.

MehmetAygun commented 2 years ago

For pre-training, you should start with lr=0.01, and also adjust the max_epoch number, as it sets the learning rate schedule: https://github.com/MehmetAygun/4D-PLS/blob/4d6985260deae6bb52e99af34111fca1089e4168/train_SemanticKitti.py#L163

After pre-training is done, you should be able to get good mIoU and center predictions; if they are not good, the instance training stage will not converge. I am not sure about the learning rate for the later stage; if the file shows 1e-4, then it should be like that, but that value was tuned for the multi-frame training setting, so I cannot say which one would work best for the single-frame setting.

ZixuanChen613 commented 2 years ago

Thanks for your reply. So what does max_epoch mean in the pre-training stage? If I want to train for 600 epochs in the pre-training stage, I need to set max_epoch >= 600, right? And after pre-training, when I want to train with the instance loss, do I also need to change max_epoch?

MehmetAygun commented 2 years ago

max_epoch adjusts the learning rate schedule. If you are planning to pre-train for 600 epochs, set it to 600, and for the instance-loss phase, again set it to whatever number of epochs you are planning to use. Otherwise, the network will be trained only with high learning rates, and this might lead to poor convergence.
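
To make the effect concrete, a sketch of the schedule logic (the pattern follows the lr_decays line quoted later in this thread; treat the details as assumptions):

```python
max_epoch = 600
learning_rate = 1e-2

# One multiplicative decay per epoch; 0.1 ** (1 / max_epoch) per step
# compounds to a total 10x decay over the full run.
lr_decays = {i: 0.1 ** (1 / max_epoch) for i in range(1, max_epoch)}

lr = learning_rate
for epoch in range(1, max_epoch):
    lr *= lr_decays[epoch]
print(lr)  # ~1e-3: one order of magnitude below the initial rate
```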

ZixuanChen613 commented 2 years ago

And is the value i in line 163 related to the i-th epoch during training? Because of the runtime limit in Colab, I first need to run epochs 0-200, then restart Colab and continue training epochs 200-400. I don't know whether this will affect model training.

Besides, if I want to do 600 epochs in the pre-training stage and 600 epochs in the instance-loss phase, should I set max_epoch = 600 or 1200 in the instance-loss stage?

MehmetAygun commented 2 years ago

It should be 600. The value in L163 is not related to the i-th epoch, as it is the same for all epochs. It shouldn't be a problem to continue after 200 epochs, as the optimizer state is recovered if you give a trained model path. Just be sure that the code goes into this if-clause: https://github.com/MehmetAygun/4D-PLS/blob/4d6985260deae6bb52e99af34111fca1089e4168/utils/trainer.py#L133
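
For anyone unsure what "recovered" means here, resuming typically looks something like the following (a sketch in the style of KPConv-based trainers; chkp_path, net, optimizer, and the checkpoint key names are stand-ins/assumptions):

```python
import torch

# Restore model and optimizer state so training continues where it stopped.
checkpoint = torch.load(chkp_path)   # checkpoint file saved by the interrupted run
net.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch'] + 1      # keep the epoch count (and lr schedule) aligned
```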

ZixuanChen613 commented 2 years ago

Thanks. And should I modify `lr_decays = {i: 0.1 ** (1 / 600) for i in range(1, max_epoch)}`, i.e., change 200 to 600?

MehmetAygun commented 2 years ago

Yes.

ZixuanChen613 commented 2 years ago

Hi, sorry, it's still me. :) I have tried training for 600 epochs in the pre-training stage, and val_IoUs.txt looks as follows:


```
0.558 0.000 0.000 0.037 0.001 0.000 0.011 0.000 0.690 0.035 0.479 0.000 0.630 0.296 0.718 0.371 0.504 0.003 0.000
0.544 0.000 0.000 0.019 0.000 0.000 0.005 0.000 0.704 0.028 0.489 0.000 0.655 0.285 0.727 0.374 0.505 0.234 0.000
0.644 0.000 0.000 0.121 0.000 0.000 0.003 0.000 0.655 0.020 0.445 0.000 0.693 0.308 0.740 0.395 0.392 0.292 0.000
0.702 0.000 0.000 0.115 0.000 0.000 0.003 0.008 0.670 0.037 0.454 0.000 0.708 0.310 0.753 0.392 0.429 0.336 0.006
0.734 0.000 0.004 0.097 0.015 0.002 0.023 0.005 0.687 0.030 0.464 0.000 0.716 0.299 0.751 0.416 0.427 0.357 0.005
0.748 0.000 0.003 0.076 0.014 0.002 0.020 0.004 0.701 0.050 0.485 0.000 0.733 0.293 0.759 0.421 0.471 0.364 0.005
0.762 0.000 0.024 0.076 0.012 0.047 0.025 0.006 0.702 0.051 0.492 0.000 0.719 0.292 0.761 0.420 0.486 0.357 0.004
0.761 0.000 0.025 0.086 0.011 0.068 0.054 0.008 0.715 0.047 0.507 0.000 0.727 0.294 0.766 0.427 0.512 0.381 0.003
0.777 0.000 0.024 0.085 0.021 0.080 0.081 0.006 0.723 0.072 0.509 0.000 0.726 0.309 0.770 0.434 0.530 0.389 0.010
0.791 0.000 0.034 0.095 0.020 0.091 0.079 0.005 0.731 0.079 0.515 0.000 0.736 0.332 0.775 0.437 0.531 0.407 0.039
0.801 0.001 0.049 0.095 0.021 0.129 0.099 0.005 0.737 0.075 0.524 0.000 0.742 0.334 0.776 0.444 0.547 0.418 0.039
0.811 0.001 0.053 0.116 0.020 0.128 0.101 0.005 0.747 0.091 0.537 0.000 0.750 0.343 0.779 0.448 0.560 0.420 0.046
0.822 0.002 0.054 0.120 0.030 0.155 0.108 0.004 0.752 0.099 0.545 0.000 0.753 0.345 0.782 0.453 0.563 0.430 0.078
```

and

```
SemanticKitti : subpart mIoU = 41.0 %
SemanticKitti : val mIoU = 32.1 %
SemanticKitti : val center mIoU = 67.0 %
SemanticKitti : val centers sum = 4315.3 %
```

I found that max_in_points = 5491 and max_val_points = 56240 are smaller than for the pre-trained 4-scan model. What do these two parameters mean? Do I need more epochs in the pre-training stage?

MehmetAygun commented 2 years ago

Hi,

I would say that these numbers are a little low compared to my pre-trainings (if I remember correctly, val mIoU should be around the mid-40s and center mIoU around the 80s), which is probably due to max_in_points.

As I said before, depending on the GPU memory size, the KPConv backbone does random sampling and reduces the number of input points to fit memory usage. If you use fewer points, the method's performance decreases. I was using 48GB GPUs to train these models, and the training script adjusts these numbers automatically. If you want to adjust these numbers, I would suggest looking at the original KPConv repo.
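
A minimal sketch of that capping behaviour (illustrative only; the actual calibration in KPConv is more involved and happens automatically):

```python
import numpy as np

def cap_points(points: np.ndarray, max_in_points: int) -> np.ndarray:
    """Randomly keep at most max_in_points rows of an (N, 3) point cloud."""
    n = points.shape[0]
    if n <= max_in_points:
        return points
    keep = np.random.choice(n, max_in_points, replace=False)  # random subset -> sparser input
    return points[keep]
```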