PRBonn / LiDiff

[CVPR'24] Scaling Diffusion Models to Real-World 3D LiDAR Scene Completion

Training with custom dataset. #14

Closed. Yacovitch closed this issue 2 months ago.

Yacovitch commented 5 months ago

Hi!

I am training and testing this network with my custom dataset. I was able to modify the code for training, but my results look bizarre; have you encountered this issue? Also, the training loss decreases but gets stuck at around 0.9 after a few epochs. Are there any configurations that I am missing?

Here are the input, ground truth, and result, respectively: [image] [image] [image]

nuneslu commented 5 months ago

Hi! First, regarding the loss getting stuck, I would say that it is normal. When training with SemanticKITTI I noticed the same, but even with the loss "stuck" the results kept improving. Regarding the results, are those from the diffusion or from the refinement network? Because it looks quite dense already.

Can you also save the point cloud with the noise added? In the pipeline code it should be x_feats from the “complete_scan” method.
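For reference, a minimal way to dump that noisy point cloud for inspection could look like this (a sketch assuming x_feats ends up as an N×3 array/tensor of xyz coordinates; in the pipeline it may be wrapped differently, so treat this as illustrative):

```python
import numpy as np
import open3d as o3d

def save_noisy_points(x_feats, path='noisy_input.ply'):
    # x_feats is assumed to hold (N, 3) xyz coordinates of the noised input
    pts = np.asarray(x_feats.detach().cpu()) if hasattr(x_feats, 'detach') else np.asarray(x_feats)
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(pts[:, :3].astype(np.float64))
    o3d.io.write_point_cloud(path, pcd)
```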

I would suggest checking the results of the previous checkpoints as well, and you can also try varying the conditioning weight "uncond_w" in the config.
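For context on what that weight does: during sampling the conditioned and unconditioned noise predictions are blended in the usual classifier-free-guidance fashion, roughly like this (a generic sketch; the variable names and exact parametrization in the repo may differ):

```python
def guided_eps(eps_uncond, eps_cond, w):
    # Classifier-free guidance: w = 0 ignores the condition entirely,
    # w = 1 is the plain conditional prediction, and larger w pushes
    # the sample to follow the conditioning scan more aggressively.
    return eps_uncond + w * (eps_cond - eps_uncond)
```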

Yacovitch commented 5 months ago

Hi, thank you for your reply! This is just the diffusion result, because I want to ensure the diffusion part works first; then I would like to move on to modifying the refinement network.

I extracted x_feats from "complete_scan" and got this: [image]

Also, I am attaching results from epochs 4, 9, 14, and 19. Your insight on the results would be very much appreciated! It seems that as training progresses, the results get worse. Is it because of overfitting, or not enough data? [image]

nuneslu commented 5 months ago

To check whether it is overfitting, I suggest giving one of the training point clouds as input. Also, you can try changing the conditioning weight in the completion pipeline, i.e., the -s parameter. Another thing you can try is giving a sparser point cloud as input: by default we sample 18,000 points as input, and you can try reducing that by half in the completion pipeline.

Yacovitch commented 4 months ago

Is there any possibility that I do not have enough epochs? I tested on the training set, and it gave me similar results. Or do I not have enough training data? I made a slight modification to the code so that it now uses random sampling instead of farthest point sampling, and each epoch takes around 30 minutes to complete.

nuneslu commented 4 months ago

It could be that your dataset is too small. How many samples do you have? For me to get a better view of it, could you share the results using the conditioning parameter -s set to 2 and to 12?

The random sampling should work in your case. For LiDAR scans we stick to farthest point sampling because the point cloud has too many points in the center area, so random sampling would ignore the farther regions in the conditioning point cloud.
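As a rough illustration of the difference between the two (a naive numpy sketch, not the repo's actual sampling code; the FPS here is the simple O(N·M) greedy version):

```python
import numpy as np

def random_sample(points, m):
    # Uniform random subset: cheap, but dense near-sensor regions dominate
    idx = np.random.choice(len(points), m, replace=False)
    return points[idx]

def farthest_point_sample(points, m):
    # Greedy FPS: spreads samples over the whole scan, keeping far regions
    idx = [np.random.randint(len(points))]
    dist = np.full(len(points), np.inf)
    for _ in range(m - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[idx[-1]], axis=1))
        idx.append(int(dist.argmax()))
    return points[np.array(idx)]
```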

Yacovitch commented 4 months ago

I have 2,794 samples in total. Here are the results from the two inputs: [image]

nuneslu commented 4 months ago

Can you share the config file used for training and the command you used to generate those results? I want to check the hyperparameters used for training and during inference. From the images it is hard to tell, especially because it is quite hard to see the input point cloud.

Yacovitch commented 4 months ago

Sorry, I think this shows a better view of the input point clouds: [image]

This is the configuration for training:

experiment:
  id: sensat_radius_9000_input

# Data
data:
  data_dir: './Datasets/Sensat_0.200_radius_12'
  resolution: 0.2
  dataloader: 'Sensat'
  split: 'train'
  train: [ 'birmingham_block_0', 'birmingham_block_12', 'birmingham_block_10', 'birmingham_block_13', 'birmingham_block_11', 'birmingham_block_4', 'birmingham_block_3', 'birmingham_block_9', 'birmingham_block_6', 'birmingham_block_7', 'cambridge_block_12', 'cambridge_block_17', 'cambridge_block_18', 'cambridge_block_19', 'cambridge_block_14', 'cambridge_block_2', 'cambridge_block_23', 'cambridge_block_20', 'cambridge_block_21', 'cambridge_block_25', 'cambridge_block_26', 'cambridge_block_28', 'cambridge_block_3', 'cambridge_block_32', 'cambridge_block_34', 'cambridge_block_33', 'cambridge_block_6', 'cambridge_block_4', 'cambridge_block_9' ]
  validation: [ 'birmingham_block_1', 'birmingham_block_5', 'cambridge_block_10', 'cambridge_block_7' ]
  test: [ 'birmingham_block_2', 'birmingham_block_8', 'cambridge_block_15', 'cambridge_block_22', 'cambridge_block_16', 'cambridge_block_27' ]
  num_points: 180000
  max_range: 50.
  dataset_norm: False
  std_axis_norm: False

# Training
train:
  uncond_prob: 0.1
  uncond_w: 6.
  n_gpus: 2
  num_workers: 4
  max_epoch: 20
  lr: 0.0001
  batch_size: 2
  decay_lr: 1.0e-4

diff:
  beta_start: 3.5e-5
  beta_end: 0.007
  beta_func: 'linear'
  t_steps: 1000
  s_steps: 50
  reg_weight: 5.0

# Network
model:
  out_dim: 96

As you suggested, I modified the data loading code to take 9,000 input points instead of 18,000. Regarding the generated results: those are the diffusion output only; the refinement network is not used. I set denoising_steps (-T) to 50 and cond_weight (-s) to 2 and 12, respectively.

Yacovitch commented 4 months ago

I generated training data from the SensatUrban dataset, which has an average point density of 320 points per square meter. Ground-truth samples were collected using radius sampling: centroids were randomly selected, and all points within a 12-meter radius were gathered. The data was then downsampled using grid sampling with a voxel size of 0.2 meters to generate the input data.
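Roughly, that sample generation could be sketched like this (assuming open3d; the radius and voxel size match the numbers above, everything else is illustrative):

```python
import numpy as np
import open3d as o3d

def make_sample(points, radius=12.0, voxel=0.2):
    # Pick a random centroid and keep all points within the radius (ground truth)
    center = points[np.random.randint(len(points))]
    gt = points[np.linalg.norm(points - center, axis=1) < radius]

    # Grid (voxel) downsampling of the ground truth produces the sparse input
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(gt.astype(np.float64))
    inp = np.asarray(pcd.voxel_down_sample(voxel_size=voxel).points)
    return inp, gt
```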

nuneslu commented 4 months ago

Alright! Can you try generating again with more denoising steps, with -T 200 for example?

Yacovitch commented 4 months ago

I used -T 200, which does not change the result drastically, but I think it has better resolution: [image]

I am also attaching an image that shows the overlap between the input (blue, thicker points) and the generated result (yellow points): [image]

nuneslu commented 4 months ago

I would suggest training for longer and removing the scheduler from models.py.

Since you have less data, you will certainly need to train for longer; but with the scheduler, the learning rate would decrease faster than it should and the model would not converge.

Yacovitch commented 4 months ago

Ok, thank you for the suggestion. I have two questions.

  1. How do I disable the scheduler?
  2. How many samples would be sufficient?
nuneslu commented 4 months ago
  1. You can replace this line with just "return optimizer".
  2. For training the diffusion model, the number of iterations matters a lot (at each training iteration only a single random step out of the T denoising steps is trained), so you can either increase the number of samples to roughly the same amount as SemanticKITTI or increase the number of epochs proportionally.
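For context, in a PyTorch Lightning module that change amounts to something like this (a sketch only; the Adam optimizer and the way the learning rate is read from the train config are assumptions, the actual code in models.py may differ):

```python
import torch

def configure_optimizers(self):
    # Scheduler removed: returning only the optimizer keeps the learning
    # rate constant for the whole training instead of decaying it.
    optimizer = torch.optim.Adam(self.parameters(), lr=self.hparams['train']['lr'])
    return optimizer
```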
Yacovitch commented 4 months ago

I trained without a scheduler, but after 14 epochs the loss becomes NaN. So I checked the generated result and found that all generated points are placed at the origin (all 180,000 points have XYZ 0, 0, 0). Have you encountered this problem?

I checked the total number of samples: SemanticKITTI has 43,552, and my data has 43,170. Do you still suggest increasing the sample size?

Also, I realized that the generated points always cover a larger area than the input points. Could this be an issue?

Lastly, as I mentioned previously, I am using radius sampling. Would KNN sampling (a fixed number of sampled points instead of a radius) be beneficial?

nuneslu commented 4 months ago

If the loss goes to NaN, the learning rate is too high; you can decrease it by half and try training again. Regarding the points placed at the origin, I haven't seen this problem before.

If the total number of samples is the same, it should be fine; no need to increase the amount of samples.

Regarding the generated point cloud covering a larger area, it is fine; it should converge closer to the ground truth, but it is normal for the model to "hallucinate" a bit outside the expected range.

I tried KNN with K bigger than 1 before and it also diverged, so I would suggest using KNN as in the original implementation first; once it works with the original implementation, you can try changing it and comparing.

I have been using the same code for a different project recently, and it was working with minor changes. If you think it could help, we can schedule a call to discuss it in more detail. :)

Yacovitch commented 4 months ago

That sounds fantastic! When are you available? I have quite flexible hours, so I will match my time based on your availability.

nuneslu commented 4 months ago

You can send me an email and we can arrange the meeting that way: lucas.nunes@igg.uni-bonn.de

Yacovitch commented 4 months ago

Hi there, thank you for the discussion yesterday, it was very informative! As we discussed, I visualized the input with noise added at different steps. I would like to share the visualization here: [image]

The top row is from the KITTI dataset with the default beta setting (3.5e-5 to 0.007). The second row is from my custom dataset with the default setting. The third row is from my custom dataset with betas from 1e-5 to 0.001.

T=999 on the second row is too fuzzy, so I reduced the noise by decreasing the betas. The result with betas from 1e-5 to 0.001 is the following: [image]

As you can see, the result got worse. Did I do something wrong when adjusting the beta values, or does this imply that I should add more noise instead?
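For what it's worth, the amount of noise left at the last step can be checked directly from the beta schedule with the standard DDPM cumulative product (a small numpy sketch using the values above):

```python
import numpy as np

def final_signal_ratio(beta_start, beta_end, t_steps=1000):
    # Linear beta schedule as in the config; sqrt(alpha_bar_T) is the fraction
    # of the clean signal still present at the last diffusion step.
    betas = np.linspace(beta_start, beta_end, t_steps)
    alpha_bar = np.cumprod(1.0 - betas)
    return float(np.sqrt(alpha_bar[-1]))

print(final_signal_ratio(3.5e-5, 0.007))  # default schedule, ~0.17
print(final_signal_ratio(1e-5, 0.001))    # reduced-noise schedule, ~0.78
```

With the default range this comes out to roughly 0.17, while the 1e-5 to 0.001 range stays around 0.78, i.e. the reduced schedule leaves most of the clean scan intact even at T=999.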

Also, I tried setting reg_weight to 0, and I got this result: [image]

The network does not converge well when reg_weight is set to 0. What do you think?

nuneslu commented 4 months ago

I would try to replicate the same setup we have for SemanticKITTI, but using your data: use the default parameters and make the conditioning point cloud as sparse as it would be with the SemanticKITTI data. Once it converges with the sparser data, I would then try modifications to make it work with the denser data.

Yacovitch commented 4 months ago

Hi there, I have great news: it finally worked! Here is what I did. I regenerated the data to be similar to the KITTI dataset (50 m range, 180,000-point clouds), and I realized that I had set the resolution to 0.2 instead of 0.05. I changed it back to 0.05 and it worked. Thank you very much for all of your help!

nuneslu commented 4 months ago

I'm really glad to hear that! If you don't mind, could you share some results here? I'm curious to see them.

Now I would say you can try changing it to use a denser point cloud as the condition. If any questions arise, please don't hesitate to contact me again. :)

Yacovitch commented 4 months ago

I validated the result using a small amount of training data (3000 training samples). Now, I am training the network with more data (40000 training samples). Once I get the results from this training, I will definitely share the results with you! Once again, thank you very much.

Yacovitch commented 3 months ago

Hi there, sorry for the delayed reply. Here is my visualization of the results.

[image]

Generated point clouds are still a bit noisy, but I am happy that I was able to make the network work on my dataset! Do you have any suggestions on how to suppress the noise?

Yacovitch commented 3 months ago

Also, I am trying to measure the performance on my custom dataset. I was able to run the test mode with python3 train.py -t, but when I look at the generated point clouds, all of the produced results look something like this: [image] where the CD is 2.64. Am I doing something wrong? I am also attaching lines 22-40 of train.py here.

@click.command()
### Add your options here
@click.option('--config',
              '-c',
              type=str,
              help='path to the config file (.yaml)',
              default=join(dirname(abspath(__file__)),'config/config.yaml'))
@click.option('--weights',
              '-w',
              type=str,
              default='/nas2/jacob/LiDiff/lidiff/experiments/sensat_random_radius_50/default/version_13/checkpoints/sensat_random_radius_50_epoch=19.ckpt',
              help='path to pretrained weights (.ckpt). Use this flag if you just want to load the weights from the checkpoint file without resuming training.')
@click.option('--checkpoint',
              '-ckpt',
              type=str,
              help='path to checkpoint file (.ckpt) to resume training.',
              default=None)
@click.option('--test', '-t', is_flag=True, help='test mode')
nuneslu commented 3 months ago

> Generated point clouds are still a bit noisy, but I am happy that I was able to make the network work on my dataset! Do you have any suggestions on how to suppress the noise?

Glad to hear that it worked out! To suppress the noise you can try increasing the number of steps during inference. In our supplementary material we show how it impacts the final result: [image]

With 50 steps it generates a reasonable point cloud, but by increasing the number of denoising steps you should get better results.

nuneslu commented 3 months ago

> Also, I am trying to extract the performance on my custom dataset. I was able to run the test mode by running python3 train.py -t, but all of the produced results look something like this: [image] where the CD is 2.64. Am I doing something wrong?

For the evaluation, I would suggest using diff_completion_pipeline.py to generate the point clouds and evaluating them afterward. I will take a look at the implementation of test_step and check whether there is a bug there, because I have actually always used the pipeline to generate and evaluate the results, so there may be a bug in the model's test_step.

Yacovitch commented 3 months ago

Could you share your evaluation code?

nuneslu commented 3 months ago

The evaluation file is also in the repo, at lidiff/utils/eval_path.py.
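For reference, a generic Chamfer distance between a completed scan and its ground truth can be computed along these lines (a sketch with scipy KD-trees; eval_path.py in the repo may differ in details such as the exact distance definition or any voxelization):

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pred, gt):
    # Symmetric Chamfer distance between two (N, 3) point arrays:
    # mean nearest-neighbor distance in both directions, summed.
    d_pred, _ = cKDTree(gt).query(pred)
    d_gt, _ = cKDTree(pred).query(gt)
    return float(d_pred.mean() + d_gt.mean())
```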

nuneslu commented 2 months ago

I will close this issue now. In case you have further questions, feel free to reopen it. :)