Hi! I am training and testing this network with my custom dataset. I was able to modify the code for training, but my results look bizarre; have you encountered this issue? Also, the training loss decreases, but after a few epochs it gets stuck at around 0.9. Are there any configurations that I am missing?
Here are the input, ground truth, and the result, respectively.
Hi! So, first, regarding the loss getting stuck, I would say that it is normal. When training with SemanticKITTI I also noticed the same, but even with the loss "stuck" the results kept improving. Regarding the results, are those from the diffusion or from the refinement network? Because they look quite dense already.
Can you also save the point cloud with the noise added? In the pipeline code it should be x_feats from the “complete_scan” method.
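For reference, a minimal sketch of how that noisy point cloud could be dumped for inspection (assuming x_feats inside complete_scan is, or can be converted to, an (N, 3) torch tensor of xyz coordinates; the function name and output path are placeholders, not from the repo):

```python
import open3d as o3d

def save_noisy_input(x_feats, path='noisy_input.ply'):
    # Assumes x_feats is an (N, 3) torch tensor of noised xyz coordinates (an assumption).
    pts = x_feats.detach().cpu().numpy()[:, :3]
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(pts.astype('float64'))
    o3d.io.write_point_cloud(path, pcd)
```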
I would also suggest checking the results from previous checkpoints, and you can try varying the conditioning weight "uncond_w" in the config.
Hi, thank you for your reply! This is just the diffusion result because I want to ensure the diffusion part works first, and then I would like to move on to modifying the refinement network.
I extracted x_feats from "complete_scan" and found this.
Also, I am attaching results from epochs 4, 9, 14, and 19. Your insight on the results would be much appreciated! It seems like the results get worse as training progresses. Is it because of overfitting, or not enough data?
To check whether it is overfitting, I suggest giving one of the training point clouds as input. Also, you can try changing the conditioning weight in the completion pipeline, i.e., the -s parameter. Another thing you can try is giving a sparser point cloud as input: by default we sample 18,000 points as input, and you can try reducing that by half in the completion pipeline.
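As an illustration, a minimal sketch of what halving the conditioning input could look like (the 18,000/9,000 values come from the discussion above; the function and variable names are placeholders, not the repo's code):

```python
import numpy as np

def subsample_condition(points, num_points=9000):
    """Randomly subsample the conditioning point cloud (half of the usual 18,000 points)."""
    if points.shape[0] <= num_points:
        return points
    idx = np.random.choice(points.shape[0], num_points, replace=False)
    return points[idx]
```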
Is there any possibility that I do not have enough epochs? Because I tested on the training set, and it gave me similar results. Or do I not have enough training data? I made a slight modification to the code so that it now uses random sampling instead of farthest point sampling, and each epoch takes around 30 minutes to complete.
It could be that your dataset is too small. How many samples do you have? For me to have a better view of it, could you share the results using the conditioning parameter -s set to 2 and to 12?
The random sampling should work in your case. For LiDAR scans we stick to farthest point sampling because the point cloud has too many points in the center area, so random sampling would under-represent the farther regions in the conditioning point cloud.
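For completeness, a minimal numpy sketch of farthest point sampling, which keeps coverage of far regions better than random sampling on LiDAR scans (illustrative only, not the repo's implementation):

```python
import numpy as np

def farthest_point_sampling(points, num_samples):
    """Iteratively pick the point farthest from the set selected so far."""
    n = points.shape[0]
    selected = np.zeros(num_samples, dtype=np.int64)
    selected[0] = np.random.randint(n)
    dist = np.full(n, np.inf)
    for i in range(1, num_samples):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))
        selected[i] = np.argmax(dist)
    return points[selected]
```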
I have 2794 samples in total. And here are the results from two inputs.
Can you share the config file used for training and the command you used to generate those results? I want to check the hyperparameters used for training and during inference. From the images it is hard to tell, especially because the input point cloud is quite hard to see.
Sorry, I think this shows a better view of the input point clouds.
This is the configuration for training:
```yaml
experiment:
  id: sensat_radius_9000_input

data:
  data_dir: './Datasets/Sensat_0.200_radius_12'
  resolution: 0.2
  dataloader: 'Sensat'
  split: 'train'
  train: [ 'birmingham_block_0', 'birmingham_block_12', 'birmingham_block_10', 'birmingham_block_13',
           'birmingham_block_11', 'birmingham_block_4', 'birmingham_block_3', 'birmingham_block_9',
           'birmingham_block_6', 'birmingham_block_7', 'cambridge_block_12', 'cambridge_block_17',
           'cambridge_block_18', 'cambridge_block_19', 'cambridge_block_14', 'cambridge_block_2',
           'cambridge_block_23', 'cambridge_block_20', 'cambridge_block_21', 'cambridge_block_25',
           'cambridge_block_26', 'cambridge_block_28', 'cambridge_block_3', 'cambridge_block_32',
           'cambridge_block_34', 'cambridge_block_33', 'cambridge_block_6', 'cambridge_block_4',
           'cambridge_block_9' ]
  validation: [ 'birmingham_block_1', 'birmingham_block_5', 'cambridge_block_10', 'cambridge_block_7' ]
  test: [ 'birmingham_block_2', 'birmingham_block_8', 'cambridge_block_15', 'cambridge_block_22',
          'cambridge_block_16', 'cambridge_block_27' ]
  num_points: 180000
  max_range: 50.
  dataset_norm: False
  std_axis_norm: False

train:
  uncond_prob: 0.1
  uncond_w: 6.
  n_gpus: 2
  num_workers: 4
  max_epoch: 20
  lr: 0.0001
  batch_size: 2
  decay_lr: 1.0e-4

diff:
  beta_start: 3.5e-5
  beta_end: 0.007
  beta_func: 'linear'
  t_steps: 1000
  s_steps: 50
  reg_weight: 5.0

model:
  out_dim: 96
```
As you suggested, I modified the data loading code to take 9,000 input points instead of 18,000.
For generating the results: those are the diffusion output; the refinement network is not used. I set denoising_steps (-T) to 50 and cond_weight (-s) to 2 and 12, respectively.
I generated training data from the SensatUrban dataset, which has an average point density of 320 points per square meter. Ground-truth samples were collected using radius sampling: centroids were randomly selected, and all points within a 12-meter radius were gathered. The data was then downsampled using grid sampling with a voxel size of 0.2 meters to generate the input data.
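A minimal sketch of that sampling procedure (using open3d; the 12 m radius and 0.2 m voxel size are taken from the description above, everything else, including the function name, is illustrative):

```python
import numpy as np
import open3d as o3d

def make_sample(pcd, radius=12.0, voxel_size=0.2):
    """Crop a random 12 m-radius patch (ground truth) and grid-downsample it (input)."""
    pts = np.asarray(pcd.points)
    centroid = pts[np.random.randint(len(pts))]
    tree = o3d.geometry.KDTreeFlann(pcd)
    _, idx, _ = tree.search_radius_vector_3d(centroid, radius)
    gt = pcd.select_by_index(idx)
    inp = gt.voxel_down_sample(voxel_size=voxel_size)
    return gt, inp
```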
Alright! Can you try generating again with more denoising steps, with -T 200 for example?
I used -T 200, which does not change the result drastically, but I think it has a better resolution.
I am also attaching an image that shows the overlap between the inputs (blue, thicker points) and the generated results (yellow points).
I would suggest training for longer and removing the scheduler from models.py. Since you have less data, you will definitely need to train for longer, but with the scheduler the learning rate would decrease faster than it should and the model would not converge.
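To illustrate what removing the scheduler could look like, here is a minimal sketch of a PyTorch Lightning configure_optimizers that returns only the optimizer (the class name, optimizer choice, and hparams layout are assumptions, not the repo's exact code):

```python
import torch
import pytorch_lightning as pl

class DiffusionModel(pl.LightningModule):  # name assumed for illustration
    def configure_optimizers(self):
        # Return only the optimizer: with no LR scheduler attached, the learning
        # rate stays at the configured value for the whole (longer) training run.
        return torch.optim.Adam(self.parameters(), lr=self.hparams['train']['lr'])
```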
Ok, thank you for the suggestion. I have two questions.
I trained without a scheduler, but after 14 epochs the loss becomes NaN. So I checked the generated result and found out that all generated points are placed at the origin (all 180,000 points have XYZ 0, 0, 0). Have you encountered this problem?
I checked the total number of samples; SemanticKITTI has 43,552, and my data has 43,170. Do you still suggest increasing the sample size?
Also, I realized that the generated points always cover a larger area than the input points. Could this be an issue?
Lastly, as I mentioned previously, I am using radius sampling. Would KNN sampling (a fixed number of sampled points instead of a radius) be beneficial?
If the loss goes to NaN, the learning rate is too high; you can decrease it by half and try training again. Regarding the points placed at the origin, I haven't seen this problem before.
If the total number of samples is roughly the same, it should be fine; there is no need to increase the amount of samples.
Regarding the point cloud covering a larger area, it is fine. It should converge closer to the ground truth, but it is normal for the model to "hallucinate" a bit outside the expected range.
I tried KNN before with K bigger than 1 and it also diverged, so I would suggest keeping the sampling as in the original implementation first; once it works with the original implementation, you can try changing it and comparing.
I have been using the same code for a different project recently and it was working with minor changes. If you think it could help we can schedule a call to discuss it better. :)
That sounds fantastic! When are you available? I have quite flexible hours, so I will match my time based on your availability.
You can send me an email and we can arrange the meeting through the email: lucas.nunes@igg.uni-bonn.de
Hi there, thank you for the discussion yesterday, it was very informative! As we discussed, I visualized the input with noise added at different steps. I would like to share the visualization here.
The top row is from the KITTI dataset with the default setting 3.5e-5 to 0.007. The 2nd row is from my custom dataset with the default setting 3.5e-5 to 0.007. The 3rd row is from my custom dataset with the setting 1e-5 to 0.001.
T=999 on the second row is too fuzzy, so I reduced the noise by decreasing the beta values. The result with betas from 1e-5 to 0.001 is the following:
As you can see, the result got worse. Did I do something wrong when adjusting the beta values? Or does this imply that I have to add more noise instead?
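(For reference, under the standard DDPM forward process the amount of signal left at step T is governed by the cumulative product of (1 - beta_t). A minimal sketch comparing the two linear schedules discussed above, which quantifies how "fuzzy" the noised input is at T; this is illustrative and not the repo's code:

```python
import numpy as np

def signal_at_T(beta_start, beta_end, t_steps=1000):
    """sqrt(alpha_bar_T): fraction of the clean signal remaining after t_steps of noising."""
    betas = np.linspace(beta_start, beta_end, t_steps)
    alpha_bar = np.prod(1.0 - betas)
    return np.sqrt(alpha_bar)

print(signal_at_T(3.5e-5, 0.007))  # ~0.17: the input is close to pure noise at T
print(signal_at_T(1e-5, 0.001))    # ~0.78: much of the clean structure survives at T
```
)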
Also, I tried setting reg_weight to 0, and I got this result:
The network does not converge well when reg_weight is set to 0. What do you think?
I would try to replicate the same setup that we have on SemanticKITTI but using your data. So, using the default parameters and trying to have the condition point cloud as sparse as the one you would have when using the SemanticKITTI data. After it converges with the sparser data then I would try to do modifications to work with the denser data.
Hi there, I have great news! It finally worked! Here is what I did: I regenerated the data to be similar to the KITTI dataset (50 m range, 180,000-point clouds). I also realized that I had set the resolution to 0.2 instead of 0.05; I changed it back to 0.05 and it worked. Thank you very much for all of your help!
I'm really glad to hear that! If you don't mind, could you share some results here? I'm curious to see them.
Now I would say that you can try changing it to use a denser point cloud as the condition. If any questions arise, please don't hesitate to contact me again. :)
I validated the result using a small amount of training data (3000 training samples). Now, I am training the network with more data (40000 training samples). Once I get the results from this training, I will definitely share the results with you! Once again, thank you very much.
Hi there, sorry for the delayed reply. Here is my visualization of the results.
Generated point clouds are still a bit noisy, but I am happy that I was able to make the network work on my dataset! Do you have any suggestions on how to suppress the noise?
Also, I am trying to measure the performance on my custom dataset. I was able to run the test mode by running python3 train.py -t, but when I look at the generated point clouds, all of the produced results look something like this, where CD is 2.64. Am I doing something wrong? I am also attaching lines 22-40 of train.py here.
```python
@click.command()
### Add your options here
@click.option('--config',
              '-c',
              type=str,
              help='path to the config file (.yaml)',
              default=join(dirname(abspath(__file__)), 'config/config.yaml'))
@click.option('--weights',
              '-w',
              type=str,
              help='path to pretrained weights (.ckpt). Use this flag if you just want to load the weights from the checkpoint file without resuming training.',
              default='/nas2/jacob/LiDiff/lidiff/experiments/sensat_random_radius_50/default/version_13/checkpoints/sensat_random_radius_50_epoch=19.ckpt')
@click.option('--checkpoint',
              '-ckpt',
              type=str,
              help='path to checkpoint file (.ckpt) to resume training.',
              default=None)
@click.option('--test', '-t', is_flag=True, help='test mode')
```
Glad to hear that it worked out! To suppress the noise, you can try increasing the number of steps during inference. In our supplementary material we show how it impacts the final result:
With 50 steps it generates a reasonable point cloud, but by increasing the number of denoising steps you should get better results.
For the evaluation, I would suggest using diff_completion_pipeline.py to generate the point clouds and evaluate them afterward. I will take a look into the implementation of the test_step and check whether there is a bug there, because I have actually always used the pipeline to generate and evaluate the results, so there may be some bug in the model's test_step.
Could you share your evaluation code?
The evaluation file is also in the repo at lidiff/utils/eval_path.py
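For a quick sanity check outside the repo's evaluation script, a minimal Chamfer distance sketch (symmetric mean nearest-neighbour distance using scipy; this is not the implementation from lidiff/utils/eval_path.py):

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pred, gt):
    """Symmetric mean nearest-neighbour distance between two (N, 3) point arrays."""
    d_pred_to_gt, _ = cKDTree(gt).query(pred)
    d_gt_to_pred, _ = cKDTree(pred).query(gt)
    return d_pred_to_gt.mean() + d_gt_to_pred.mean()
```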
I will close this issue now. In case you have further questions, feel free to reopen it. :)