crockwell / Cap3D

[NeurIPS 2023] Scalable 3D Captioning with Pretrained Models
https://huggingface.co/datasets/tiange/Cap3D

point clouds from 16384x6->4096|1024x6 #3

Closed YueWuHKUST closed 1 year ago

YueWuHKUST commented 1 year ago

Thanks for your wonderful work! I noticed that you fine-tune the Point-E model. The Cap3D dataset provides point clouds of shape 16384x6, but Point-E expects 1024x6 for stage-1 training and 4096x6 for stage-2 upsampling. How do you convert 16384 points to 1024 or 4096? Could you provide more details?

tiangeluo commented 1 year ago

Hi Yue, thank you for your interest.

For fine-tuning the Point-E model, we only fine-tune the stage-1 diffusion model (text -> 1024 x 6, base40M-textvec); see https://github.com/crockwell/Cap3D/blob/3025a085abc19fe7532ae8d8d34e6689fa8b3847/text-to-3D/finetune_pointE.py#L119-L124. During inference, we use the fine-tuned first-stage diffusion model together with the pre-trained second-stage upsampling model (1024 -> 4096), as sketched below.
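Roughly, the inference pipeline looks like this (a minimal sketch using the stock `point_e` sampler API; the fine-tuned checkpoint path and the prompt are placeholders, not files from this repo):

```python
import torch
from point_e.diffusion.configs import DIFFUSION_CONFIGS, diffusion_from_config
from point_e.diffusion.sampler import PointCloudSampler
from point_e.models.configs import MODEL_CONFIGS, model_from_config
from point_e.models.download import load_checkpoint

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Stage 1: text-conditioned base model producing 1024 points.
base_name = 'base40M-textvec'
base_model = model_from_config(MODEL_CONFIGS[base_name], device)
base_diffusion = diffusion_from_config(DIFFUSION_CONFIGS[base_name])
# Assumption: load your fine-tuned weights here instead of the stock checkpoint,
# e.g. base_model.load_state_dict(torch.load('finetuned_pointE.pth'))
base_model.load_state_dict(load_checkpoint(base_name, device))
base_model.eval()

# Stage 2: pre-trained upsampler taking 1024 -> 4096 points (kept frozen).
upsampler_model = model_from_config(MODEL_CONFIGS['upsample'], device)
upsampler_diffusion = diffusion_from_config(DIFFUSION_CONFIGS['upsample'])
upsampler_model.load_state_dict(load_checkpoint('upsample', device))
upsampler_model.eval()

sampler = PointCloudSampler(
    device=device,
    models=[base_model, upsampler_model],
    diffusions=[base_diffusion, upsampler_diffusion],
    num_points=[1024, 4096 - 1024],
    aux_channels=['R', 'G', 'B'],
    guidance_scale=[3.0, 0.0],
    model_kwargs_key_filter=('texts', ''),  # only stage 1 sees the text prompt
)

samples = None
for x in sampler.sample_batch_progressive(batch_size=1, model_kwargs=dict(texts=['a wooden chair'])):
    samples = x  # final iterate holds the 4096-point cloud
```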

For converting 16,384 points into 1,024 points, as in the paper, we randomly sample 1,024 of the 16,384 points before training (the sampled 1,024 points are kept fixed during training): `[:, torch.randperm(16384)[:1024]]`. A better alternative is to apply farthest point sampling over the 16,384 points; see the sketch below.
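For illustration, a small subsampling sketch (assuming a channels-first `(6, 16384)` xyz+rgb tensor, as the indexing above suggests; the FPS helper is a plain greedy implementation, not code from this repo):

```python
import torch

def fixed_random_subsample(pc: torch.Tensor, n: int = 1024, seed: int = 0) -> torch.Tensor:
    # pc: (6, 16384) point cloud; draw one fixed random subset so the
    # same n points are reused at every training step.
    g = torch.Generator().manual_seed(seed)
    idx = torch.randperm(pc.shape[1], generator=g)[:n]
    return pc[:, idx]

def farthest_point_sample(xyz: torch.Tensor, n: int = 1024) -> torch.Tensor:
    # xyz: (N, 3) coordinates; greedy FPS returning indices of n well-spread points.
    N = xyz.shape[0]
    idx = torch.zeros(n, dtype=torch.long)
    dist = torch.full((N,), float('inf'))
    farthest = torch.randint(N, (1,)).item()
    for i in range(n):
        idx[i] = farthest
        dist = torch.minimum(dist, ((xyz - xyz[farthest]) ** 2).sum(dim=1))
        farthest = int(torch.argmax(dist))
    return idx

# Example: pc is (6, 16384); FPS runs on the xyz rows only.
pc = torch.rand(6, 16384)
pc_rand = fixed_random_subsample(pc, 1024)             # (6, 1024)
pc_fps = pc[:, farthest_point_sample(pc[:3].T, 1024)]  # (6, 1024)
```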

YueWuHKUST commented 1 year ago

Thanks for your quick reply! I understand your solution.