River-Zhang / GTA

[NeurIPS 23] Official repository for NeurIPS 2023 paper "Global-correlated 3D-decoupling Transformer for Clothed Avatar Reconstruction"
https://river-zhang.github.io/GTA-projectpage/

ViTencoder input #15

Open · Followmeczx opened this issue 5 months ago

Followmeczx commented 5 months ago

I noticed that the front/back normal maps are used, together with the image, as input to the encoder to generate the three-plane features. Why is that? Does it improve the results?


Reading the code, I found that after the three-plane feature maps are obtained, they are concatenated with the normal features.


If I only pass the image through ViTPose's pre-trained ViT encoder to get image features, then run those through the three decoders to get the three-plane features and concatenate them with the normal features, would that work?
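
For reference, a minimal sketch of the fusion being described, assuming PyTorch; `vit_encoder`, `plane_decoders`, and `normal_feature_net` are hypothetical placeholders, not identifiers from this repository:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real modules (placeholders only).
vit_encoder = nn.Identity()         # ViT backbone over the concatenated input
plane_decoders = nn.ModuleList([nn.Identity() for _ in range(3)])  # one per plane
normal_feature_net = nn.Identity()  # feature extractor for the normal maps

def fuse_features(image, normal_front, normal_back):
    # Encoder input: the RGB image concatenated channel-wise with both normal maps.
    x = torch.cat([image, normal_front, normal_back], dim=1)  # (B, 9, H, W)
    tokens = vit_encoder(x)
    # One decoder per canonical plane yields the three-plane feature maps.
    planes = [dec(tokens) for dec in plane_decoders]
    # Each plane feature is then concatenated with features from the normal maps.
    normal_feat = normal_feature_net(torch.cat([normal_front, normal_back], dim=1))
    return [torch.cat([p, normal_feat], dim=1) for p in planes]
```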

River-Zhang commented 5 months ago

Yes, it can improve the performance a little. Since normal maps of the input image are easy to acquire (using a pre-trained normal estimation model), we also input the front/back normal maps. However, if you want to input only a single image, the whole model may have to be retrained.
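
To illustrate why retraining would be needed, here is a sketch under the assumption that the normals are concatenated channel-wise at the input (the channel counts and embed dim below are illustrative assumptions, not the repository's actual configuration): the first patch-embedding projection is trained on 9 input channels, so a 3-channel, image-only input cannot reuse its weights.

```python
import torch.nn as nn

# Assumed ViT-style patch embeddings (dimensions are illustrative only).
patch_embed_with_normals = nn.Conv2d(9, 768, kernel_size=16, stride=16)  # RGB + 2 normal maps
patch_embed_image_only = nn.Conv2d(3, 768, kernel_size=16, stride=16)    # RGB only; needs new weights
```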

Followmeczx commented 5 months ago

I have one more question. Since I use the ViTPose pre-trained model, the input image resolution is (256, 192) and the final feature dim is 1024, so the encoder produces an output of size 192x1024. I would like to use your method afterwards, but your ViT model produces a 1024x256 output. Do I need to change the image resolution to (512, 512) and the feature dim to 256? Or can I simply change image_size and dim to 1024? I don't know whether this will affect the later stages. It seems (512, 512) is the resolution ICON expects.
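
For what it's worth, both shapes follow directly from the patch grid; a quick check assuming a standard 16x16 patch embedding with no class token (an assumption on my part):

```python
def num_tokens(height, width, patch=16):
    # Token count of a ViT for a given input resolution (no class token).
    return (height // patch) * (width // patch)

print(num_tokens(256, 192))  # 192 tokens  -> 192x1024 output at feature dim 1024
print(num_tokens(512, 512))  # 1024 tokens -> 1024x256 output at feature dim 256
```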

River-Zhang commented 5 months ago

If you just want to use our model for inference, you can simply input the image and the script will automatically resize it to (512, 512). However, if you want to use it in training, you will need to change the parameters and retrain the model. I'm not sure what you mean by saying you used the ViTPose pre-trained model.
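
A minimal sketch of that resize step, assuming PIL (the script's actual preprocessing may differ):

```python
from PIL import Image

img = Image.open("input.png").convert("RGB")
img = img.resize((512, 512), Image.BILINEAR)  # the model expects (512, 512) inputs
```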

Followmeczx commented 5 months ago

I have one more question.


I can't seem to find the code in PIFuDataset that shows how the sample points, labels, and calib are obtained. What information does the calib parameter contain? Is it the rotation and translation of the extrinsic parameters plus the focal length and principal point of the intrinsic parameters? I am using the Human3.6M dataset, and I found that its camera parameters only include the intrinsic focal length and principal point. Can you give me some advice?
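
For context, in PIFu-style data loaders `calib` is typically a single 4x4 matrix that composes the extrinsic (rotation `R`, translation `t`) with the intrinsic projection, and sample points are projected with it directly. A sketch of that convention (my reading of the PIFu lineage, not code from this repository):

```python
import numpy as np

def make_calib(K, R, t):
    # Compose a 4x4 calibration matrix: intrinsic @ extrinsic (PIFu-style).
    extrinsic = np.eye(4)
    extrinsic[:3, :3] = R   # 3x3 rotation
    extrinsic[:3, 3] = t    # 3-vector translation
    intrinsic = np.eye(4)
    intrinsic[:3, :3] = K   # focal length and principal point
    return intrinsic @ extrinsic

def project(points, calib):
    # Project (N, 3) sample points; for perspective intrinsics you would
    # additionally divide by depth after this affine transform.
    rot, trans = calib[:3, :3], calib[:3, 3]
    return points @ rot.T + trans
```

Under that convention, intrinsics alone are not enough; if the dataset really only provides focal length and principal point, the extrinsic part would have to come from elsewhere (for example, an assumed canonical camera pose).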