It surprises me that a 7-dim input to the MLP performs so well, while other SOTAs use high-dimensional feature encodings. You attribute this to the purely local feature, which seems to contradict PIFuHD: they found that "3D reconstruction using high-resolution features without holistic reasoning severely suffers from depth ambiguity and is unable to generalize with input size discrepancy between training and inference." Since you both use normal maps to recover local details, I wonder whether the main contributors to reducing depth ambiguity in your work are the first two terms: the distance to the nearest SMPL vertex and its normal. I integrated them into PaMIR, but it still cannot generalize well on in-the-wild images. I notice that all three feature terms carry a geometric property, either a distance in $\mathbb{R}^3$ or a surface normal. Is it possible that pure geometric feature encoding is better at narrowing the gap between the input of the implicit function and the occupancy field to be learned? Or, even if the 2D and 3D receptive fields were reduced to nearly 1, could previous SOTAs still not be improved?
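For concreteness, here is a minimal sketch of how such a three-term, 7-dim local feature (1-dim signed distance to the body surface + 3-dim nearest body normal + 3-dim pixel-aligned image normal) could be assembled per query point. This is an illustrative approximation, not ICON's actual implementation: it uses a brute-force nearest-vertex search and signs the distance by the side of the local tangent plane, whereas a real pipeline would use a proper point-to-mesh SDF.

```python
import numpy as np

def local_feature(points, smpl_verts, smpl_normals, pixel_normals):
    """Sketch of a 7-dim per-point feature: [sdf (1) | body normal (3) | image normal (3)].

    points:        (N, 3) query points in R^3
    smpl_verts:    (V, 3) SMPL vertex positions
    smpl_normals:  (V, 3) per-vertex outward normals
    pixel_normals: (N, 3) pixel-aligned normals sampled from the 2D normal map
    """
    # Brute-force nearest SMPL vertex for each query point (a KD-tree or
    # point-to-mesh distance would replace this in practice).
    d2 = ((points[:, None, :] - smpl_verts[None, :, :]) ** 2).sum(-1)  # (N, V)
    idx = d2.argmin(axis=1)
    dist = np.sqrt(d2[np.arange(len(points)), idx])
    n = smpl_normals[idx]                                              # (N, 3)
    # Approximate the sign (inside vs. outside the body) by the side of the
    # nearest vertex's tangent plane.
    sign = np.sign(((points - smpl_verts[idx]) * n).sum(-1))
    sdf = sign * dist
    return np.concatenate([sdf[:, None], n, pixel_normals], axis=1)    # (N, 7)
```

Note that every term here is a pure geometric quantity (a distance or a normal), with no learned 2D/3D convolutional features and hence an effectively pointwise receptive field, which is exactly the property the question is probing.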
The SDF matters most for ICON; the improvement brought by the body normal is minor.
The main reason PaMIR/PIFuHD cannot generalize well to unseen poses is their use of 2D/3D global encoders.
I once reduced the receptive field of PaMIR, which led to even worse reconstruction. I guess this is because ICON's SMPL prior plays the same role as PaMIR's voxel feature, yet the voxel feature loses holistic information under a smaller receptive field.