It surprises me that a 7-dim input to the MLP performs so well, while other SOTAs use high-dimensional feature encodings. You attribute this to the purely local feature, which seems to contradict PIFuHD: they found that "3D reconstruction using high-resolution features without holistic reasoning severely suffers from depth ambiguity and is unable to generalize with input size discrepancy between training and inference." Since you both use normal maps to recover local details, I wonder whether the main contributors to reducing depth ambiguity in your work are the first two terms: the distance to the nearest SMPL vertex and its normal. I integrated them into PaMIR, but it still cannot generalize well on in-the-wild images. I notice that all three feature terms carry a geometric property, either a distance in $\mathbb{R}^3$ or a surface normal. Is it possible that pure geometric feature encoding is better at narrowing the gap between the input of the implicit function and the occupancy field to be learned? Or, even if the 2D and 3D receptive fields were reduced to nearly 1, could previous SOTAs still not be improved?
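For concreteness, here is a minimal sketch of how such a three-term, 7-dim local feature (1-dim signed distance to the body surface + 3-dim nearest body normal + 3-dim pixel-aligned image normal) could be assembled per query point. This is an illustrative approximation, not ICON's actual implementation: it uses a brute-force nearest-vertex search and signs the distance by the side of the local tangent plane, whereas a real pipeline would use a proper point-to-mesh SDF.

```python
import numpy as np

def local_feature(points, smpl_verts, smpl_normals, pixel_normals):
    """Sketch of a 7-dim per-point feature: [sdf (1) | body normal (3) | image normal (3)].

    points:        (N, 3) query points in R^3
    smpl_verts:    (V, 3) SMPL vertex positions
    smpl_normals:  (V, 3) per-vertex outward normals
    pixel_normals: (N, 3) pixel-aligned normals sampled from the 2D normal map
    """
    # Brute-force nearest SMPL vertex for each query point (a KD-tree or
    # point-to-mesh distance would replace this in practice).
    d2 = ((points[:, None, :] - smpl_verts[None, :, :]) ** 2).sum(-1)  # (N, V)
    idx = d2.argmin(axis=1)
    dist = np.sqrt(d2[np.arange(len(points)), idx])
    n = smpl_normals[idx]                                              # (N, 3)
    # Approximate the sign (inside vs. outside the body) by the side of the
    # nearest vertex's tangent plane.
    sign = np.sign(((points - smpl_verts[idx]) * n).sum(-1))
    sdf = sign * dist
    return np.concatenate([sdf[:, None], n, pixel_normals], axis=1)    # (N, 7)
```

Note that every term here is a pure geometric quantity (a distance or a normal), with no learned 2D/3D convolutional features and hence an effectively pointwise receptive field, which is exactly the property the question is probing.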
The SDF matters most for ICON; the improvement brought by the body normal is minor.
The main reason PaMIR/PIFuHD cannot generalize well to unseen poses is their use of 2D/3D global encoders.
I once reduced the receptive field of PaMIR, which led to even worse reconstruction. I guess this is because ICON's SMPL prior plays the same role as PaMIR's voxel feature, yet the voxel feature loses holistic information under a smaller receptive field.