facebookresearch / DensePose

A real-time approach for mapping all human pixels of 2D RGB images to a 3D surface-based model of the body
http://densepose.org

Questions about constants in code (body_uv_rcnn_heads.py) and transformation "AnnIndex_lowres" to "AnnIndex". #213

Open vlad-filin opened 5 years ago

vlad-filin commented 5 years ago

Hello! Thank you for providing the code; it gives a chance to fully understand how the model works.

I have several questions about the constants used in body_uv_rcnn_heads.py. They have no description or even a name, just a bare number in the code (e.g. the 15 on line 26).

  1. Line 26: model.ConvTranspose(blob_in, 'AnnIndex_lowres'+pref, dim, 15, ...). My guess is that 15 stands for the number of annotation classes (14) plus 1 for background. It would be nice to make it a config parameter (like BODY_UV_RCNN.NUM_PATCHES), or at least to explain this constant in a comment in body_uv_rcnn_heads.py; a sketch of such a change follows right after this list.
  2. Line 65: ### Now reshape UV blobs, such that they are 1x1x(196 NumSamples)xNUM_PATCHES, and line 70: ..., shape=(-1, cfg.BODY_UV_RCNN.NUM_PATCHES+1, 196). The paper "DensePose: Dense Human Pose Estimation in the Wild" mentions that at most 14 points are collected per body part, and there are 14 semantic body parts in the DensePose-COCO dataset, so my guess is that 196 = 14 x 14 stands for the maximum number of points over all semantic parts, but I am not sure about this. It would be nice to give this constant (196) a description as well.
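For illustration, here is a sketch of what such a change could look like. BODY_UV_RCNN.NUM_SEMANTIC_PARTS is a hypothetical config name that is not in the released code (a reply below proposes the same idea); everything else is copied from line 26 of body_uv_rcnn_heads.py:

```python
# Hypothetical addition to config.py (name is illustrative only):
# 14 semantic body parts are annotated in DensePose-COCO; +1 for background at use sites.
__C.BODY_UV_RCNN.NUM_SEMANTIC_PARTS = 14

# body_uv_rcnn_heads.py, line 26, rewritten against that config key:
model.ConvTranspose(
    blob_in, 'AnnIndex_lowres' + pref, dim,
    cfg.BODY_UV_RCNN.NUM_SEMANTIC_PARTS + 1,  # replaces the bare literal 15
    cfg.BODY_UV_RCNN.DECONV_KERNEL,
    pad=int(cfg.BODY_UV_RCNN.DECONV_KERNEL / 2 - 1),
    stride=2,
    weight_init=(cfg.BODY_UV_RCNN.CONV_INIT, {'std': 0.001}),
    bias_init=('ConstantFill', {'value': 0.}))
```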

I also have a question about the transformation of "AnnIndex_lowres" into "AnnIndex". This transformation is done via bilinear interpolation, which semantically should not change the number of the tensor's channels (and indeed, for the transformations "Index_UV_lowres" to "Index_UV", "U_lowres" to "U_estimated", and "V_lowres" to "V_estimated", the number of channels is unchanged). But at the same time:

at line 26:

```python
model.ConvTranspose(
    blob_in, 'AnnIndex_lowres' + pref, dim, 15,
    cfg.BODY_UV_RCNN.DECONV_KERNEL,
    pad=int(cfg.BODY_UV_RCNN.DECONV_KERNEL / 2 - 1),
    stride=2,
    weight_init=(cfg.BODY_UV_RCNN.CONV_INIT, {'std': 0.001}),
    bias_init=('ConstantFill', {'value': 0.}))
```

at line 46:

```python
blob_Ann_Index = model.BilinearInterpolation(
    'AnnIndex_lowres' + pref, 'AnnIndex' + pref,
    cfg.BODY_UV_RCNN.NUM_PATCHES + 1,
    cfg.BODY_UV_RCNN.NUM_PATCHES + 1,
    cfg.BODY_UV_RCNN.UP_SCALE)
```

So, I have two more questions:

  3. The docs of detector.BilinearInterpolation (detector.py, lines 330-334) state that the number of input channels equals the number of output channels, but at the same time the input blob "AnnIndex_lowres" has 15 channels while the output blob "AnnIndex" has 25 channels. How is this possible? I am not familiar with Caffe2, but BilinearInterpolation in this project is implemented as a ConvTranspose layer with fixed weights.
  4. Why must the number of output channels of "AnnIndex" equal cfg.BODY_UV_RCNN.NUM_PATCHES+1 (the DensePose-COCO dataset has 14 semantic classes for masks)?

I also attach the part of the log in which this change of channel count is visible. The log was produced by running python2 tools/train_net.py --cfg configs/DensePose_ResNet50_FPN_single_GPU.yaml OUTPUT_DIR /tmp/detectron-output.

```
INFO net.py: 241: body_conv_fcn8 : (3, 512, 14, 14) => AnnIndex_lowres : (3, 15, 28, 28) ------- (op: ConvTranspose)
INFO net.py: 241: body_conv_fcn8 : (3, 512, 14, 14) => Index_UV_lowres : (3, 25, 28, 28) ------- (op: ConvTranspose)
INFO net.py: 241: body_conv_fcn8 : (3, 512, 14, 14) => U_lowres : (3, 25, 28, 28) ------- (op: ConvTranspose)
INFO net.py: 241: body_conv_fcn8 : (3, 512, 14, 14) => V_lowres : (3, 25, 28, 28) ------- (op: ConvTranspose)
INFO net.py: 241: AnnIndex_lowres : (3, 15, 28, 28) => AnnIndex : (3, 25, 56, 56) ------- (op: ConvTranspose)
INFO net.py: 241: Index_UV_lowres : (3, 25, 28, 28) => Index_UV : (3, 25, 56, 56) ------- (op: ConvTranspose)
INFO net.py: 241: U_lowres : (3, 25, 28, 28) => U_estimated : (3, 25, 56, 56) ------- (op: ConvTranspose)
INFO net.py: 241: V_lowres : (3, 25, 28, 28) => V_estimated : (3, 25, 56, 56) ------- (op: ConvTranspose)
```

Thank you for your time, and I hope to hear from you soon!

Johnqczhang commented 5 years ago

@vlad-filin Hi, @penincillin and I had a discussion in #200 about questions similar to yours. Specifically,

  1. Yes. In line 26 of body_uv_rcnn_heads.py, 15 indicates the number of semantically meaningful body parts used to sample points for annotators, as mentioned in the CVPR 2018 paper. To make this clearer, I made it a configurable parameter named BODY_UV_RCNN.NUM_SEMANTIC_PARTS (analogous to BODY_UV_RCNN.NUM_PATCHES) in config.py.

  2. I have verified that the maximum number of annotated points within a person bounding box is 184, not 196, across all dataset splits (train, valminusminival, minival). In my opinion, 196 is instead the number of pixels in the feature map output by RoIAlign, which is fed into the "body_uv_rcnn" head network. Two pieces of evidence support this: the feature map size is exactly 14 x 14 (it was 7 x 7 in Faster R-CNN and Mask R-CNN), and the function add_body_uv_rcnn_blobs() constructs target blobs of body-UV supervision with shape (num_fg_rois, 196) for a minibatch. See also pool_points_interp.cu, the GPU implementation of the PoolPointsInterp operator, which bilinearly interpolates points from the estimated heatmaps (Index_UV, U_estimated, and V_estimated) in body_uv_rcnn_heads.py before the SoftmaxLoss for patch-index classification and the SmoothL1Loss for UV-coordinate regression are computed. (A sketch of this interpretation follows after this list.)

  3. My last comment in #200 may give you a hint on your third question.

  4. So far, I am not entirely sure why the number of output channels of AnnIndex must equal BODY_UV_RCNN.NUM_PATCHES + 1. Since its output is used to compute a SpatialSoftmaxLoss, I would guess that the number of output channels could also be set to BODY_UV_RCNN.NUM_SEMANTIC_PARTS + 1. However, I did find evidence for the current design: in DensePose inference there is a post-processing step that multiplies (element-wise) the estimated patch-index heatmaps Index_UV with binary dense masks computed from AnnIndex, which requires the tensor shapes of these two outputs to be equal. (Both this constraint and point 2 are illustrated in the sketch after this list.)
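To pin down points 2 and 4 numerically, here is a minimal numpy sketch. The shapes are taken from the log quoted earlier in this thread; the masking step only illustrates the shape constraint and is not the repo's exact inference code:

```python
import numpy as np

NUM_PATCHES = 24     # 24 surface patches; +1 channel for background
HEATMAP_SIZE = 14    # RoIAlign output resolution of the body-UV head

# Point 2: the trailing 196 in shape=(-1, NUM_PATCHES + 1, 196) is just the
# flattened 14 x 14 spatial grid, not a count of annotated points.
heatmaps = np.random.rand(3, NUM_PATCHES + 1, HEATMAP_SIZE, HEATMAP_SIZE)
flat = heatmaps.reshape(-1, NUM_PATCHES + 1, HEATMAP_SIZE * HEATMAP_SIZE)
assert flat.shape == (3, 25, 196)

# Point 4: post-processing multiplies Index_UV element-wise with binary masks
# derived from AnnIndex, which only works if the two tensors have equal shapes.
ann_index = np.random.rand(3, NUM_PATCHES + 1, 56, 56)   # after 2x upsampling
index_uv = np.random.rand(3, NUM_PATCHES + 1, 56, 56)
binary_masks = (ann_index > 0.5).astype(index_uv.dtype)  # illustrative threshold
masked_index_uv = index_uv * binary_masks                # element-wise product
assert masked_index_uv.shape == index_uv.shape
```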

Last but not least, I think the authors left a lot of experimental code in this released version, which inevitably makes some related and important implementation details hard to understand. I therefore took a few days to study the original code in this repo and some of the API functions it depends on from Detectron and Caffe2. With that deeper understanding, I refined the related code and fixed some minor bugs concerning this issue and other issues (#191, #194, #200, #202, #203, #206, #211) in my repo. I haven't finished training the baseline model with the refined version yet, so I will post more updates in a new issue or open a PR once my modifications are fully verified.

Johnqczhang commented 5 years ago


All modifications and refinements can be seen in PR #215. Here is a comparison of results on the densepose_coco_minival dataset for the baseline model (ResNet50_FPN_s1x) before and after the modification. (I have only 2 GPUs, so some results are slightly lower than those reported by the author here):

lizenan commented 5 years ago

Hi @Johnqczhang, I think there is a problem with AnnIndex_lowres and AnnIndex. The output dim of AnnIndex_lowres is 15. However, if you check the code:

```python
blob_Ann_Index = model.BilinearInterpolation(
    'AnnIndex_lowres' + pref, 'AnnIndex' + pref,
    cfg.BODY_UV_RCNN.NUM_PATCHES + 1,
    cfg.BODY_UV_RCNN.NUM_PATCHES + 1,
    cfg.BODY_UV_RCNN.UP_SCALE)
```

cfg.BODY_UV_RCNN.NUM_PATCHES+1 equals 25, so for AnnIndex the input dim and output dim are both 25. Moreover, if you check the pretrained weights file, the shape of AnnIndex's weight is (25, 25, 4, 4). It therefore does not seem possible to use AnnIndex_lowres directly as AnnIndex's input, because AnnIndex_lowres has only 15 output channels.

Johnqczhang commented 5 years ago

Hi @lizenan, sorry for the late reply. As I discussed with @penincillin in #200 (see my last comment there), the input dim and output dim of AnnIndex must be the same under the current implementation of BilinearInterpolation, which is essentially a ConvTranspose layer with fixed weights: its output dim is determined by the number of kernels, not by the channel count of the input blob (here, AnnIndex_lowres). So as long as you specify the same input dim and output dim for AnnIndex, as in the following code, you can use any number for the output dim of AnnIndex_lowres; the only effect is that the values in the extra channels are all zeros.

```python
blob_Ann_Index = model.BilinearInterpolation(
    'AnnIndex_lowres' + pref, 'AnnIndex' + pref,
    cfg.BODY_UV_RCNN.NUM_PATCHES + 1,
    cfg.BODY_UV_RCNN.NUM_PATCHES + 1,
    cfg.BODY_UV_RCNN.UP_SCALE)
```
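To make this mechanism concrete, here is a small self-contained sketch. It uses PyTorch as a stand-in for Caffe2's ConvTranspose, and the weight construction mirrors the common FCN-style bilinear initialization rather than the exact Detectron code; the shapes match the ones discussed above:

```python
import torch
import torch.nn as nn

def bilinear_upsample_weight(channels, k):
    """FCN-style fixed bilinear kernel of shape (channels, channels, k, k)."""
    factor = (k + 1) // 2
    center = factor - 1 if k % 2 == 1 else factor - 0.5
    pos = torch.arange(k, dtype=torch.float32)
    filt1d = 1 - (pos - center).abs() / factor
    filt = filt1d[:, None] * filt1d[None, :]   # separable 2-D bilinear filter
    w = torch.zeros(channels, channels, k, k)
    for c in range(channels):
        w[c, c] = filt  # diagonal: output channel c reads only input channel c
    return w

# kernel 4, stride 2, pad 1 matches pad=int(DECONV_KERNEL/2 - 1) in the repo
up = nn.ConvTranspose2d(25, 25, kernel_size=4, stride=2, padding=1, bias=False)
with torch.no_grad():
    up.weight.copy_(bilinear_upsample_weight(25, 4))  # weight shape (25, 25, 4, 4)

x = torch.zeros(1, 25, 28, 28)
x[:, :15] = torch.rand(1, 15, 28, 28)   # only the first 15 channels carry signal
with torch.no_grad():
    y = up(x)                            # (1, 25, 56, 56)
print(y.shape, y[:, 15:].abs().max())    # channels 15..24 stay exactly zero
```

Because the kernel is diagonal across channels, the output channel count is fixed by the kernel (25 here) regardless of how many input channels actually carry signal, which is exactly the behavior described above.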

For AnnIndex_lowres, I still can't find any clear explanation of why its output dim is 15 rather than 25. So I changed its output dim from 15 to 25 to match the input dim of AnnIndex, so that all channels now have a response. After training ResNet50_FPN_s1x models with the same hyperparameter settings as the author but with the different output dim of AnnIndex_lowres, here are the results I got on the test set (densepose_coco_minival2014):

| Model | AP | AP50 | AP75 | APm | APl |
| --- | --- | --- | --- | --- | --- |
| model in MODEL_ZOO (out_dim=15) | 0.4748 | 0.8368 | 0.4820 | 0.4262 | 0.4948 |
| my reproduced model (out_dim=15) | 0.4717 | 0.8377 | 0.4821 | 0.4044 | 0.4928 |
| my reproduced model (out_dim=25) | 0.4764 | 0.8422 | 0.4928 | 0.4207 | 0.4950 |
jayantsharma commented 4 years ago

The index 15 is there because they use 14 segmented body parts (plus a background class) to collect annotations, before any surface correspondence is imposed. This output is supervised with dp_masks (the segmentation masks from annotation stage 1) and serves as an auxiliary loss during training. More details here:

https://github.com/facebookresearch/DensePose/blob/master/notebooks/DensePose-COCO-Visualize.ipynb
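For reference, the visualization notebook linked above reconstructs those stage-1 part masks roughly as follows. This is a sketch assuming pycocotools is installed; dp_masks holds up to 14 RLE-encoded 256x256 part masks per annotated person:

```python
import numpy as np
from pycocotools import mask as mask_util

def get_densepose_mask(dp_masks):
    """Combine the 14 per-part RLE masks into one 256x256 label map,
    with 0 = background and 1..14 = body part index (hence 15 classes)."""
    label_map = np.zeros((256, 256), dtype=np.uint8)
    for part_id in range(1, 15):
        rle = dp_masks[part_id - 1]
        if rle:  # parts not visible in this instance are left empty
            label_map[mask_util.decode(rle) > 0] = part_id
    return label_map
```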