Open bluesky314 opened 4 years ago
Hi, yes, the changes you have listed contribute to the better accuracy of our model, except that pyramid pooling is similar to ASPP in the AlphaGAN paper. In the paper we provide measurements for the effect of batch size, GN+WS, and clipping. Comparing model (1) in Table 2 to the rest in Table 4, we see that it matches the state of the art, and with the changes made in models (1-7) we improve accuracy.
Also worth looking at are our strong data augmentation, the use of the ETZH synthesizability dataset as an additional source of backgrounds, the use of multiple scales for each image during training, the RAdam optimizer, and our training schedule. There were many components to this work.
The key takeaway, though, is still that good F, B, and alpha predictions can be obtained simultaneously without any drawbacks.
We will release a clear version of the full code for training once the paper is accepted.
Thank you for your response. I don't think you mentioned the ETZH synthesizability dataset in your paper, but that's a great idea. I thought of something similar, where I was getting the mean color in the trimap region and compositing onto a background of that color to make it harder for my model.
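The background idea above could be sketched roughly as follows. This is only a minimal NumPy illustration, not code from the paper or its repository: the function name `composite_on_mean_color`, the trimap encoding (128 = unknown), and the choice to average the foreground image over the unknown region are all assumptions.

```python
import numpy as np

def composite_on_mean_color(fg, alpha, trimap, unknown_value=128):
    """Composite fg over a flat background whose color is the mean
    foreground color inside the unknown trimap region.

    fg:     float array (H, W, 3) with values in [0, 1]
    alpha:  float array (H, W) with values in [0, 1]
    trimap: uint8 array (H, W); unknown_value marks the unknown band
            (assumed encoding, hypothetical helper)
    """
    unknown = trimap == unknown_value
    if not unknown.any():
        raise ValueError("trimap has no unknown region")
    mean_color = fg[unknown].mean(axis=0)        # shape (3,)
    bg = np.broadcast_to(mean_color, fg.shape)   # flat background image
    a = alpha[..., None]                         # (H, W, 1) for broadcasting
    composite = a * fg + (1.0 - a) * bg          # standard compositing equation
    return composite, mean_color
```

Because the background matches the typical color of the transition band, the composite gives the model few color cues there, which is what makes the example harder.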
For augmentation you have only mentioned "For data augmentation, we adopt random flip, mirroring, gamma, and brightness augmentations" which are standard.
Yes, F, B, and alpha can be predicted simultaneously, but I was surprised to see that it did not yield as much gain for alpha prediction as it did in Context-Aware Image Matting. You achieved SOTA without it, through the training implementation.
Thanks for pointing that out, I will need to add that in the final version of the paper.
Your bg color idea is a nice one but it's a balance between the training examples being too easy and being too difficult.
True, those augmentations are common, but they all have a large effect. I was surprised to see that gamma augmentation worked very well.
Our architecture differs from Context-Aware Matting in that they have a separate foreground prediction branch, which would give higher accuracy, so it's hard to compare directly. Nonetheless, the benefit they gain from foreground prediction is very minor: an MSE of 8.8 instead of 9.0 (models (4) vs. (6)).
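For readers unfamiliar with it, gamma augmentation can be sketched as below. This is a hedged illustration only: the helper names and the gamma range of 0.5 to 2.0 are assumptions, not the values used in the paper.

```python
import numpy as np

def random_gamma(image, rng, low=0.5, high=2.0):
    """Return image ** gamma, with gamma drawn log-uniformly from [low, high].

    image: float array with values in [0, 1].
    The range [low, high] is an illustrative assumption.
    """
    gamma = float(np.exp(rng.uniform(np.log(low), np.log(high))))
    return np.clip(image, 0.0, 1.0) ** gamma, gamma

def random_flip(image, alpha, rng):
    """Horizontally flip image and alpha together with probability 0.5."""
    if rng.random() < 0.5:
        return image[:, ::-1], alpha[:, ::-1]
    return image, alpha

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
alpha = rng.random((8, 8))
image, gamma = random_gamma(image, rng)   # alpha is left untouched by gamma
image, alpha = random_flip(image, alpha, rng)
```

Gamma changes the tone curve of the image without changing the alpha matte, so the same ground truth is seen under different contrast conditions.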
There's much research to be done in the area :)
Thanks. Can you also give me any intuition you have about why batch size = 1 would make such a difference? You compared GN+WS at a larger batch size to BN and isolated batch size as the factor that led to the performance increase at bs=1. I'm a bit confused: isn't a higher batch size with GN+WS just a parallel version of GN+WS with bs=1? How could it make a difference in that case, as the GN/WS parameters are independent of the batch and the gradient is the same gradient as before, just averaged across the samples?
I'm sorry, I don't have a great theoretical explanation for this yet. It is something I have observed empirically with my model and with the Deep Image Matting network. I will point out that a greater batch size is not a parallel version of batch size = 1, because with batch size = 1 the network weights are updated after each sample.
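The premise that GN statistics do not depend on the batch can be checked with a small NumPy sketch. The `group_norm` and `batch_norm` functions here are simplified stand-ins written for this illustration (no affine parameters, training-mode statistics), not the layers used in the model.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """x: (N, C, H, W). Normalize per sample and per channel group."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, -1)
    mean = g.mean(axis=2, keepdims=True)
    var = g.var(axis=2, keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

def batch_norm(x, eps=1e-5):
    """Training-mode batch norm: per-channel statistics over N, H, W."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 5, 5))

# Normalize the whole batch at once, then each sample on its own.
gn_batched = group_norm(x, num_groups=2)
gn_single = np.concatenate([group_norm(x[i:i + 1], 2) for i in range(4)])
bn_batched = batch_norm(x)
bn_single = np.concatenate([batch_norm(x[i:i + 1]) for i in range(4)])

print(np.allclose(gn_batched, gn_single))  # GN: same result either way
print(np.allclose(bn_batched, bn_single))  # BN: depends on the batch
```

So the forward pass of a GN network is indeed the same per sample regardless of batch size; any difference has to come from how the weights are updated.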
It is possible that the alpha matting task varies widely between different foreground objects. So including each of those in a batch has conflicting effects on the gradient.
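The difference between per-sample updates and one averaged batch update can be seen on a toy problem. The sketch below is an assumption-laden illustration (a scalar least-squares model with made-up data and learning rate), not the matting network.

```python
def grad(w, x, y):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x

def sgd_one_by_one(w, samples, lr):
    """Batch size 1: the weight changes before the next sample is seen."""
    for x, y in samples:
        w -= lr * grad(w, x, y)
    return w

def sgd_single_batch(w, samples, lr):
    """One batch: all gradients are taken at the same weight, then averaged."""
    g = sum(grad(w, x, y) for x, y in samples) / len(samples)
    return w - lr * g

samples = [(1.0, 2.0), (2.0, 1.0), (3.0, 6.0)]  # toy (x, y) pairs
w_seq = sgd_one_by_one(0.0, samples, lr=0.1)
w_batch = sgd_single_batch(0.0, samples, lr=0.1)
print(w_seq, w_batch)  # the two weights differ
```

With batch size 1 each gradient is evaluated at a weight already moved by the previous samples, and three steps are taken instead of one, so the two schemes are not equivalent even when the per-sample forward pass is identical.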
I'm confused about the multi-scale training you mentioned above. Does it mean "Additionally, every second mini-batch is sampled from the 2× image of the previous image, so as to increase the invariance to scale", as written in Section 3.6? If so, every second mini-batch is resampled from the upsampled image that was used in the previous mini-batch. "Resample" means randomly cropping patches from the image and regenerating the trimap by "random erosion and dilation of 3 to 25 pixels", as mentioned in your paper.
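To make the "resample" step concrete, here is a rough NumPy sketch of a random crop and a trimap generated by random erosion and dilation. It is not the authors' code: the helper names, the square structuring element, reading "3 to 25 pixels" as a radius, and the 0/128/255 trimap encoding are all assumptions.

```python
import numpy as np

def _dilate(mask, r):
    """Binary dilation with a (2r+1) x (2r+1) square, done separably."""
    out = mask.astype(bool)
    for axis in (0, 1):
        pad = [(0, 0), (0, 0)]
        pad[axis] = (r, r)
        padded = np.pad(out, pad, mode="constant", constant_values=False)
        acc = np.zeros_like(out)
        n = out.shape[axis]
        for s in range(2 * r + 1):
            acc |= np.take(padded, np.arange(s, s + n), axis=axis)
        out = acc
    return out

def _erode(mask, r):
    """Binary erosion as the complement of dilating the complement.
    Pixels outside the image are treated as set, so borders are not eroded."""
    return ~_dilate(~mask.astype(bool), r)

def random_trimap(alpha, rng, low=3, high=25):
    """Trimap from alpha: 255 = foreground, 0 = background, 128 = unknown."""
    r_erode = int(rng.integers(low, high + 1))
    r_dilate = int(rng.integers(low, high + 1))
    sure_fg = _erode(alpha >= 1.0, r_erode)
    maybe_fg = _dilate(alpha > 0.0, r_dilate)
    trimap = np.full(alpha.shape, 128, dtype=np.uint8)
    trimap[sure_fg] = 255
    trimap[~maybe_fg] = 0
    return trimap

def random_crop(image, size, rng):
    """Random size x size crop; the same offsets would be reused for alpha."""
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return image[top:top + size, left:left + size]
```

Under this reading, the second mini-batch would call `random_crop` and `random_trimap` again on the 2× upsampled image and alpha, so both the crop location and the trimap band width change between the two scales.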
In Section 4.1 you say "The laplacian loss proposed by Hou et al. [13] gives a significant reduction in errors across all metrics. We note this network, training and loss configuration is enough to achieve state of the art results, see Table 4."
I am unsure as to what differentiated your model from others and led to quite a substantial new state of the art, as by this point you have not used fg/bg prediction or data augmentation. From what I gather it can be attributed to:
Is there anything I am missing? I am only referring to Ours_alpha in Table 4. Can you comment on why you think this performed so much better than the previous models listed in Table 4?