Closed: xuchengggg closed this issue 5 years ago
See #19.
It's a bug; I forgot to change the operation for h to a division. I've fixed this for v1.1, but for backward compatibility I've left it in for now. You can fix it by changing that line to `h = scale / ar / cfg.max_size`.
`use_pixel_scales` is supposed to just use raw pixel dimensions instead of that scale thing in the comment you quoted. When it's on, a scale of 24 means 24x24 pixels. If it were off, 24 for P3 would indeed mean 191x191 (but obviously don't turn it off, because that doesn't make any sense).
So the scales [24, 48, 96, 192, 384] just indicate the size of the anchors in the original image. For example, for P3, if the ratio is 1, then the anchor would be 24x24 px relative to the original image, and the purpose of `h = scale / ar / cfg.max_size` is normalization. Is this understanding correct?
In addition, why does your code use ReLU as the activation function for the mask coefficient prediction instead of the tanh mentioned in the paper?
Ignore all the relative stuff if using `use_pixel_scales`; it doesn't apply. No matter the size of the image, if `use_pixel_scales` is on, then P3 will get 24x24 px anchors. If it's off, then yes, everything's relative to the image size (i.e., a bigger image will have bigger anchors), but I don't plan to turn that setting off anymore.
`scale / ar` gets the height in pixels for the given anchor (so a post-sqrt ar of 2 and a 24x24 base anchor would be 48x12 px). The `/ cfg.max_size` just makes it relative to the size of the image, because that's what the prediction heads output. Note that because of this, if the image is bigger, the relative size is actually smaller. Hmm, now that I think about it, maybe that behavior is not so desirable after all? I might look into this further.
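The sizing logic described above can be sketched as a small function (the function name and default `max_size` are illustrative, not YOLACT's exact code; this assumes `use_pixel_scales` is on):

```python
import math

def anchor_size(scale, aspect_ratio, max_size=550):
    """Return one anchor's (w, h), relative to the image size.

    `scale` is the anchor's base side in pixels (e.g. 24 for P3) and
    `aspect_ratio` is the configured ratio before the sqrt is applied.
    """
    ar = math.sqrt(aspect_ratio)   # post-sqrt aspect ratio
    w = scale * ar / max_size      # width, normalized by image size
    h = scale / ar / max_size      # the corrected line from this thread
    return w, h

# e.g. a post-sqrt ar of 2 (configured ratio 4) on a 24x24 base anchor
# gives 48/550 and 12/550, i.e. a 48x12 px anchor on a 550x550 image
w, h = anchor_size(24, 4)
```

With ratio 1 the anchor stays square (24x24 px for P3), matching the discussion above.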
The activation is tanh or else mask coefficients wouldn't work at all (they need negatives). Where do you see ReLU?
Thank you for your quick reply, which helps me understand this better. I checked the activation part again and I had it wrong; I'm very embarrassed to have taken up your time.
About the head architecture, is this change made in the code?
-----------------------> class
x-> conv-> conv-> box
-----------------------> mask

to

-------------> conv-> class
x-> conv-> conv-> box
-------------> conv-> mask

(The horizontal lines in front are just for alignment.)
You mean you want to add extra layers before the class and mask convs? The current head net looks like this:
+---> class
|
x--> conv--> box
|
+---> mask
where class, box, and mask are convs of their own.
If you want to add extra layers, check out the `extra_layers` config setting. For your second diagram that would be `(2, 1, 1)`.
Sorry, I didn't say it clearly. I mean that in PredictionModule, the feature maps (P3, P4, P5, ...) first go through `src.upfeature`, which is a conv with kernel size (3,3); then, in yolact.py line 209, they pass through 3 convolutions separately and finally go through a conv each to get the class, box, and mask, which is like the second architecture I mentioned in my last question. Or maybe my understanding of the code is wrong; I'm really unfamiliar with PyTorch. Thank you very much for answering my questions all this time.
Ah yes, so that's what I meant by "where class, box, and mask are convs of their own". Maybe it's clearer if I just write it like:
+---> conv--> class
|
x--> conv--> conv--> box
|
+---> conv--> mask
There's one shared conv and then 1 conv for each by default (so it's like you said).
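That default head shape (one shared conv, then one conv per branch) can be sketched as a minimal PyTorch module. This is an illustrative sketch, not YOLACT's actual PredictionModule: the channel counts, `num_anchors`, `num_classes`, and `mask_dim` values are assumptions, and the real module also reshapes its outputs per anchor.

```python
import torch
import torch.nn as nn

class HeadSketch(nn.Module):
    """Shared conv, then one conv per branch (box / class / mask)."""

    def __init__(self, in_channels=256, num_anchors=3, num_classes=81, mask_dim=32):
        super().__init__()
        # the shared conv every branch passes through first
        self.upfeature = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        # one conv per branch, each producing per-anchor outputs
        self.bbox_layer = nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)
        self.conf_layer = nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1)
        self.mask_layer = nn.Conv2d(in_channels, num_anchors * mask_dim, 3, padding=1)

    def forward(self, x):
        x = torch.relu(self.upfeature(x))
        # tanh on mask coefficients, since they need negative values
        return self.bbox_layer(x), self.conf_layer(x), torch.tanh(self.mask_layer(x))
```

Feeding a P3-sized map, e.g. `HeadSketch()(torch.zeros(1, 256, 69, 69))`, yields box, class, and mask tensors with 12, 243, and 96 channels respectively under these assumed sizes.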
Yes. So why change the architecture from the one in the paper to this one? Does it have better performance? As for time spent, this one should be more time consuming, right?
That is the one in the paper. Each 3D rectangle in the paper is its own conv layer (except for the first layer coming from the backbone). In the figure, the first layer is `x`, the second layer is that first conv, and then the last 3 layers are one conv each for box, class, and mask.
In your code, x is the FPN feature maps (P3, P4, ...); then you use upfeature, `x = src.upfeature(x)`, then `bbox_x = src.bbox_extra(x)`, `conf_x = src.conf_extra(x)`, `mask_x = src.mask_extra(x)`, and then `bbox = src.bbox_layer(bbox_x)`, ...
+---> conv--> conv(class)
|
x--> conv--> conv--> conv(box)
|
+---> conv--> conv(mask)
But in your paper there is no detailed explanation; just looking at the picture, I would think the architecture is like:
+--> conv(class)
|
x-->conv--> conv--> conv(box)
|
+--> conv(mask)
I think you're missing the fact that `conf_extra`, `bbox_extra`, and `mask_extra` are all the identity (`lambda x: x`); i.e., they do nothing because I have `extra_layers` set to `(0, 0, 0)`.
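A small sketch of how a per-branch `extra_layers` count could be turned into either an identity or a stack of convs; the helper name `make_extra` and the conv/ReLU layout are assumptions for illustration, not YOLACT's exact code:

```python
import torch.nn as nn

def make_extra(num_layers, channels=256):
    """Build one branch's extra tower; 0 layers means a no-op identity."""
    if num_layers == 0:
        return lambda x: x  # identity: with extra_layers = (0, 0, 0), every branch skips this
    layers = []
    for _ in range(num_layers):
        # channel-preserving 3x3 convs so the tower can stack freely
        layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)
```

So `(0, 0, 0)` gives three identities (the situation in this thread), while something like `(2, 1, 1)` would give real conv towers per branch.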
0.0 Right, I forgot to check the parameter `cfg.extra_layers`; really embarrassed to waste your time. Thanks a lot. I am trying to rewrite the code in TensorFlow. Thank you very much for your help.
No problem, it's good to have this kind of discussion because I could have a bug somewhere (in fact, square anchors were one such bug). In the paper I wrote what I thought I implemented, but I could have implemented something different by accident. Think of it as code peer review :^ )
In the make_priors part, I don't clearly understand the meaning of `backbone.use_pixel_scales`. In this case w will be equal to h; does this mean the anchors are all square and the aspect ratios just change the size of the anchors? And your comment says: "If this layer has convouts of size 30x30 for an image of size 600x600, the 'default' (scale of 1) for this layer would produce bounding boxes with an area of 20x20px. If the scale is .5 on the other hand, this layer would consider bounding boxes with area 10x10px, etc." So for P3, the size of the feature map is 69x69 with a scale of 24; if the size of the image is 550x550, the area of a bounding box in this layer would be 191x191 px?