dbolya / yolact

A simple, fully convolutional model for real-time instance segmentation.
MIT License

scales question #61

Closed xuchengggg closed 5 years ago

xuchengggg commented 5 years ago

In the make_priors part, I don't clearly understand the meaning of backbone.use_pixel_scales. In this case w will be equal to h; does this mean the anchors are all square, and the aspect ratios just change the size of the anchors? Also, your comment says: "If this layer has convouts of size 30x30 for an image of size 600x600, the 'default' (scale of 1) for this layer would produce bounding boxes with an area of 20x20px. If the scale is .5 on the other hand, this layer would consider bounding boxes with area 10x10px, etc." So for P3, the feature map is 69x69 with a scale of 24; if the image is 550x550, would the bounding boxes in this layer be 191x191px?

dbolya commented 5 years ago

See #19.

It's a bug: I forgot to change h to use a division sign. I've fixed this for v1.1, but for backward compatibility I've left it in for now. You can fix it yourself by changing that line to h = scale / ar / cfg.max_size.

use_pixel_scales is supposed to just use raw pixel dimensions instead of that scale thing in the comment you quoted. When it's on, a scale of 24 means 24x24 pixels. If it was off, 24 for P3 would mean 191x191 yeah (but obviously don't turn it off because that doesn't make any sense).
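To make the two interpretations concrete, here is a quick sketch (not the actual yolact code; the function name is made up) of what a scale of 24 works out to under each setting, using the 550x550 image / 69x69 P3 convout numbers from above:

```python
# Sketch of how one square (ar = 1) anchor's side length comes out
# under the two settings. Numbers follow the 550x550 image, 69x69 P3
# convout, scale-24 example discussed above.

def anchor_size(scale, max_size=550, conv_size=69, use_pixel_scales=True):
    """Return the anchor side length in pixels for a square anchor."""
    if use_pixel_scales:
        # scale is interpreted directly as pixels: 24 -> 24x24 px
        return scale
    # otherwise scale is relative to the convout: each of the 69 cells
    # covers 550/69 ~ 7.97 px, so 24 "cells" ~ 191 px
    return scale * max_size / conv_size

print(anchor_size(24))                                  # 24
print(round(anchor_size(24, use_pixel_scales=False)))   # 191
```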

xuchengggg commented 5 years ago

So the scales [24, 48, 96, 192, 384] just indicate the size of the anchors in the original image; for example, for P3, if the ratio is 1, the anchor would be 24x24px relative to the original image, and the purpose of h = scale / ar / cfg.max_size is normalization. Is this understanding correct?

In addition, why is the activation function for the mask coefficient prediction in your code ReLU instead of the tanh mentioned in the paper?

dbolya commented 5 years ago

Ignore all the relative stuff when using use_pixel_scales; it doesn't apply. No matter the size of the image, if use_pixel_scales is on, then P3 will get 24x24 px anchors. If it's off, then yes, everything is relative to the image size (i.e., a bigger image will have bigger anchors), but I don't plan to turn that setting off anymore.

scale / ar gets the height in pixels for the given anchor (so a post-sqrt ar of 2 and a 24x24 base anchor would be 48x12 px). The / cfg.max_size just makes it relative to the size of the image, because that's the format the prediction heads output. Note that because of this, if the image is bigger, the relative size is actually smaller. Hmm, now that I think about it, maybe that behavior is not so desirable after all? I might look into this further.
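As a sketch (function name assumed, not the actual yolact code), the corrected width/height computation looks like this; multiplying width by the post-sqrt ar and dividing height by it keeps the anchor area at scale^2:

```python
# Sketch of the (fixed) width/height computation: with post-sqrt aspect
# ratio ar, width is multiplied and height divided, preserving area.
# Both are then normalized by cfg.max_size as described above.

def anchor_wh(scale, ar, max_size=550):
    w = scale * ar / max_size   # width, relative to the image
    h = scale / ar / max_size   # the corrected line: divide, don't multiply
    return w, h

# post-sqrt ar of 2 on a 24x24 base anchor -> 48 x 12 px
w, h = anchor_wh(24, 2)
print(round(w * 550), round(h * 550))   # 48 12
```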

The activation is tanh, or else the mask coefficients wouldn't work at all (they need to be able to go negative). Where do you see ReLU?

xuchengggg commented 5 years ago

Thank you for your quick reply, which helped me understand this better. I checked the activation part again and I had it wrong; I'm very embarrassed to have taken up your time.

About the head architecture, is this change made in the code?

-----------------------> class
x--> conv--> conv--> box
-----------------------> mask

to

--------------> conv--> class
x--> conv--> conv--> box
--------------> conv--> mask

(The horizontal lines in front are just for alignment.)

dbolya commented 5 years ago

You mean you want to add extra layers before the class and mask convs? The current head net looks like this:

       +---> class
       |
x--> conv--> box
       |
       +---> mask

where class, box, and mask are convs of their own.

If you want to add extra layers, check out the extra_layers config setting. For your second diagram that would be (2, 1, 1).

xuchengggg commented 5 years ago

Sorry, I didn't say it clearly. I mean that in PredictionModule, the feature maps (P3, P4, P5, ...) first go through src.upfeature, which is a conv with kernel size (3,3); then at yolact.py line 209 they pass through 3 convolutions separately; and finally each branch goes through a conv to get the class, box, and mask outputs. That matches the second architecture I mentioned in the last question. Or is my understanding of the code wrong? I am really unfamiliar with PyTorch. Thank you very much for answering my questions all this time.

dbolya commented 5 years ago

Ah yes, so that's what I meant by where class, box, and mask are convs of their own. Maybe it's clearer if I just write it like:

       +---> conv--> class
       |
x--> conv--> conv--> box
       |
       +---> conv--> mask

There's one shared conv and then 1 conv for each by default (so it's like you said).
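In PyTorch terms, that default head could be sketched roughly like this (a minimal sketch with assumed names and sizes, not the actual PredictionModule):

```python
# Minimal sketch of the default head above: one shared conv, then one
# conv per output branch (class, box, mask). Channel counts and anchor
# numbers are illustrative assumptions.
import torch
import torch.nn as nn

class HeadSketch(nn.Module):
    def __init__(self, in_ch=256, num_anchors=3, num_classes=81, mask_dim=32):
        super().__init__()
        # the shared conv ("x--> conv" in the diagram)
        self.upfeature = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True))
        # one conv of its own per branch
        self.bbox_layer = nn.Conv2d(in_ch, num_anchors * 4, 3, padding=1)
        self.conf_layer = nn.Conv2d(in_ch, num_anchors * num_classes, 3, padding=1)
        self.mask_layer = nn.Conv2d(in_ch, num_anchors * mask_dim, 3, padding=1)

    def forward(self, x):
        x = self.upfeature(x)
        return self.bbox_layer(x), self.conf_layer(x), self.mask_layer(x)

# a P3-sized input for a 550x550 image
box, conf, mask = HeadSketch()(torch.randn(1, 256, 69, 69))
```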

xuchengggg commented 5 years ago

Yes. So why change the architecture from the one in the paper to this one? Does it have better performance? And as for the time spent, shouldn't this one be more time-consuming?

dbolya commented 5 years ago

That is the one in the paper. Each 3d rectangle in the paper is its own conv layer (except for the first layer coming from the backbone). In the figure, the first layer is x, the second layer is that first conv, and then the last 3 layers are one conv for each of box, class, and mask.

xuchengggg commented 5 years ago

In your code, x is the FPN feature maps (P3, P4, ...); first upfeature is applied (x = src.upfeature(x)), then bbox_x = src.bbox_extra(x), conf_x = src.conf_extra(x), mask_x = src.mask_extra(x), and then bbox = src.bbox_layer(bbox_x), ...

       +---> conv--> conv(class)
       |
x--> conv--> conv--> conv(box)
       |
       +---> conv--> conv(mask)

but in your paper there is no detailed explanation; just looking at the figure, I would think the architecture is like

               +--> conv(class)
               |
x-->conv--> conv--> conv(box)
               |
               +--> conv(mask)
dbolya commented 5 years ago

I think you're missing the fact that conf_extra, bbox_extra, and mask_extra are all the identity (lambda x: x); i.e., they do nothing because I have extra_layers set to (0, 0, 0).
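A sketch of that identity trick (helper name assumed for illustration, not the actual yolact code): when a branch's extra-layer count is 0, the "extra" stage is literally `lambda x: x`, so nothing sits between upfeature and the final branch convs:

```python
# Sketch: building the *_extra stages from an extra_layers-style tuple.
# With (0, 0, 0), every branch's extra stage is the identity and the
# head collapses to shared conv -> one conv per branch, as discussed.
import torch.nn as nn

def make_extra(num_layers, channels=256):
    if num_layers == 0:
        return lambda x: x              # identity: branch has no extra convs
    layers = []
    for _ in range(num_layers):
        layers += [nn.Conv2d(channels, channels, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

bbox_extra, conf_extra, mask_extra = (make_extra(n) for n in (0, 0, 0))
```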

xuchengggg commented 5 years ago

0.0 right, I forgot to check the parameter cfg.extra_layers; I'm really embarrassed to have wasted your time. Thanks a lot. I am trying to rewrite the code in TensorFlow; thank you very much for your help.

dbolya commented 5 years ago

No problem, it's good to have this kind of discussion, because I could have a bug somewhere (in fact, the square anchors were one such bug). In the paper I wrote what I thought I implemented, but I could have implemented something different by accident. Think of it as code peer review :^)