SHI-Labs / OneFormer

OneFormer: One Transformer to Rule Universal Image Segmentation, arxiv 2022 / CVPR 2023
https://praeclarumjj3.github.io/oneformer
MIT License

What are the expected dimension sizes for the outputs dictionary from sem_seg_head? #93

Closed SouLeo closed 8 months ago

SouLeo commented 9 months ago

Hello,

I have been working on adapting your oneformer model to a custom dataset and although I have it training without error, I'm confused about the dimensions of tensors from the outputs dictionary here: https://github.com/SHI-Labs/OneFormer/blob/4962ef6a96ffb76a76771bfa3e8b3587f209752b/oneformer/oneformer_model.py#L280C24-L280C36

My supervised training set has inputs of shape [3, 256, 256] and ground truths/labels of shape [3, 256, 256], but none of the dictionary items, e.g., outputs['pred_logits'] or outputs['pred_masks'], shares the dimensions of the labels.

[edit: image attached showing the shapes of my input image and the corresponding pred_logits and pred_masks after running through the sem_seg_head with the Swin backbone]

My questions are: (1) what would the expected tensor shapes for those objects be, so that I can validate my training is working; and (2) where would I access the actual predictions made by the model that are the same shape as my ground-truth labels, for the purpose of evaluating standard metrics?
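For reference, here is a minimal sketch of how I currently understand the shapes, assuming the MaskFormer-style convention that the sem_seg_head appears to follow (per-query logits carry an extra "no object" class, and the masks come out at a fraction of the input resolution). The concrete sizes below are made up for illustration, not taken from any config:

```python
import torch
import torch.nn.functional as F

# Assumed output convention (MaskFormer-style), with illustrative sizes:
#   outputs["pred_logits"]: [B, num_queries, num_classes + 1]  (+1 = "no object")
#   outputs["pred_masks"]:  [B, num_queries, H/4, W/4]         (low-resolution mask logits)
B, Q, C = 1, 150, 150 + 1
pred_logits = torch.randn(B, Q, C)
pred_masks = torch.randn(B, Q, 64, 64)

# Combine the queries into a per-pixel class score map, as semantic inference seems to do:
cls_prob = F.softmax(pred_logits, dim=-1)[..., :-1]            # drop the "no object" class
mask_prob = pred_masks.sigmoid()
sem_seg = torch.einsum("bqc,bqhw->bchw", cls_prob, mask_prob)  # [B, C, H/4, W/4]

# Upsample to the label resolution before comparing against [H, W] ground truth:
sem_seg = F.interpolate(sem_seg, size=(256, 256), mode="bilinear", align_corners=False)
pred_labels = sem_seg.argmax(dim=1)                            # [B, H, W] integer class ids
```

If that reading is right, the raw dictionary items are never supposed to match the label shape directly; only the combined, upsampled map is.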

SouLeo commented 9 months ago

I re-ran the original OneFormer model on the ADE20K dataset with the Swin backbone, using the config file oneformer_swin_large_bs16_160k.yaml, and I see similar inconsistencies.

[images: printed shapes of images[0], mask_cls_results, and mask_pred_results]

Where I printed the image shape using print(images[0].shape) after this line: https://github.com/SHI-Labs/OneFormer/blob/4962ef6a96ffb76a76771bfa3e8b3587f209752b/oneformer/oneformer_model.py#L274

And the mask cls and mask pred shapes are printed after this line: https://github.com/SHI-Labs/OneFormer/blob/4962ef6a96ffb76a76771bfa3e8b3587f209752b/oneformer/oneformer_model.py#L309

with print(mask_cls_results.shape) and print(mask_pred_results.shape).


My point here is that the pred shapes are not guaranteed to match my input image resolution. So at what point in the model do these shapes match? Where should I begin evaluation?

Also, despite mask_pred_results having different heights and widths than the original input images, the model does seem to produce outputs in the ballpark of the original image size (when run with the ADE20K dataset and config). Meanwhile, with my custom dataset, the outputs are 64 x 64 when I need them to be 256 x 256. Would you expect the model to behave this way, or would you expect that my configuration and custom dataset are wrong?
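If I understand the forward pass correctly, the masks only get resized back to the (padded) input resolution inside forward(), after the lines I printed from, so the 64 x 64 tensors would just be the intermediate, pre-interpolation output. A rough sketch of what I believe happens (the tensor sizes and 150 queries are arbitrary stand-ins):

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: a 256x256 input image and 64x64 mask logits from the sem_seg_head.
image = torch.randn(3, 256, 256)
mask_pred_results = torch.randn(1, 150, 64, 64)

# Upsample the low-resolution mask logits to the input size before inference; if my
# reading of oneformer_model.py is right, something equivalent happens in forward().
mask_pred_results = F.interpolate(
    mask_pred_results,
    size=(image.shape[-2], image.shape[-1]),
    mode="bilinear",
    align_corners=False,
)
print(mask_pred_results.shape)  # torch.Size([1, 150, 256, 256])
```

If that is the case, I would expect evaluation to start only after this interpolation step, not on the raw sem_seg_head outputs.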

SouLeo commented 9 months ago

I'm still scratching my head trying to figure this out. In a previous issue, you suggested extracting the correct masks as follows: https://github.com/SHI-Labs/OneFormer/issues/82#issuecomment-1667913017

specifically with specific_category_mask = (sem_seg == category_id).float()

but using the ADE20K config file, I noticed the output of sem_seg is: [image: printed sem_seg tensor and its unique values]

Looking at the unique values output by this matrix, how could (sem_seg == category_id) possibly be valid? Not only is category_id an int while the values in sem_seg are floats, but the category_ids for ADE20K also range well above 3.
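The only way I can make that suggestion type-check in my head is if sem_seg is first reduced to integer labels. A hedged sketch of what I assume was intended (the shapes and category_id are purely illustrative):

```python
import torch

# Assume sem_seg is the [num_classes, H, W] float score map produced by semantic
# inference, rather than an integer label map (which would explain the float values
# I printed above).
sem_seg = torch.rand(150, 256, 256)
category_id = 12  # illustrative class index

# Reduce the per-class scores to a per-pixel label map; the comparison is then well defined.
pred_labels = sem_seg.argmax(dim=0)                            # [H, W] integer class ids
specific_category_mask = (pred_labels == category_id).float()  # binary mask for that class
print(specific_category_mask.shape, specific_category_mask.unique())
```

If the intended sem_seg in that comment was already an argmaxed label map, then my confusion is just about which tensor the name refers to.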

To reiterate, I am genuinely confused about this model, its outputs, and its expected behavior.

SouLeo commented 9 months ago

After comparing the model input format documentation for Detectron2 with the provided example of a custom dataset mapper for semantic segmentation, I noticed a mismatch.

In Detectron2, sem_seg is expected to be a [H, W] tensor of per-pixel class labels at the ground-truth resolution. However, following your custom mapper class, you use "instance"-like labels, such that gt_masks is [N, H, W].

I have not tested this, but if it turns out to be the issue, I would highly recommend adding additional documentation for it.
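For anyone else hitting this, here is roughly the conversion I believe the custom mapper performs between the two formats (the variable names and ignore value are my assumptions, not taken from the repo):

```python
import torch

# sem_seg_gt: [H, W] integer label map in the Detectron2 "sem_seg" convention.
sem_seg_gt = torch.randint(0, 150, (256, 256))
ignore_label = 255  # assumed ignore value; check your dataset metadata

# Build instance-style targets: one binary mask per class present in the image,
# which is what a [N, H, W] gt_masks tensor would contain.
classes = torch.unique(sem_seg_gt)
classes = classes[classes != ignore_label]
gt_masks = torch.stack([(sem_seg_gt == c) for c in classes]).float()  # [N, H, W]
print(classes.shape, gt_masks.shape)
```

If the custom mapper expects this instance-style format while my dataset provides plain [H, W] label maps, that would explain the mismatch I am seeing.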