IIGROUP / MANIQA

[CVPRW 2022] MANIQA: Multi-dimension Attention Network for No-Reference Image Quality Assessment
Apache License 2.0
307 stars 36 forks

KonIQ-10k dataloader resizes to (224, 224) and then applies a transform with a random crop? Why? #34

Open KarenAssaraf opened 1 year ago

KarenAssaraf commented 1 year ago

Hey! I see in this line: https://github.com/IIGROUP/MANIQA/blob/b286649f0d7656a0a3e8e9b0ff092281b2ce27bb/train_maniqa.py#L248 that when training on KonIQ-10k, each image is first resized to (224, 224). Then you apply a transform function that contains a random crop to size (224, 224). Unless I'm missing something, why does the original image have to be resized at all?

Thanks!

Stephen0808 commented 1 year ago

We use a vision transformer as our feature extractor, which means the input images should be resized to a fixed size (224x224).
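The fixed-size constraint comes from the patch embedding: a plain ViT splits the image into a fixed number of patches and learns one positional embedding per patch, so the patch count is baked into the model. A minimal sketch of the arithmetic, assuming a standard 16x16 patch size (MANIQA's actual backbone configuration may differ):

```python
# Why a plain ViT expects a fixed input size: the number of patches,
# and hence the length of the learned positional-embedding table,
# is fixed at model construction time.
# Patch size 16 is an assumption (standard ViT-B/16).

def num_patches(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping patches a ViT splits a square image into."""
    assert image_size % patch_size == 0, "image must divide evenly into patches"
    return (image_size // patch_size) ** 2

# A 224x224 input yields 14 x 14 = 196 patches; any other resolution
# would produce a different patch count and no longer match the
# positional embeddings.
```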

KarenAssaraf commented 1 year ago

Hi @Stephen0808! Thanks for the fast answer. But then what is the effect of the random crop in the dataloader's transform function?

Also, it means all KonIQ-10k images, which are initially full resolution, are resized to (224, 224), so we lose the quality information carried by the full-resolution image. (IQA transformers usually try to avoid resizing and instead leverage the transformer architecture to accept different input sizes.) Wouldn't performance be better if we first took random crops of the KonIQ-10k images and then sent those to the ViT? Say we make sure at least 20 crops per image are in the dataset, with each crop inheriting the label of the full-resolution image during training. That would mean training on size(koniq10k) * num_of_crops images.

What do you think?
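The crop-first scheme proposed above can be sketched as follows. This is purely illustrative of the commenter's idea, not MANIQA's API: each full-resolution image contributes `num_crops` random 224x224 crop boxes, all sharing the image-level quality label, and no resizing occurs.

```python
import random

def expand_with_crops(dataset, num_crops=20, crop=224, seed=0):
    """Sketch of the proposed crop-first training set.

    `dataset` is a list of (width, height, label) tuples; the names
    are hypothetical. Each image yields `num_crops` random crop boxes
    of size `crop` x `crop`, each paired with the full-image label.
    """
    rng = random.Random(seed)
    samples = []
    for w, h, label in dataset:
        for _ in range(num_crops):
            left = rng.randint(0, w - crop)
            top = rng.randint(0, h - crop)
            # (crop box, full-image label): no resizing, so the original
            # pixel statistics are preserved inside each crop
            samples.append(((left, top, left + crop, top + crop), label))
    return samples

# The training set size becomes len(dataset) * num_crops.
```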

Stephen0808 commented 1 year ago

As mentioned in your question, at inference we crop several 224x224 patches and average their scores to get the final score.
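The averaging step described here is straightforward; a minimal sketch, where `crop_scores` stands in for the model's per-crop predictions on one image:

```python
def predict_quality(crop_scores):
    """Average per-crop predictions into a single image-level score,
    as described for inference (e.g. scores from several random
    224x224 crops of the same image)."""
    return sum(crop_scores) / len(crop_scores)
```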

KarenAssaraf commented 1 year ago

I mean: the question is, why not use the same process for both training and inference?

Stephen0808 commented 1 year ago

In both the training and inference phases, we use cropped images.

KarenAssaraf commented 1 year ago

Ok, maybe I misunderstood something. From my understanding, training does:

  1. a resize (so it's not a crop, and it affects image quality) from the initial resolution to (224, 224);
  2. then a crop from (224, 224) to (224, 224), which does nothing since the input is already (224, 224). Correct?

So I was wondering why, instead of step 1, we don't take several crops and send those crops to the ViT.
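The no-op in step 2 can be checked directly. Below is a minimal pure-Python stand-in for a padding-free random crop (mimicking the behavior of torchvision's `RandomCrop` without padding, not its actual implementation): when the input already matches the target size, the only valid offset is (0, 0), so the crop returns the image unchanged.

```python
import random

def random_crop(img, size):
    """Minimal stand-in for a random crop without padding.
    `img` is a list of rows (H x W)."""
    h, w = len(img), len(img[0])
    top = random.randint(0, h - size)   # only 0 is possible when h == size
    left = random.randint(0, w - size)  # likewise when w == size
    return [row[left:left + size] for row in img[top:top + size]]

# After resizing to 224x224, a 224x224 random crop can only pick
# offset (0, 0), so step 2 returns the resized image unchanged.
```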