KarenAssaraf opened this issue 1 year ago
We select a vision transformer as our feature extractor, which means the input images should be resized to a fixed image size (224x224).
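A minimal sketch of why the input size is fixed, assuming a standard timm ViT backbone (the model name below is an example, not necessarily the exact backbone this repo uses): the learned positional embeddings tie the model to one input resolution.

```python
import timm
import torch

# Example timm ViT; its learned positional embeddings expect 224x224 inputs.
vit = timm.create_model("vit_base_patch16_224", pretrained=False)

x = torch.randn(1, 3, 224, 224)   # the fixed input size the backbone expects
tokens = vit.forward_features(x)  # patch-token features for a downstream IQA head
print(tokens.shape)               # e.g. (1, 197, 768): 196 patch tokens + CLS token
```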
Hi @Stephen0808! Thanks for your fast answer. But then what is the effect of the random crop in the transform function of the dataloader?
Also, it means all KonIQ-10k images, which are initially full resolution, are resized to (224, 224). We lose the quality information of the full-resolution image. (Usually IQA transformers try to avoid resizing and leverage the transformer architecture to accept different input sizes.) Wouldn't performance be better if we first random-cropped the KonIQ-10k images and then sent the crops to the ViT? Say we make sure at least 20 crops are in the dataset for each image, and each crop inherits the label of the full-resolution image for training. That would mean training on size(koniq10k) * num_of_crops images; see the sketch after this message.
What do you think?
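A sketch of the proposal, with all names hypothetical (this is not the repo's dataset class): each full-resolution image yields `num_crops` random 224x224 crops, and every crop inherits the image-level MOS label.

```python
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class MultiCropIQADataset(Dataset):
    """Hypothetical sketch: N random 224x224 crops per full-resolution image,
    each crop labeled with the MOS of the whole image."""

    def __init__(self, paths, labels, num_crops=20, crop_size=224):
        self.paths, self.labels = paths, labels
        self.num_crops = num_crops
        self.crop = transforms.RandomCrop(crop_size)
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        # Effective dataset size: size(koniq10k) * num_of_crops
        return len(self.paths) * self.num_crops

    def __getitem__(self, idx):
        img_idx = idx // self.num_crops  # map crop index back to its source image
        img = Image.open(self.paths[img_idx]).convert("RGB")
        crop = self.to_tensor(self.crop(img))  # crop at full resolution, no resize
        label = torch.tensor(self.labels[img_idx], dtype=torch.float32)
        return crop, label
```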
As mentioned in your question, we crop several patches (224x224) for inference and average the scores to get the final score.
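In code, that inference scheme might look like the following sketch (the function and its parameters are illustrative assumptions, not the repo's exact API):

```python
import torch
from torchvision import transforms

def predict_score(model, img, num_crops=20, crop_size=224, device="cpu"):
    """Score several random crops of a full-resolution PIL image and
    average the predictions (multi-crop inference sketch)."""
    crop = transforms.RandomCrop(crop_size)
    to_tensor = transforms.ToTensor()
    model.eval()
    with torch.no_grad():
        scores = [
            model(to_tensor(crop(img)).unsqueeze(0).to(device))
            for _ in range(num_crops)
        ]
    return torch.stack(scores).mean().item()
```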
I mean, the question is: why not use the same process for training and inference?
In both the inference and training phases, we used cropped images.
Ok, maybe I misunderstood something. From my understanding, in training there is:
1. a resize of the original image to (224, 224);
2. a random crop to (224, 224);
3. the crop is sent to the ViT.
So I was wondering why, instead of step 1, there are not several crops, which are then sent to the ViT.
Hey! I see in this line: https://github.com/IIGROUP/MANIQA/blob/b286649f0d7656a0a3e8e9b0ff092281b2ce27bb/train_maniqa.py#L248 that when training on KonIQ-10k, each image is first resized to (224, 224). Then you apply a transform function that contains a random crop to size (224, 224). Unless I'm missing something, does the original image have to be resized?
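Concretely, here are torchvision equivalents of what I read at the linked line (illustrative only, not the repo's exact transform utilities):

```python
from torchvision import transforms

pipeline = transforms.Compose([
    transforms.Resize((224, 224)),  # the full-resolution image is resized first
    transforms.RandomCrop(224),     # a 224 crop of a 224x224 image returns it unchanged
    transforms.ToTensor(),
])
```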
Thanks!