Hello,
The relevant code is in the lines linked below.
- The resolution is limited by the computational burden, so we need to set the max / min sizes according to the GPU type: https://github.com/XPixelGroup/DepictQA/blob/main/src/model/clip/clip.py#L120
- We achieve adjustable resolution by interpolating the position embedding: https://github.com/XPixelGroup/DepictQA/blob/main/src/model/clip/model_clip.py#L248
- "resize" is the maximum size of the short edge. If the image is larger, it is resized down while keeping the aspect ratio.
- "max_size" is the maximum size of the long edge. If the image is larger, it is resized down while keeping the aspect ratio.
- "min_size" is the minimum size of the image. If the image is smaller, it is padded while keeping the aspect ratio. (A rough sketch of these rules, and of the position-embedding interpolation, follows this list.)
In the currently released model, "resize" can be set to 300+ on an A6000 GPU and to 600+ on an A100 GPU. "max_size" is usually set to 2x "resize".
The images in our constructed datasets fall within this resolution range, so their resolutions are all retained.
In our next release, on a 4090 GPU, "resize" can be set to 1024 and "max_size" to 2048.
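As a rough illustration of why the GPU type matters (assuming a CLIP ViT backbone with 14x14 patches; check the actual patch size of your checkpoint), the number of vision tokens grows quadratically with these limits:

```python
# Back-of-the-envelope vision-token count for different "resize" settings,
# using the rule of thumb above: max_size = 2x resize. Patch size 14 is assumed.
def num_vision_tokens(short_edge: int, long_edge: int, patch: int = 14) -> int:
    return (short_edge // patch) * (long_edge // patch)

for resize in (336, 448, 672, 1024):
    max_size = 2 * resize
    print(resize, max_size, num_vision_tokens(resize, max_size))
# 336 -> ~1.1k tokens, 672 -> ~4.6k tokens, 1024 -> ~10k tokens
```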
Thanks for the detailed explanation. So, if I have a batch of images with resolutions greater than 1024x1024 and I want to train at an input resolution above 1024x1024, I should set "resize" to at least 1024 and set "max_size" to some larger value based on the actual images. Is that correct?
Yes. But currently it suffers from a CUDA out-of-memory error, since 1024 is too large.
You would need to add a simple vision abstractor (we have not released that code yet, but it is similar to the one used in https://github.com/Q-Future/Co-Instruct) to reduce the number of vision tokens.
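For readers wondering what such a vision abstractor might look like: below is a minimal, hypothetical sketch (DepictQA's own abstractor was not released at the time of this comment, so this only mirrors the query-based resampler idea used in Co-Instruct-style models), where a fixed number of learnable queries cross-attend to the vision tokens so the LLM always receives the same small number of tokens regardless of input resolution:

```python
import torch
import torch.nn as nn

class VisionAbstractor(nn.Module):
    """Query-based resampler sketch: compress N vision tokens to num_queries."""
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (B, N, dim), where N grows with image resolution.
        q = self.norm_q(self.queries.expand(vision_tokens.size(0), -1, -1))
        kv = self.norm_kv(vision_tokens)
        out, _ = self.attn(q, kv, kv)  # (B, num_queries, dim)
        return out

# Example: a 1024x2048 input with 14x14 patches yields ~10k vision tokens;
# the abstractor compresses them to 64 before they are fed to the LLM.
```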
Now I understand. Thank you very much for your detailed explanation : )
Hello,
I am glad to tell you that the vision abstractor has been incorporated into our code, and the pre-trained model has also been released.
The default maximum resolution is now 1024 x 2048.
Hi, as mentioned in the text, the resolution of the images is retained during training. Could you please point out which part of the code achieves this?