Open FaltingsA opened 8 months ago
Thanks for the great work! I'm concerned about the computational cost. How much will CogCom increase training costs and inference time?
Hi, thanks for your interest! Compared to VLMs trained on single-image input, each CoM chain may consist of multiple turns of image-text pairs, which could linearly increase training and inference time. We have restricted the maximum number of turns to <= 3 in the data processor. In fact, many CoM chains can reach the answer by re-inputting the image after a single CropZoomIn manipulation on the original image.
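To illustrate the turn cap described above, here is a minimal hypothetical sketch (not the actual CogCom code): a CoM chain is modeled as a list of image-text turns, and the data processor truncates it to at most three turns, so the per-sample cost grows at most linearly up to that limit.

```python
# Hypothetical sketch of the turn cap: a CoM chain is a sequence of
# image-text turns, truncated so cost grows at most linearly up to MAX_TURNS.
MAX_TURNS = 3  # cap mentioned in the reply above

def truncate_chain(chain):
    """Keep at most MAX_TURNS image-text turns of a CoM chain."""
    return chain[:MAX_TURNS]

# A chain that zooms in once and then answers needs only two turns;
# longer chains are cut at the cap.
chain = [
    {"image": "original.png", "text": "Question about a small region"},
    {"image": "zoomed.png", "text": "Result of CropZoomIn"},
    {"image": "zoomed.png", "text": "Final answer"},
    {"image": "extra.png", "text": "Turn beyond the cap"},
]
print(len(truncate_chain(chain)))  # 3
```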