Open FaltingsA opened 8 months ago
Thanks for the great work! I'm concerned about the computational cost. How much will CogCom increase training costs and inference time?
Hi, thanks for your interest! Compared to VLMs trained on single-image input, each CoM chain may consist of multiple turns of image-text pairs, which could linearly increase training and inference time. We have restricted the maximum number of turns to <= 3 in the data processor. In fact, many CoM chains can reach the answer by re-inputting the image after a single CropZoomIn manipulation on the original image.
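To illustrate the turn cap described above, here is a minimal hypothetical sketch (not the actual CogCom code): a CoM chain is modeled as a list of image-text turns, and the data processor truncates it to at most three turns, so the per-sample cost grows at most linearly up to that limit.

```python
# Hypothetical sketch of the turn cap: a CoM chain is a sequence of
# image-text turns, truncated so cost grows at most linearly up to MAX_TURNS.
MAX_TURNS = 3  # cap mentioned in the reply above

def truncate_chain(chain):
    """Keep at most MAX_TURNS image-text turns of a CoM chain."""
    return chain[:MAX_TURNS]

# A chain that zooms in once and then answers needs only two turns;
# longer chains are cut at the cap.
chain = [
    {"image": "original.png", "text": "Question about a small region"},
    {"image": "zoomed.png", "text": "Result of CropZoomIn"},
    {"image": "zoomed.png", "text": "Final answer"},
    {"image": "extra.png", "text": "Turn beyond the cap"},
]
print(len(truncate_chain(chain)))  # 3
```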