THUDM / CogCoM

How much does CogCoM increase training cost and inference time? #11

Open FaltingsA opened 8 months ago

FaltingsA commented 8 months ago

Thanks for the great work! I am concerned about the computation cost. How much does CogCoM increase training cost and inference time?

qijimrc commented 8 months ago

Hi, thanks for your interest! Compared to VLMs trained on single-image input, each CoM chain may consist of multiple turns of image-text pairs, so training and inference time can grow linearly with the number of turns. We have restricted the maximum number of turns to <= 3 in the data processor. In practice, many CoM chains reach the answer by re-inputting the image after a single CropZoomIn manipulation on the original image.
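To make the linear-scaling point concrete, here is a rough back-of-the-envelope sketch. It is not taken from the CogCoM code base: the names `MAX_TURNS`, `relative_cost`, and `mean_relative_cost` are hypothetical, and it assumes every turn costs about the same as one single-image pass, which ignores any extra attention cost from longer sequences.

```python
MAX_TURNS = 3  # cap mentioned in the reply above; the constant name is hypothetical


def relative_cost(num_turns: int, max_turns: int = MAX_TURNS) -> int:
    """Cost of one CoM chain relative to a single-image pass.

    Assumes cost is linear in the number of image-text turns and that
    chains are truncated to at most `max_turns` turns.
    """
    if num_turns < 1:
        raise ValueError("a chain needs at least one turn")
    return min(num_turns, max_turns)


def mean_relative_cost(turn_counts: list[int]) -> float:
    """Average relative cost over a collection of chains."""
    if not turn_counts:
        raise ValueError("turn_counts must not be empty")
    costs = [relative_cost(t) for t in turn_counts]
    return sum(costs) / len(costs)
```

Under these assumptions the overhead per sample is bounded at 3x. A chain that re-inputs the image once after a single CropZoomIn would presumably count as two turns, so a dataset dominated by such chains would sit near 2x. The real figure depends on the actual turn distribution in the data, which this thread does not report.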