About object detection

I assume you feed a token sequence like the following into the LLM:

['<cls>', '<x1>', '<y1>', '<x2>', '<y2>', '<cls>', '<x1>', '<y1>', '<x2>', '<y2>', '<cls>', '<x1>', '<y1>', '<x2>', '<y2>', ...]

For the object detection loss, did you use Hungarian matching as in DETR?
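To make the first option concrete, here is a toy sketch of what I mean by a matching-based loss. It is not taken from the VisionLLM code: the function names are hypothetical, the cost is plain L1 on box coordinates (DETR's matching cost also includes class and GIoU terms), and it brute-forces all permutations instead of running the actual Hungarian algorithm, so it only works for a handful of boxes.

```python
from itertools import permutations


def l1_cost(pred, gt):
    """L1 distance between two boxes given as (x1, y1, x2, y2)."""
    return sum(abs(p - g) for p, g in zip(pred, gt))


def best_assignment(preds, gts):
    """Return (perm, cost) where perm[i] is the ground-truth index matched to preds[i].

    Brute force over permutations: same optimum as the Hungarian algorithm,
    but exponential cost. Assumes len(preds) == len(gts).
    """
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(gts))):
        cost = sum(l1_cost(preds[i], gts[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


# Predictions come out in a different order than the ground truth.
preds = [(10, 10, 50, 50), (60, 60, 90, 90)]
gts = [(62, 58, 92, 88), (8, 12, 48, 52)]
perm, cost = best_assignment(preds, gts)
print(perm, cost)
```

With matching like this, the order of the ground-truth boxes does not matter, because the loss is computed only after each prediction has been paired with its closest target.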

Or, if you just use next-token prediction with a cross-entropy loss, how do you order the ground-truth boxes?
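For the second option, this is the kind of fixed ordering I have in mind; a minimal sketch under my own assumptions, not something I found in the repo. The helper `boxes_to_tokens`, the 1000-bin quantization, and the sort key (top-to-bottom, then left-to-right by the top-left corner) are all hypothetical choices for illustration.

```python
def boxes_to_tokens(boxes, image_size, num_bins=1000):
    """Serialize boxes into a flat token list for next-token prediction.

    boxes: list of (cls_name, x1, y1, x2, y2) in pixel coordinates.
    image_size: (width, height) in pixels.
    """
    width, height = image_size

    def quantize(value, size):
        # Map a pixel coordinate to a discrete bin index in [0, num_bins - 1].
        return min(num_bins - 1, max(0, int(value * num_bins / size)))

    # Assumed ordering: sort by y1 first, then x1, so the target sequence
    # is deterministic for a given set of boxes.
    ordered = sorted(boxes, key=lambda b: (b[2], b[1]))

    tokens = []
    for cls_name, x1, y1, x2, y2 in ordered:
        tokens.append(f"<{cls_name}>")
        tokens.append(f"<{quantize(x1, width)}>")
        tokens.append(f"<{quantize(y1, height)}>")
        tokens.append(f"<{quantize(x2, width)}>")
        tokens.append(f"<{quantize(y2, height)}>")
    return tokens


boxes = [("dog", 320, 240, 640, 480), ("cat", 0, 0, 320, 240)]
tokens = boxes_to_tokens(boxes, image_size=(640, 480))
print(tokens)
```

If the ordering is not fixed like this, the same image could map to many different target sequences, and the cross-entropy loss would penalize predictions that list the correct boxes in a different order.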

OpenGVLab / VisionLLM