TextMonkey question - Githubissues

Yuliang-Liu / Monkey

【CVPR 2024 Highlight】Monkey (LMM): Image Resolution and Text Label Are Important Things for Large Multi-modal Models

MIT License

1.82k stars, 128 forks

TextMonkey question #82

Closed: songyanbei closed this issue 6 months ago

songyanbei commented 6 months ago

[Image: WechatIMG111039]

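Hello authors. While reading the paper and the training data, I came across two questions about Table 1 in the paper, marked by the red boxes in the image above. Question 1: in Table 1, the Text Recognition box is given as two points, (x1,y1),(x2,y2), but in the actual JSON I found only center-point coordinates. Is there a discrepancy between the description and what is actually used? Question 2: is the description of coordinate normalization wrong? Should it be x/hr or y/hr?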

MelosY commented 6 months ago

Sorry for the confusion. On the first question: all experiments in the paper were trained with the format shown in the table. The paper also notes that point coordinates can further improve performance, so the open-source version provides points in the JSON. On the second: x here refers to the coordinate along the height.
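For readers comparing the two label formats, here is a minimal sketch of how a two-point box relates to a center point. The function names are hypothetical, and the normalization shown (dividing each pixel coordinate by the image size along its own axis) is an assumption for illustration only; it is not confirmed to be the scheme used in the paper or in the released JSON.

```python
def box_to_center(x1, y1, x2, y2):
    """Return the center point of an axis-aligned box given two corners."""
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def normalize_point(x, y, width, height):
    """Scale a pixel coordinate into [0, 1].

    Assumption: each axis is divided by the image size along that axis.
    The actual TextMonkey convention may differ.
    """
    return (x / width, y / height)


# Hypothetical box inside an 800x400 image.
center = box_to_center(100, 40, 300, 80)
normalized = normalize_point(*center, width=800, height=400)
print(center, normalized)
```

Whichever convention applies, the center point discards the box extent, so labels in the two formats are not interchangeable without the original boxes.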

pikerbright commented 2 months ago

@MelosY @echo840, so was the released TextMonkey model trained on (x1,y1),(x2,y2) label data?