getting x,y positions of the text in the original image?

facebookresearch / nougat

Implementation of Nougat Neural Optical Understanding for Academic Documents

https://facebookresearch.github.io/nougat/

MIT License


getting x,y positions of the text in the original image? #37

Closed archywillhe closed 1 year ago

archywillhe commented 1 year ago

Is there a way to compute rect boxes for the text detected? Or know roughly the starting x,y coordinate of a text paragraph?

lukas-blecher commented 1 year ago

That's not really something you can do with nougat. The model works end to end and doesn't compute bounding boxes for text. You could try to match the text to blocks extracted by e.g. mupdf, which has the x,y coordinates. Or, if you really want to use the model, you could try to make sense of the attention maps, but that's not something I'd recommend, really.
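A minimal sketch of the block-matching idea, assuming the PDF has an embedded text layer and that PyMuPDF (the Python binding for mupdf) is installed. The helper names `normalize`, `load_blocks`, and `locate_paragraph`, and the `min_score` threshold, are hypothetical and not part of nougat. The fuzzy comparison uses `difflib` from the standard library, which is just one possible choice of similarity measure.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so line breaks don't hurt matching."""
    return " ".join(text.lower().split())


def load_blocks(pdf_path, page_number=0):
    """Read the text blocks of one page with PyMuPDF (pip install pymupdf)."""
    import fitz  # PyMuPDF; imported lazily so the matcher itself has no dependency

    with fitz.open(pdf_path) as doc:
        # Each block is (x0, y0, x1, y1, text, block_no, block_type),
        # with coordinates in PDF points (1/72 inch).
        return doc[page_number].get_text("blocks")


def locate_paragraph(paragraph, blocks, min_score=0.6):
    """Return (bbox, score) for the block most similar to `paragraph`.

    Returns None if no text block reaches `min_score` (a hypothetical
    threshold that would need tuning on real data).
    """
    target = normalize(paragraph)
    best = None
    for block in blocks:
        x0, y0, x1, y1, text, _block_no, block_type = block[:7]
        if block_type != 0:  # 0 = text block, 1 = image block
            continue
        score = SequenceMatcher(None, target, normalize(text)).ratio()
        if best is None or score > best[1]:
            best = ((x0, y0, x1, y1), score)
    if best is None or best[1] < min_score:
        return None
    return best
```

Usage would look roughly like `locate_paragraph(paragraph_from_nougat, load_blocks("paper.pdf", 0))`. Two caveats: nougat emits markup, so stripping Markdown and LaTeX from the paragraph before matching will likely help, and the returned box is in PDF points, so it has to be scaled by `dpi / 72` to get pixel coordinates in the rasterized page image.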