Open abraker95 opened 4 years ago
So from initial research:
Some nuances concerning box files:
Each line has the format:
<symbol> <left> <bottom> <right> <top> <page>
where symbol is the character, and left, bottom, right, and top are pixel coordinates. page should be 0 unless the TIFF file is multi-page.
Sample box data:
d 112 4654 136 4690 0
i 141 4654 147 4690 0
f 149 4654 166 4691 0
f 163 4654 180 4691 0
e 179 4653 204 4681 0
r 208 4653 224 4680 0
e 224 4653 249 4680 0
n 253 4653 276 4680 0
t 279 4653 292 4688 0
292 4653 310 4689 0
N 310 4653 339 4689 0
e 344 4653 369 4680 0
w 368 4652 408 4679 0
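For reference, a minimal sketch of how lines like the ones above could be parsed in Python. This is not part of Tesseract; the names `Box` and `parse_box_line` are made up for illustration. It assumes fields are whitespace-separated, and it assumes that a line with only five fields (like the `292 4653 310 4689 0` line in the sample) is a box for a space character, which is a guess. In the sample, top is always larger than bottom, which suggests the y axis runs upward from the bottom of the image.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One line of a box file (hypothetical helper type)."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.top - self.bottom


def parse_box_line(line: str) -> Box:
    """Parse '<symbol> <left> <bottom> <right> <top> <page>' into a Box."""
    parts = line.split()
    if len(parts) == 6:
        symbol, numbers = parts[0], parts[1:]
    elif len(parts) == 5:
        # Assumption: the symbol field is missing because it was a space.
        symbol, numbers = " ", parts
    else:
        raise ValueError(f"unexpected box line: {line!r}")
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(symbol, left, bottom, right, top, page)


# Two lines taken from the sample data above.
sample = ["d 112 4654 136 4690 0", " 292 4653 310 4689 0"]
boxes = [parse_box_line(line) for line in sample]
for box in boxes:
    print(repr(box.symbol), box.width, box.height)
```

Having the width and height per box makes it easier to spot boxes that are suspiciously narrow or that overlap their neighbours before feeding the file to training.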
Look into training the OCR to recognize the NotoSans font better. I am using Tesseract version 5.0.0-alpha.20200328.
See: