apsexton / bateman-ocr

Tools and experiments in the OCR of the Bateman Manuscripts
ISC License

Suppress Table Lines and Graphics #13

Open apsexton opened 8 years ago

apsexton commented 8 years ago

Identify reasonable character sizes by considering the distribution of bounding-box sizes of the connected components (CCs) in the page. The smallest CCs will be noise, full stops, or the dots above the lowercase letters i and j. As the size increases, we find first the characters in the smallest font size, then larger and larger characters, then large fence symbols (braces, parentheses, etc.) and division lines, then graphics (diagrams, plots) and table lines. Since characters occur much more often in the page than the larger CCs, one can easily find a size limit that distinguishes characters from non-characters.
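As a rough illustration of the idea above, here is a minimal sketch of one way to pick such a size limit. The function name `character_size_limit`, the `bin_width` parameter, the use of `max(width, height)` as the size measure, and the "walk up from the most populated bin to the first empty bin" rule are all assumptions made for this example, not something the project prescribes.

```python
from collections import Counter


def character_size_limit(boxes, bin_width=4):
    """Estimate a size (in pixels) above which a CC is treated as a non-character.

    boxes: list of (width, height) bounding boxes of connected components.
    Heuristic (an assumption, not the project's method): histogram the
    larger side of each box, start at the most populated bin (characters
    dominate the page), and walk upwards until the first empty bin.
    """
    if not boxes:
        raise ValueError("no bounding boxes given")
    # Histogram of the larger side of each bounding box.
    bins = Counter(max(w, h) // bin_width for w, h in boxes)
    # Most populated bin; ties are broken in favour of the smaller size.
    current = max(bins, key=lambda b: (bins[b], -b))
    # Climb through contiguous non-empty bins.
    while bins.get(current + 1, 0) > 0:
        current += 1
    # The lower edge of the first empty bin is the size limit.
    return (current + 1) * bin_width
```

The gap rule is sensitive to `bin_width`: a bin that is too narrow can produce an empty bin between two genuine font sizes, so it may need tuning per scan resolution.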

Identify any connected components that are significantly wider and taller than reasonable character sizes, on the assumption that these are either table lines or graphics. For the moment, we will simply suppress these objects: i.e. not use them in our layout analysis, but also not allow them to interfere with the rest of the analysis. Later on we will return to tables and integrate them into our analysis. For now, simply draw the bounding boxes of the identified table lines or diagrams in a different colour from those of character CCs.
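The suppression step could be sketched roughly as follows. The function name `split_components`, the `(x, y, width, height)` box layout, the `factor` multiplier standing in for "significantly", and the `either` switch are hypothetical choices for this example; `char_limit` is assumed to be a character size limit obtained beforehand.

```python
def split_components(boxes, char_limit, factor=2.0, either=False):
    """Split CC bounding boxes into (characters, suppressed).

    boxes: list of (x, y, width, height) tuples.
    char_limit: largest reasonable character size in pixels.
    factor: how many times larger than char_limit counts as
            "significantly" larger (an assumed value).
    either: if False, a box must be both wider AND taller than the
            threshold to be suppressed; if True, one dimension suffices,
            which also catches isolated horizontal or vertical rules.
    """
    threshold = factor * char_limit
    chars, suppressed = [], []
    for box in boxes:
        _, _, w, h = box
        wide, tall = w > threshold, h > threshold
        oversized = (wide or tall) if either else (wide and tall)
        (suppressed if oversized else chars).append(box)
    return chars, suppressed
```

The two returned lists can then be drawn in different colours and only the first one passed on to layout analysis. With `either=True`, long fraction bars may also be suppressed, so the strict default is probably the safer starting point.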

LibriCerule commented 8 years ago

When a variable or constant is raised to the power of 1/2, will the small 1 and the small 2 each be considered a character, or will only the 1/2 as a whole be considered a character?

LibriCerule commented 8 years ago

screenshot-1

I have a few questions on the identification of a character CC. One problem arises with equals signs, where the top bar is considered a character while the bottom one is not. Should I consider both bars characters, or neither? Other questionable characters come from fractions, especially those in exponents. If I do accept the small digits in the numerator and denominator as characters, the threshold becomes so small that cases like the three dots occur. There, the areas of the three dots differ only minimally, but because the threshold has to be low enough to accept the small numbers in fractions, whether a given dot is accepted comes down purely to luck: it depends on how many pixels the image resolution happened to pick up for it. Do you have any suggestions for solving these two problems?