OCR-D / gt-guidelines

OCR-D guidelines for Ground Truth production
https://ocr-d.de/en/gt-guidelines/trans/
Creative Commons Attribution Share Alike 4.0 International

How to encode mathematical fractions? #24

Open kba opened 3 years ago

kba commented 3 years ago

While Unicode does have codepoints for the most common fractions (¼, ½, ¾, etc.), this does not scale, because not all possible numerator/denominator combinations are available. So it might be best to encode fractions as just "numerator fraction-slash denominator" (with regular numbers or super/subscript numbers?) or even produce LaTeX syntax.

bertsky commented 3 years ago

> with regular numbers or super/subscript numbers?

No, regular numbers are what Unicode suggests for this. The typical small-script font appearance is implemented by Unicode renderers merely because of the pattern numeral fraction-slash numeral, i.e. both the numerator and denominator are ordinary (ASCII) numerals. (You can try it out with an editor/browser of your choice, e.g. ¾⅔ (precomposed) vs 3⁄4 2⁄3 (independent, but rendered equally by good fonts/engines; GH obviously is not one of them).)
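To make the suggested encoding concrete, here is a minimal Python sketch, assuming the convention "ASCII digits, U+2044 FRACTION SLASH, ASCII digits". The helper names `encode_fraction` and `decompose_vulgar_fractions` are hypothetical and not part of any OCR-D tool. The second helper relies on the `<fraction>` compatibility decompositions that the Unicode character database defines for the precomposed fraction characters.

```python
import unicodedata

FRACTION_SLASH = "\u2044"  # U+2044 FRACTION SLASH, not the ASCII solidus "/"


def encode_fraction(numerator: int, denominator: int) -> str:
    """Encode a fraction as ordinary digits around U+2044 (hypothetical helper)."""
    return f"{numerator}{FRACTION_SLASH}{denominator}"


def decompose_vulgar_fractions(text: str) -> str:
    """Replace precomposed fractions (e.g. U+00BE) by digit, U+2044, digit.

    Only characters whose compatibility decomposition is tagged <fraction>
    are touched, so superscripts, ligatures etc. are left unchanged
    (unlike a blanket NFKC/NFKD normalization).
    """
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith("<fraction>"):
            # decomp looks like "<fraction> 0033 2044 0034"
            codepoints = decomp.split()[1:]
            out.append("".join(chr(int(cp, 16)) for cp in codepoints))
        else:
            out.append(ch)
    return "".join(out)
```

Note that two adjacent precomposed fractions would decompose into one run of digits and slashes, so a separator between them has to be present in the text already.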

> or even produce LaTeX syntax

I'd recommend against that. LSTM-CTC will learn to give you character sequences, but getting a certain syntax consistently is pure luck.

Note: the actual argument for distinguishing the fraction slash from the ordinary slash goes as follows: on the visual side, a fraction will always be discernible from other numeric expressions involving a slash (like dates or identifiers/codes), because it looks super/subscripted, so the OCR can learn that. This holds independently of the decision whether super/subscript numbers should be represented as such or as ordinary numbers.
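On the text side, the same distinction pays off downstream. A rough sketch, assuming the ground truth uses U+2044 for fractions and the ASCII solidus for dates and codes (the function name `find_fractions` is hypothetical):

```python
import re

# Digits, U+2044 FRACTION SLASH, digits. The ASCII solidus "/" does not match.
FRACTION = re.compile(r"(\d+)\u2044(\d+)")


def find_fractions(line: str) -> list:
    """Return (numerator, denominator) pairs for every fraction in the line."""
    return [(int(n), int(d)) for n, d in FRACTION.findall(line)]
```

With this convention, a date such as 3/4/1890 written with the ordinary slash is never mistaken for a fraction, which would not be possible if both used the same character.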

tboenig commented 3 years ago