Code duplication and maintainability ambiguities in font and metric generation

IvoWingelaar commented 2 years ago

Before reporting a bug

[x] Check common issues.

[x] Search for existing issues.

I've been rewriting the code responsible for font, and the associated metrics, generation from the Perl scripts to a (IMHO) more readable (and therefore maintainable) collection of Python scripts. The font generation scripts are finished, but I've ran into some situations that point to possible issues when rewriting the metric generation.

The mapping from TeX font character codes to Unicode scalar values is defined twice, once in the src/fonts/makeFF Perl script, and a second time in the src/metrics/mapping.pl Perl script. These two mappings are very slightly different. Is this an intentional design decision? The overwhelming majority of mapping pairs are the same. It seems a chore and risky with regards to possible bugs to maintain two almost equal datasets of >1000 lines.
Some of the mappings overlap symbols, as for example in makeFF: cmr10 0x2C -> Main-Regular 0x2C, while the following mapping also exists: cmmi10 0x3B -> Main-Regular 0x2C. The makeFF script sorts certain keys of the mappings, so I believe the produced fonts stays deterministic in which glyphs survive these competing mappings, but this seems confusing. Other duplicate mappings are: cmr10 0x2E -> Main-Regular 0x2E and cmmi10 0x3A -> Main-Regular 0x2E, cmbx10 0x2E -> Main-Bold 0x2E and cmmib10 0x3A -> Main-Bold 0x2E, cmbx10 0x2C -> Main-Bold 0x2C and cmmib10 0x3B -> Main-Bold 0x2C.
A pseudo-duplicate mapping also exists in the Typewriter mapping for 0x7E, but there the intent is clear as the second mapping is between the same glyphs as the first one, but the second one provides an additional y shift.

I've tried so search the archived katex-fonts repository for more information about these design choices, but I couldn't find it, so are these decisions intentional?

Environment (please complete the following information):

KaTeX Version: latest commit (ef49e2be91628e96e4fcb962b28b443930627d84)

edemaine commented 2 years ago

I believe the messyness in the code (and those duplicate tables) are inherited from MathJax code that this was originally based on. But it predates my time so I'm not sure.

I agree that it would be best to remove this risky duplicated code (in particular resolving the discrepancies), and to avoid duplicates (and maybe detect duplicates to avoid this in the future?). Or to otherwise rewrite this code in a better way.

IvoWingelaar commented 2 years ago

The issue is not as bad as I thought. I re-used the font generation mapping to also generate the metrics, and get the following diff on my newly generated fontMetricsData.js:

Adds all Fraktur-Bold and Caligraphic-Bold metrics: they are absent in the current version apparently.
Adds 2 missing metrics to Main-Italic and 5 to Typewriter-Regular.
Removes 3 metrics (corresponding to glyphs that are currently unused) from Typewriter-Regular.
And only changes the metrics on 1 glyph in Typewriter-Regular: (0u7E - Combining Tilde) has changed height/depth.

I'm confident my work's in a state that can be reviewed, so how granular do you want my PR to be? A single PR for both the font-generation and the metrics generation changes? Or two separate ones?

Afterwards, it's not that difficult to add a sanity check to detect duplicate mappings. My next goal though is to add the pair-kerning tables to the fonts.

IvoWingelaar commented 2 years ago

I've taken the initiative to prepare to split it in multiple PR's, as the changes can be semantically grouped, with the goal to make it easier to review. A huge delta of multiple KLOC's added/deleted seems like bad etiquette, hopefully a chain of PR's is less bad. This chain is #3713, #3714, #3715. When this chain is approved and merged I can start working on implementing pair-kerning, but this is feature where a separate issue to track progress seems more appropriate.

KaTeX / KaTeX

Code duplication and maintainability ambiguities in font and metric generation #3702