TX-2 / TX-2-simulator

Simulator for the pioneering TX-2 computer
https://tx-2.github.io/
MIT License

Unify representations of super/sub script in charsets and assembler #44

Open jamesyoungman opened 2 years ago

jamesyoungman commented 2 years ago

Describe the bug

The assembler code assumes that its input is Unicode, so it accepts some non-ASCII character sequences and interprets them. For example, the Unicode character U+22C5 "⋅" is accepted as a decimal point (see assembler/src/asmlib/parser.rs). This is not consistent with the character conversion code in base/src/charset.rs. We should make the two consistent.

To Reproduce

See the description above. The code in base/src/charset.rs isn't used by other code yet, so the inconsistency isn't yet visible in the program's behaviour.

Expected behavior

  1. Converting a sequence of LW codes to Unicode (using the charsets code) and feeding it to the assembler should result in a program that the assembler interprets correctly. Once we implement this, getting the assembler to emit a punched listing should yield a representation consistent with the LW codes we started with.
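The round-trip property described above could be checked with something like the following. This is only a sketch: the `Script` enum, the `TABLE` contents, and the `to_unicode`/`from_unicode` names are all hypothetical, and the numeric codes are placeholders invented for illustration, not real LW codes. It also assumes that each (code, script) pair maps to a single Unicode character, which may not match how base/src/charset.rs actually works.

```rust
/// Script position of a character (hypothetical representation).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Script {
    Normal,
    Super,
    Sub,
}

/// Placeholder table of (code, script, Unicode character).
/// The numeric codes are invented for this sketch; they are NOT real LW codes.
const TABLE: &[(u8, Script, char)] = &[
    (1, Script::Normal, '1'),
    (1, Script::Super, '\u{00B9}'), // superscript one
    (1, Script::Sub, '\u{2081}'),   // subscript one
    (2, Script::Normal, '2'),
    (2, Script::Super, '\u{00B2}'), // superscript two
    (2, Script::Sub, '\u{2082}'),   // subscript two
];

/// Convert (code, script) pairs to Unicode; None if any pair is not in the table.
pub fn to_unicode(codes: &[(u8, Script)]) -> Option<String> {
    codes
        .iter()
        .map(|&(code, script)| {
            TABLE
                .iter()
                .find(|&&(c, s, _)| c == code && s == script)
                .map(|&(_, _, ch)| ch)
        })
        .collect()
}

/// Convert Unicode back to (code, script) pairs; None if any character is not in the table.
pub fn from_unicode(text: &str) -> Option<Vec<(u8, Script)>> {
    text.chars()
        .map(|ch| {
            TABLE
                .iter()
                .find(|&&(_, _, u)| u == ch)
                .map(|&(c, s, _)| (c, s))
        })
        .collect()
}
```

Because both directions read the same table, `from_unicode(&to_unicode(codes)?)` returns the original codes for any input the table covers. A real test would put the assembler and its punched-listing output between the two conversions.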

Additional context

Our assembler is currently designed on the assumption that the input data is Unicode. Perhaps this gives rise to too much complexity and we should modify it to expect LW codes on input. However, there is a reason why we didn't start out this way: we have a scanned listing for Sketchpad, but no machine-readable source. We're going to have to convert the Sketchpad listing somehow.

It's going to be very challenging to do this via OCR, because so many of the characters are unreadable blobs and because the timing behaviour of the Xerox Printer means that the subscript/superscript/normal characters lack a consistent baseline. They're kind of wavy. So even a sophisticated OCR system is going to have trouble with this.

So I wanted to build the assembler in such a way that it could accept the output of an OCR system, in some form, and use its idea of what a valid input was to help identify the correct interpretation of what the OCR system is looking at. If we interpose a stage where we convert to LW codes, we have to decide whether particular characters are superscript/subscript in a part of the code that has no semantic understanding of what the data means.

One simple way to proceed for this issue is for the charsets code to export constants which represent particular things (e.g. superscript dot) and have the parser refer directly to these in e.g. tag() calls.
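A minimal sketch of that idea, using only the standard library. The `charset` module layout, the `DOT_OPERATOR` name, and the `decimal_point` function are hypothetical, and the `tag` function here is a simplified stand-in for a parser-combinator `tag()`, not the one the parser actually uses. The only detail taken from this issue is that U+22C5 is accepted as a decimal point.

```rust
/// Hypothetical constants that the charset code could export.
pub mod charset {
    /// U+22C5 DOT OPERATOR, which the assembler currently accepts as a decimal point.
    pub const DOT_OPERATOR: &str = "\u{22C5}";
}

/// Simplified stand-in for a parser-combinator `tag()`: if `input` starts
/// with `literal`, return the rest of the input; otherwise return None.
pub fn tag<'a>(literal: &str, input: &'a str) -> Option<&'a str> {
    input.strip_prefix(literal)
}

/// Match a decimal point via the shared constant, so the parser and the
/// charset code cannot drift apart on which character it is.
pub fn decimal_point(input: &str) -> Option<&str> {
    tag(charset::DOT_OPERATOR, input)
}
```

With this arrangement, `decimal_point("⋅5")` yields `Some("5")` and `decimal_point(".5")` yields `None`. Changing which character represents a given symbol would then mean editing one constant rather than hunting for literals in the parser.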