Describe the bug
The assembler code operates on the assumption that the input is Unicode. So it accepts some non-ASCII character sequences and interprets them. For example the Unicode character U+22C5 "⋅" is accepted as a decimal point. See assembler/src/asmlib/parser.rs. This is not consistent with the character conversion code in base/src/charset.rs. We should make these things consistent.
To Reproduce
See description above. base/src/charset.rs isn't used by other code yet, so the inconsistency isn't visible in the program's functionality.
Expected behavior
Converting a sequence of LW codes to Unicode (using the charsets code) and feeding it to the assembler should result in a program that the assembler interprets correctly. Once we implement this, getting the assembler to emit a punched listing should yield a representation consistent with the LW codes we started with.
Additional context
Our assembler is currently designed on the assumption that the input data is Unicode. Perhaps this gives rise to too much complexity and we should modify it to expect LW codes on input. However, there is a reason why we didn't start out this way. The problem is that we have a scanned listing for Sketchpad, but no machine-readable source. We're going to have to convert the Sketchpad listing somehow.
It's going to be very challenging to do this via OCR because so many of the characters are unreadable blobs and because the timing behaviour of the Xerox Printer means that the subscript/superscript/normal characters lack a consistent baseline. They're kind of wavy. So even a sophisticated OCR system is going to have trouble with this.
So I wanted to build the assembler in such a way that it could accept the output of an OCR system, in some form, and use its idea of what a valid input was to help identify the correct interpretation of what the OCR system is looking at. If we interpose a stage where we convert to LW codes, we have to decide whether particular characters are superscript/subscript in a part of the code that has no semantic understanding of what the data means.
One simple way to proceed for this issue is for the charsets code to export constants which represent particular things (e.g. superscript dot) and have the parser refer directly to these in e.g. tag() calls.
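As a rough illustration of that idea, here is a minimal sketch. Everything in it is hypothetical: the `charset` module stands in for base/src/charset.rs, the constant name `DOT_OPERATOR` is made up, and the `tag` function is a tiny std-only imitation of a parser-combinator `tag()`, not the one parser.rs actually uses. The point is only that the parser names the character through a shared constant instead of spelling out the Unicode literal itself.

```rust
// Hypothetical stand-in for constants exported by base/src/charset.rs.
// The module and constant names are assumptions made for this sketch.
pub mod charset {
    /// U+22C5 DOT OPERATOR, which the parser currently accepts as a decimal point.
    pub const DOT_OPERATOR: &str = "\u{22C5}";
}

/// Std-only imitation of a `tag()` combinator (an assumption, not the real
/// parser API): if `input` starts with `expected`, return the remaining input.
pub fn tag<'a>(expected: &str, input: &'a str) -> Option<&'a str> {
    input.strip_prefix(expected)
}

/// Hypothetical parser fragment: recognise a decimal point by referring to
/// the shared constant rather than to a character literal of its own.
pub fn decimal_point(input: &str) -> Option<&str> {
    tag(charset::DOT_OPERATOR, input)
}
```

With this shape, changing which Unicode character represents a given symbol would be a one-line edit in the charset code, and the parser and the LW conversion tables could not drift apart on that character.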