david-schmidt / tlc-apple2

Some bits and bobs regarding the Tiger Learning Computer, sort of an Apple IIe packaged by Tiger Electronics in 1996

code golf on joystick RS-232 grub bootstrap loader #6

Closed david-schmidt closed 1 year ago

david-schmidt commented 1 year ago

The hybrid joystick/audio ADTPro bootstrap loader is 128 bytes long - the poor soul that wants to do joystick bitbanging bootstrapping for themselves needs to type it all into the TLC monitor with a truly awful keyboard. Can you get my monstrosity down to 50% of its initial size? I bet so...

It has the following constraints:

  1. It needs to clear the screen sometime during initialization or operation (no, you can't send $20s to the screen via joystick)
  2. It needs to print a greeting message of two characters: "HI" somewhere on the screen
  3. It needs to put a throbber (or any visually changing indication) on the screen to indicate data movement - there’s no requirement to keep the screen clear once the process starts
  4. It needs to work (cycle counting is paramount in bitbanging)
  5. No error checking of any kind is required, but see constraint 4, above

The "protocol" on the wire: the loader waits for a "T" byte ($54), then receives two bytes as a length (LSB, then MSB; the listings here store the first byte as the low byte), then continues to read bytes and save them to memory starting at $800 until the length is satisfied. When done, it jumps to $800.
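For anyone testing from the host side, the framing described above can be sketched in a few lines of Python. This is only an illustration: `build_frame` and `parse_frame` are made-up names, and the byte order of the length field is left as a parameter. Low byte first is the default because the listings in this thread compare the first length byte against the low-order index, but treat that as an assumption rather than a spec.

```python
def build_frame(payload: bytes, low_byte_first: bool = True) -> bytes:
    """Prefix the payload with the 'T' signature and a 16-bit length."""
    if not 0 < len(payload) <= 0xFFFF:
        raise ValueError("payload length must be 1..65535 bytes")
    order = "little" if low_byte_first else "big"
    return b"T" + len(payload).to_bytes(2, order) + payload


def parse_frame(stream: bytes, low_byte_first: bool = True) -> bytes:
    """Mimic the polling loader: skip bytes until 'T', read the length, return the payload."""
    start = stream.index(b"T")
    order = "little" if low_byte_first else "big"
    length = int.from_bytes(stream[start + 1:start + 3], order)
    return stream[start + 3:start + 3 + length]


frame = build_frame(bytes([0xA9, 0x00, 0x60]))
print(frame.hex(" "))
```

Feeding the frame back through `parse_frame` (optionally with junk bytes in front) should return the original payload, which is a cheap sanity check before involving real hardware.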

My code makes use of uninitialized variable space that doesn't need to be included in the payload the user needs to type in. They are all in 16-bit space, so you can save space already by using zero page instead! First optimization is on me. :-)

My initial shot at this is over in the ADTPro repo here, and currently satisfies all of the constraints (and goes the extra mile to check for some RS-232 framing errors): https://github.com/ADTPro/adtpro/blob/tlc-grub/src/client/prodos/serial/grub2/grub2joy.asm Comment density is about 50%, so hopefully that will help with context.

david-schmidt commented 1 year ago

When I'm sitting at my TLC, this is the paper I have in front of me. It takes a while to bang this all in. It sure would be awesome if L$ could be 40 instead of 80!

300: 20 58 FC A9 00 A8 85 08
308: A9 08 85 09 A2 C8 8E 24
310: 04 E8 8E 25 04 20 47 03
318: C9 54 D0 F9 20 47 03 8D
320: 82 03 20 47 03 8D 83 03
328: 20 47 03 90 D4 91 08 8D
330: 27 04 C8 D0 05 E6 09 CE
338: 83 03 CC 82 03 D0 E9 AD
340: 83 03 D0 E4 4C 00 08 A9
348: 09 8D 80 03 18 AD 61 C0
350: 10 FB AD 61 C0 30 FB A2
358: 1D CA D0 FD 24 00 18 AD
360: 61 C0 30 03 4C 68 03 38
368: CE 80 03 F0 0F AD 81 03
370: 6A 8D 81 03 A2 0E CA D0
378: FD 4C 5E 03 AD 81 03 60

BSAVE GRUBTLC,A$300,L$80

david-schmidt commented 1 year ago

It could start at $7xx and "fall through" to $800 when length is satisfied... oh, except that's screen memory. You'd never be able to type it in. Never mind. Starting at $800 isn't negotiable - we're loading the literal ADTPro audio/joystick client directly to its initial location.

david-schmidt commented 1 year ago

Kent Dickey's submission:

300: 20 58 fc a2 c8 8e 24 04
308: e8 8e 25 04 20 3b 03 c9
310: 54 d0 f9 20 3b 03 85 01
318: 20 3b 03 85 02 a0 00 20
320: 3b 03 99 00 08 8d 27 04
328: c8 d0 05 ee 24 03 c6 02
330: a5 02 d0 eb c4 01 90 e7
338: 4c 00 08 a9 80 85 00 2c
340: 61 c0 30 fb a9 02 20 a8
348: fc a2 0a ea ea ca d0 fb
350: ad 61 c0 49 80 0a 66 00
358: 90 ef a2 0f ca d0 fd ad
360: 61 c0 90 03 a5 00 60 00

grub2kent.asm:

PB0  =  $c061

RCVBYTE  =  $00
size     =  $01 ; and $02
BUFP     =  $08 ; And $09

.org    $300

Entry:
    jsr $fc58   ; HOME

    ldx #$c8
    stx $424
    inx
    stx $425

poll:
    jsr get_byte
    cmp #$54
    bne poll

; Got signature, read data
    jsr get_byte
    sta size
    jsr get_byte
    sta size+1
    ldy #0

read:
    jsr get_byte
read_patch:
    sta $800,y
    sta $427
    iny
    bne skip_inc
    inc read_patch+2
    dec size+1
skip_inc:
    lda size+1
    bne read
    cpy size
    bcc read
    jmp $800

get_byte:
; The serial line must be idle (PB0 must have its high bit set)
; We simply must wait for the beginning of the start bit (PB0 high bit clear)
    lda #$80
    sta RCVBYTE
wait_for_start:
    bit PB0
    bmi wait_for_start

; We got the start sometime in the last 10 clocks.  Wait 1.5 bit times to
; grab the bits in the middle of the bit times.  There are 106 CPU cycles in
; one bit time.  So 1.5 bit times = 159 cycles
; https://6502disassembly.com/a2-rom/ says delay is A*A*2.5 + A*13.5 + 13
; A=1 -> 29
; A=2 -> 50
; A=3 -> 76 cycles
; A=4 -> 107 cycles
; A=5 -> 143 cycles
    lda #$2 ; 2
    jsr $fca8   ; WAIT: 50 cycles for JSR and called routine

read_bit:
    ldx #10 ; 2
read_bit2:
    nop
    nop
    dex     
    bne read_bit2   ; Total cycles: X*9 - 1 = 89
    lda PB0 ; 4
    eor #$80    ; 2
    asl     ; 2
    ror RCVBYTE ; 5
    bcc read_bit ; 3
; Above overhead, not counting read_bit2 loop, is 2+4+2+2+5+3=18 clocks
; We need the wait to take 88 more clocks, but we actually wait 89.
; We'll just slip one cycle, it's fine
; Delay from wait_for_start to lda PB0 in read_bit2 is: 4+2+52+2+89+4=153

; If we get here, we have the byte in RCVBYTE.  We are 2+2+5+2 cycles past
; the middle of the last bit, and we want to wait about 90 clocks total
; before returning.
    ldx #15
wait_stop:
    dex
    bne wait_stop   ; 74: Total = X*5 - 1
    lda PB0
    bcc bad
    lda RCVBYTE
    rts
bad:
    brk
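
The cycle comments in the listing above can be re-derived mechanically. Here is a small Python sketch that does so (Python only because it is easy to run on the host). `wait_cycles` uses the WAIT formula exactly as quoted in the listing, and `loop_cycles` assumes the usual 6502 costs of 2 cycles for DEX/NOP, 3 for a taken BNE and 2 for a not-taken one. Both helper names are made up for this example.

```python
def wait_cycles(a: int) -> float:
    """Monitor WAIT ($FCA8) delay as quoted in the listing: 2.5*A^2 + 13.5*A + 13."""
    return 2.5 * a * a + 13.5 * a + 13


def loop_cycles(count: int, body: int = 0) -> int:
    """Count-down loop: `body` cycles of filler plus DEX (2) and a taken BNE (3)
    per pass; the final BNE is not taken and costs 2, hence the -1."""
    return count * (body + 5) - 1


wait_table = [wait_cycles(a) for a in range(1, 6)]
inner = loop_cycles(10, body=4)                 # ldx #10 with two NOPs per pass
per_bit = 2 + inner + 4 + 2 + 2 + 5 + 3         # ldx, loop, lda, eor, asl, ror, bcc
first_sample = 4 + 2 + (2 + wait_cycles(2)) + 2 + inner + 4

print(wait_table, inner, per_bit, first_sample)
```

Under those assumptions the per-bit period comes out one cycle over the nominal 106, which is the "slip one cycle" the comments mention.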
david-schmidt commented 1 year ago

After more work, Kent's contribution is down to 94 bytes:

300: 20 58 FC A2 C8 8E 24 04
308: E8 8E 25 04 20 3B 03 C9
310: 54 D0 F9 20 3B 03 85 07
318: 20 3B 03 85 08 A0 00 20
320: 3B 03 99 00 08 8D 27 04
328: C8 D0 05 EE 24 03 C6 08
330: A5 08 D0 EB C4 07 90 E7
338: 4C 00 08 A9 80 85 06 2C
340: 61 C0 30 FB A2 1D D0 02
348: A2 12 CA D0 FD AD 61 C0
350: 0A 66 06 90 F3 A2 0D CA
358: D0 FD A5 06 EA 60

PB0  =  $c061

RCVBYTE  =  $06
size     =  $07 ; and $08

    .org    $300
Entry:
    jsr $fc58   ; HOME

    ldx #$c8
    stx $424
    inx

    stx $425

poll:
    jsr get_byte
    cmp #$54
    bne poll

; Got signature, read data
    jsr get_byte
    sta size
    jsr get_byte
    sta size+1
    ldy #0

read:
    jsr get_byte
read_patch:
    sta $800,y
    sta $427
    iny
    bne skip_inc
    inc read_patch+2
    dec size+1
skip_inc:
    lda size+1
    bne read
    cpy size
    bcc read
    jmp $800

get_byte:
; The serial line must be idle (PB0 must have its high bit set)
; We simply must wait for the beginning of the start bit (PB0 high bit clear)
    lda #$80
    sta RCVBYTE
wait_for_start:
    bit PB0 ; 4
    bmi wait_for_start  ; 2

; We got the start sometime in the last 3-10 clocks.  Wait 1.5 bit times to
; grab the bits in the middle of the bit times.  There are $6a CPU cycles in
; one bit time.  So 1.5 bit times = $9f cycles.
    ldx #$1d        ; 2
    bne read_bit2   ; 3

read_bit:
    ldx #$12    ; 2
read_bit2:
    dex     ; 2
    bne read_bit2   ; Total cycles: X*5 - 1 = $90 (for first bit)
    lda PB0 ; 4
    asl     ; 2
    ror RCVBYTE ; 5
    bcc read_bit ; 3
; Above overhead, not counting read_bit2 loop, is 2+4+2+5+3=$10 clocks
; We need the wait to take $5a clocks (so $5a+$10=$6a), but we actually wait $59.
; We'll just slip one cycle, it's fine
; Delay from wait_for_start to lda PB0 in read_bit2 is: 4+2+2+3+$90=$9b

; If we get here, we have the byte in RCVBYTE.  We are 2+5+2 cycles past
; the middle of the last bit, and we want to wait about $5a clocks total
; before returning.
    ldx #$0d        ; 2
wait_stop:
    dex             ; 2
    bne wait_stop   ; $40: Total = X*5 - 1
    lda RCVBYTE  ; 3 (zero page)
    nop          ; 2
    rts

; from last lda PB0 to the fastest call back (count just the jsr):
; 2+5+2 + 2+$40+3+2+6+6=$5c.  We are solidly in the middle of the stop bit.
; Code should call back get_byte within $3c clocks.
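
To get a feel for how much the "slip one cycle" shortcut matters, here is a rough Python model of where each sample lands relative to the center of its bit. It is a sketch under stated assumptions: the CPU clock is taken as the average Apple IIe rate (the TLC may differ), the first sample is placed $9b cycles after the start bit is noticed with each later sample $59+$10 cycles after that (both figures taken from the comments above), and the detection latency is swept across the 3 to 10 clocks the comments mention.

```python
CPU_HZ = 1_020_484          # assumed average Apple IIe clock; the TLC may differ
BAUD = 9600
BIT = CPU_HZ / BAUD         # cycles per bit time, a little over 106


def sample_offsets(first, step, latency, bits=8):
    """Offset of each sample from its bit center, as a fraction of a bit time."""
    return [
        (latency + first + k * step - (1.5 + k) * BIT) / BIT
        for k in range(bits)
    ]


worst = max(
    abs(offset)
    for latency in (3, 10)
    for offset in sample_offsets(0x9B, 0x59 + 0x10, latency)
)
print(f"bit time ~{BIT:.1f} cycles, worst-case offset {worst:.3f} of a bit")
```

In this model every sample stays well inside half a bit time of its center, which is all a bit-banged receiver needs over a single 8-bit frame.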
david-schmidt commented 1 year ago

With more work from Kent and Peter Ferrie we're at 84 bytes. Additional savings can be had if we start modifying the payload to do various things, such as falling through from the screen to $800 (including screen hole data), encoding an EOF marker in the data rather than a header specifying the length, and likely others as well. I think we're at a decent compromise for now, but I will happily keep those ideas in mind if we want to get extra frisky:

PB0  =  $c061
size =  $07 ; and $08

    .org $300
Entry:
    jsr $fc58   ; HOME

    ldx #$c8
    stx $424
    inx
    stx $425

poll:
    jsr get_byte
    cmp #$54
    bne read_patch+1    ; Hit a BRK

; Got signature, read data
    ldx #$fe
get_size:
    jsr get_byte
    sta size+2,x
    inx
    bne get_size

read:
    jsr get_byte
read_patch:
    sta $800,x
    sta $427
    inx
    bne skip_inc
    inc read_patch+2
    dec size+1
skip_inc:
    lda size+1
    bne read
    cpx size
    bcc read
    jmp $800

get_byte:
; The serial line must be idle (PB0 must have its high bit set)
; We simply must wait for the beginning of the start bit (PB0 high bit clear)
wait_for_start:
    lda PB0
    bmi wait_for_start  ; 2

; We got the start sometime in the last 3-10 clocks.  Wait 1.5 bit times to
; grab the bits in the middle of the bit times.  There are 106 CPU cycles in
; one bit time.  So 1.5 bit times = 159 cycles
    ldy #29     ; 2
    lda #$80        ; 2
    .byte  $2c     ; 4 (BIT abs opcode: swallows the next two bytes, skipping ldy #19)

read_bit:
    ldy #19 ; 2
read_bit2:
    dey     
    bne read_bit2   ; Total cycles: Y*5 - 1 = 94 (144 on the first pass, Y=29)
    asl PB0 ; 6 (4 cycles to the read, 2 more cycles to shift and write)
    ror     ; 2
    bcc read_bit ; 3
; Above overhead, not counting read_bit2 loop, is 2+6+2+3=13 clocks
; We need the wait to take 93 clocks (so 93+13=106), but we actually wait 94.
; We'll just slip one cycle, it's fine
; Delay from wait_for_start to asl PB0 after read_bit2 is: 4+2+2+2+4+144=158

; If we get here, we have the byte in acc.  We are 2+2+2 cycles past
; the middle of the last bit, and we want to wait about 90 clocks total
; before returning.
    ldy #13     ;2
wait_stop:
    dey
    bne wait_stop   ; 64: Total = Y*5 - 1
    rts

; from last asl PB0 to the fastest call back (count just the jsr):
; 2+2+2 + 2+64+6+6=84.  We are solidly in the middle of the stop bit.
; Code should call back get_byte within 60 clocks.
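
Two of the space-saving tricks in this version are easy to miss, so here is a small Python model of them. It is a sketch only, with made-up helper names. `receive_byte` mimics seeding the accumulator with $80 and rotating sampled bits in until that sentinel bit falls out into carry, so no separate bit counter is needed. `zp_indexed` mimics the zero-page wraparound that lets `sta size+2,x` with X=$FE and $FF land on `size` and `size+1`. Line polarity is ignored here: a sample of 1 is simply treated as a 1 bit.

```python
def receive_byte(line_samples):
    """Model `lda #$80` followed by repeated `asl PB0` / `ror` / `bcc read_bit`."""
    acc = 0x80                      # sentinel bit starts in bit 7
    count = 0
    for sample in line_samples:
        carry_in = sample & 1       # asl PB0 moves the line level into carry
        carry_out = acc & 1         # ror pushes bit 0 out into carry
        acc = (carry_in << 7) | (acc >> 1)
        count += 1
        if carry_out:               # bcc falls through once the sentinel pops out
            return acc, count
    raise ValueError("ran out of samples before the sentinel fell out")


def zp_indexed(base, x):
    """Effective address of a zero-page,X access: the sum wraps within page zero."""
    return (base + x) & 0xFF


def lsb_first(value):
    """Bits of a byte in the order a serial line delivers them."""
    return [(value >> k) & 1 for k in range(8)]


byte, bits_read = receive_byte(lsb_first(0x54))
size_slots = [zp_indexed(0x07 + 2, x) for x in (0xFE, 0xFF)]
print(hex(byte), bits_read, [hex(a) for a in size_slots])
```

The sentinel drops out on exactly the eighth rotate, leaving the received byte in the accumulator with the first bit received in bit 0.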
david-schmidt commented 1 year ago

Here is where I'm going to claim victory with Kent Dickey and Peter Ferrie's gracious help - less than 50% of my original code size, and doesn't require any pacing from the host to run at full 9600 baud.

PB0  =  $c061

    .org    $300
Entry:
    jsr $fc58   ; HOME

    ldx #$3C    ; $3B00 bytes of payload (with $30 byte header): $3B full pages + 1 partial page

wait_byte:
    jsr get_byte
    cmp #$54
    bne wait_byte

read_payload:
    jsr get_byte
; from last asl PB0 to the fastest call back (count just the jsr):
; 2+2+2 + 2+64+6+6=84.  We are solidly in the middle of the stop bit.
; Code should call back get_byte within 60 clocks.

data_patch:
    sta $7d0
    sta $427
    inc data_patch+1
    bne read_payload
    inc data_patch+2
    dex
    bne read_payload
    jmp $800

get_byte:
; The serial line must be idle (PB0 must have its high bit set)
; We simply must wait for the beginning of the start bit (PB0 high bit clear)
wait_for_start:
    lda PB0
    bmi wait_for_start  ; 2

; We got the start sometime in the last 3-10 clocks.  Wait 1.5 bit times to
; grab the bits in the middle of the bit times.  There are 106 CPU cycles in
; one bit time.  So 1.5 bit times = 159 cycles
    ldy #29     ; 2
    lda #$80        ; 2

read_bit:
read_bit2:
    dey     
    bne read_bit2   ; Total cycles: Y*5 - 1 = 94
    ldy #19 ; 2
    asl PB0 ; 6 (4 cycles to the read, 2 more cycles to shift and write)
    ror     ; 2     
    bcc read_bit ; 3
; Above overhead, not counting read_bit2 loop, is 2+6+2+3=13 clocks
; We need the wait to take 93 clocks (so 93+13=106), but we actually wait 94.
; We'll just slip one cycle, it's fine
; Delay from wait_for_start to asl PB0 is: 4+2+2+2+144+2=156
; (lda PB0, bmi, ldy #29, lda #$80, first delay loop, ldy #19)

; If we get here, we have the byte in acc.  We are 2+2+2 cycles past
; the middle of the last bit, and we want to wait about 90 clocks total
; before returning.
    ldy #13     ;2
wait_stop:
    dey
    bne wait_stop   ; 64: Total = Y*5 - 1
    rts
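
Since this version drops the length header and hard-codes the transfer size, it is worth checking how many bytes the self-modifying store loop writes for a given starting value of X. Here is a Python model of just that loop (a sketch with a made-up function name; it tracks only the patched address and the X register, not timing). The low address byte is bumped after every store, and the high byte and X only change when the low byte wraps.

```python
def bytes_until_jump(page_count, start=0x07D0):
    """Count stores made by the data_patch loop before it falls through to jmp $800."""
    addr = start
    x = page_count & 0xFF
    stored = 0
    while True:
        stored += 1                          # sta <patched address>
        low = (addr + 1) & 0xFF              # inc data_patch+1
        addr = (addr & 0xFF00) | low
        if low != 0:
            continue                         # bne read_payload
        addr = (addr + 0x0100) & 0xFFFF      # inc data_patch+2
        x = (x - 1) & 0xFF                   # dex
        if x == 0:
            return stored, addr              # falls through to jmp $800


for count in (0x3B, 0x3C):
    stored, end = bytes_until_jump(count)
    print(f"X=${count:02X}: ${stored:04X} bytes stored, next address ${end:04X}")
```

In this model the first decrement of X happens after only $30 stores (the partial page from $7D0 to $7FF), so X has to be one more than the number of full payload pages.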
300: 20 58 FC A2 3C 20 23 03
308: C9 54 D0 F9 20 23 03 8D
310: D0 07 8D 27 04 EE 10 03
318: D0 F2 EE 11 03 CA D0 EC
320: 4C 00 08 AD 61 C0 30 FB
328: A0 1D A9 80 88 D0 FD A0
330: 13 0E 61 C0 6A 90 F5 A0
338: 0D 88 D0 FD 60
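
To double-check the "less than 50%" claim without counting by hand, the dump above can be parsed with a few lines of Python. This is a sketch: `parse_monitor_dump` is a made-up helper that assumes every line looks like `ADDR: XX XX ...` with contiguous addresses, and the 128-byte baseline comes from the `L$80` in the original BSAVE line.

```python
FINAL_DUMP = """\
300: 20 58 FC A2 3C 20 23 03
308: C9 54 D0 F9 20 23 03 8D
310: D0 07 8D 27 04 EE 10 03
318: D0 F2 EE 11 03 CA D0 EC
320: 4C 00 08 AD 61 C0 30 FB
328: A0 1D A9 80 88 D0 FD A0
330: 13 0E 61 C0 6A 90 F5 A0
338: 0D 88 D0 FD 60
"""


def parse_monitor_dump(text):
    """Return (start_address, code_bytes) from `ADDR: XX XX ...` monitor lines."""
    data = bytearray()
    start = None
    for line in text.splitlines():
        if ":" not in line:
            continue
        addr_text, _, rest = line.partition(":")
        addr = int(addr_text, 16)
        if start is None:
            start = addr
        if addr != start + len(data):
            raise ValueError(f"gap or overlap at ${addr:X}")
        data.extend(int(token, 16) for token in rest.split())
    return start, bytes(data)


start, code = parse_monitor_dump(FINAL_DUMP)
print(f"{len(code)} bytes at ${start:X}, {len(code) / 128:.0%} of the original 128")
```

The same helper works on the earlier dumps in this thread, which makes it easy to compare each revision against the original.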