Closed david-schmidt closed 1 year ago
When I'm sitting at my TLC, this is the paper I have in front of me. It takes a while to bang this all in. It sure would be awesome if L$ could be 40 instead of 80!
300: 20 58 FC A9 00 A8 85 08
308: A9 08 85 09 A2 C8 8E 24
310: 04 E8 8E 25 04 20 47 03
318: C9 54 D0 F9 20 47 03 8D
320: 82 03 20 47 03 8D 83 03
328: 20 47 03 90 D4 91 08 8D
330: 27 04 C8 D0 05 E6 09 CE
338: 83 03 CC 82 03 D0 E9 AD
340: 83 03 D0 E4 4C 00 08 A9
348: 09 8D 80 03 18 AD 61 C0
350: 10 FB AD 61 C0 30 FB A2
358: 1D CA D0 FD 24 00 18 AD
360: 61 C0 30 03 4C 68 03 38
368: CE 80 03 F0 0F AD 81 03
370: 6A 8D 81 03 A2 0E CA D0
378: FD 4C 5E 03 AD 81 03 60
BSAVE GRUBTLC,A$300,L$80
It could start at $7xx and "fall through" to $800 when length is satisfied... oh, except that's screen memory. You'd never be able to type it in. Never mind. Starting at $800 isn't negotiable - we're loading the literal ADTPro audio/joystick client directly to its initial location.
Kent Dickey's submission:
300: 20 58 fc a2 c8 8e 24 04
308: e8 8e 25 04 20 3b 03 c9
310: 54 d0 f9 20 3b 03 85 01
318: 20 3b 03 85 02 a0 00 20
320: 3b 03 99 00 08 8d 27 04
328: c8 d0 05 ee 24 03 c6 02
330: a5 02 d0 eb c4 01 90 e7
338: 4c 00 08 a9 80 85 00 2c
340: 61 c0 30 fb a9 02 20 a8
348: fc a2 0a ea ea ca d0 fb
350: ad 61 c0 49 80 0a 66 00
358: 90 ef a2 0f ca d0 fd ad
360: 61 c0 90 03 a5 00 60 00
grub2kent.asm:
PB0 = $c061
RCVBYTE = $00
size = $01 ; and $02
BUFP = $08 ; And $09
.org $300
Entry:
jsr $fc58 ; HOME
ldx #$c8
stx $424
inx
stx $425
poll:
jsr get_byte
cmp #$54
bne poll
; Got signature, read data
jsr get_byte
sta size
jsr get_byte
sta size+1
ldy #0
read:
jsr get_byte
read_patch:
sta $800,y
sta $427
iny
bne skip_inc
inc read_patch+2
dec size+1
skip_inc:
lda size+1
bne read
cpy size
bcc read
jmp $800
get_byte:
; The serial line must be idle (PB0 must have its high bit set)
; We simply must wait for the beginning of the start bit (PB0 high bit clear)
lda #$80
sta RCVBYTE
wait_for_start:
bit PB0
bmi wait_for_start
; We got the start sometime in the last 10 clocks. Wait 1.5 bit times so we
; grab the bits in the middle of the bit times. There are 106 CPU cycles in
; one bit time. So 1.5 bit times = 159 cycles
; https://6502disassembly.com/a2-rom/ says delay is A*A*2.5 + A*13.5 + 13
; A=1 -> 29
; A=2 -> 50
; A=3 -> 76 cycles
; A=4 -> 107 cycles
; A=5 -> 143 cycles
lda #$2 ; 2
jsr $fca8 ; WAIT: 50 cycles for JSR and called routine
read_bit:
ldx #10 ; 2
read_bit2:
nop
nop
dex
bne read_bit2 ; Total cycles: X*9 - 1 = 89
lda PB0 ; 4
eor #$80 ; 2
asl ; 2
ror RCVBYTE ; 5
bcc read_bit ; 3
; Above overhead, not counting read_bit2 loop, is 2+4+2+2+5+3=18 clocks
; We need the wait to take 88 more clocks, but we actually wait 89.
; We'll just slip one cycle, it's fine
; Delay from wait_for_start to lda PB0 in read_bit2 is: 4+2+52+2+89+4=153
; If we get here, we have the byte in RCVBYTE. We are 2+2+5+2 cycles past
; the middle of the last bit, and we want to wait about 90 clocks total
; before returning.
ldx #15
wait_stop:
dex
bne wait_stop ; 74: Total = X*5 - 1
lda PB0
bcc bad
lda RCVBYTE
rts
bad:
brk
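The cycle counts in the comments above can be double-checked with a few lines of arithmetic. This is a minimal sketch in Python rather than 6502 code: the function names are hypothetical, the WAIT formula is the one quoted in the listing, and the 106-cycle bit time is taken from the listing's comments rather than measured.

```python
# Cycle-budget helpers for the bit-bang receive loop.
# CYCLES_PER_BIT is the figure used in the listing's comments (an assumption here).
CYCLES_PER_BIT = 106


def wait_cycles(a):
    """Cycles for the ROM WAIT routine, per the formula quoted in the listing."""
    return a * a * 2.5 + a * 13.5 + 13


def delay_loop_cycles(count, per_iteration):
    """Cycles for a countdown loop such as NOP/NOP/DEX/BNE or DEX/BNE.

    Every pass costs `per_iteration` cycles except the last, where the
    branch is not taken and costs one cycle less.
    """
    return count * per_iteration - 1


# Per-bit cost in this listing: 18 cycles of overhead plus the X=10 loop.
per_bit = 18 + delay_loop_cycles(10, 9)
print("WAIT A=2:", wait_cycles(2), "cycles")
print("cycles per received bit:", per_bit, "target:", CYCLES_PER_BIT)
```

The same `delay_loop_cycles` helper covers the later listings, whose loops drop the two NOPs and cost 5 cycles per pass.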
After more work, Kent's contribution is down to 94 bytes:
300: 20 58 FC A2 C8 8E 24 04
308: E8 8E 25 04 20 3B 03 C9
310: 54 D0 F9 20 3B 03 85 07
318: 20 3B 03 85 08 A0 00 20
320: 3B 03 99 00 08 8D 27 04
328: C8 D0 05 EE 24 03 C6 08
330: A5 08 D0 EB C4 07 90 E7
338: 4C 00 08 A9 80 85 06 2C
340: 61 C0 30 FB A2 1D D0 02
348: A2 12 CA D0 FD AD 61 C0
350: 0A 66 06 90 F3 A2 0D CA
358: D0 FD A5 06 EA 60
PB0 = $c061
RCVBYTE = $06
size = $07 ; and $08
.org $300
Entry:
jsr $fc58 ; HOME
ldx #$c8
stx $424
inx
stx $425
poll:
jsr get_byte
cmp #$54
bne poll
; Got signature, read data
jsr get_byte
sta size
jsr get_byte
sta size+1
ldy #0
read:
jsr get_byte
read_patch:
sta $800,y
sta $427
iny
bne skip_inc
inc read_patch+2
dec size+1
skip_inc:
lda size+1
bne read
cpy size
bcc read
jmp $800
get_byte:
; The serial line must be idle (PB0 must have its high bit set)
; We simply must wait for the beginning of the start bit (PB0 high bit clear)
lda #$80
sta RCVBYTE
wait_for_start:
bit PB0 ; 4
bmi wait_for_start ; 2
; We got the start sometime in the last 3-10 clocks. Wait 1.5 bit times so we
; grab the bits in the middle of the bit times. There are $6a CPU cycles in
; one bit time. So 1.5 bit times = $9f cycles.
ldx #$1d ; 2
bne read_bit2 ; 3
read_bit:
ldx #$12 ; 2
read_bit2:
dex ; 2
bne read_bit2 ; Total cycles: X*5 - 1 = $90 (for first bit)
lda PB0 ; 4
asl ; 2
ror RCVBYTE ; 5
bcc read_bit ; 3
; Above overhead, not counting read_bit2 loop, is 2+4+2+5+3=$10 clocks
; We need the wait to take $5a clocks (so $5a+$10=$6a), but we actually wait $59.
; We'll just slip one cycle, it's fine
; Delay from wait_for_start to lda PB0 in read_bit2 is: 4+2+2+3+$90=$9b
; If we get here, we have the byte in RCVBYTE. We are 2+2+5+2 cycles past
; the middle of the last bit, and we want to wait about $5a clocks total
; before returning.
ldx #$0d ;2
wait_stop:
dex ; 2
bne wait_stop ; $40: Total = X*5 - 1
lda RCVBYTE ; 3 (zero page load)
nop ; 2
rts
; from last lda PB0 to the fastest call back (count just the jsr):
; 2+5+2 + 2+$40+3+2+6+6=$5c. We are solidly in the middle of the stop bit.
; Code should call back get_byte within $3c clocks.
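The receive loop needs no bit counter because RCVBYTE starts at $80: the marker bit is rotated down one place per sample and falls into carry after the eighth ROR. A minimal Python sketch of just that shift register follows. The function names are hypothetical, and it deliberately ignores timing and line polarity; each input bit is simply the carry value fed to ROR.

```python
def ror8(value, carry_in):
    """8-bit rotate right through carry, like the 6502 ROR instruction.

    Returns (new_value, carry_out).
    """
    carry_out = value & 1
    return ((carry_in << 7) | (value >> 1)) & 0xFF, carry_out


def receive_byte(sampled_bits):
    """Assemble one byte from sampled bits arriving LSB first.

    The register starts at $80. Each ROR pushes a sample in at bit 7;
    after eight rotations the $80 marker drops out into carry, which is
    what ends the loop (the `bcc read_bit` falls through).
    """
    acc = 0x80
    for bit in sampled_bits:
        acc, carry = ror8(acc, bit)
        if carry:
            return acc
    raise ValueError("ran out of samples before eight bits arrived")


def lsb_first(byte):
    """Helper for the example: the bit order a UART puts on the wire."""
    return [(byte >> i) & 1 for i in range(8)]


print(hex(receive_byte(lsb_first(0x54))))
```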
With more work from Kent and Peter Ferrie, we're at 84 bytes. Additional savings can be had if we start modifying the payload to do various things, such as falling through from the screen to $800 (including screen hole data), encoding an EOF marker in the data rather than a header specifying the length, and likely others as well. I think we're at a decent compromise for now, but will happily keep those ideas in mind if we want to get extra frisky:
PB0 = $c061
size = $07 ; and $08
.org $300
Entry:
jsr $fc58 ; HOME
ldx #$c8
stx $424
inx
stx $425
poll:
jsr get_byte
cmp #$54
bne read_patch+1 ; Not the signature: land on the $00 operand byte of sta $800,x, which runs as BRK
; Got signature, read data
ldx #$fe
get_size:
jsr get_byte
sta size+2,x ; X=$fe,$ff wraps within zero page to size, size+1
inx
bne get_size
read:
jsr get_byte
read_patch:
sta $800,x
sta $427
inx
bne skip_inc
inc read_patch+2
dec size+1
skip_inc:
lda size+1
bne read
cpx size
bcc read
jmp $800
get_byte:
; The serial line must be idle (PB0 must have its high bit set)
; We simply must wait for the beginning of the start bit (PB0 high bit clear)
wait_for_start:
lda PB0
bmi wait_for_start ; 2
; We got the start sometime in the last 3-10 clocks. Wait 1.5 bit times so we
; grab the bits in the middle of the bit times. There are 106 CPU cycles in
; one bit time. So 1.5 bit times = 159 cycles
ldy #29 ; 2
lda #$80 ; 2
.byte $2c ; 4 (BIT abs opcode: swallows the two bytes of ldy #19)
read_bit:
ldy #19 ; 2
read_bit2:
dey
bne read_bit2 ; Total cycles: Y*5 - 1 = 94 (144 for the first bit, Y=29)
asl PB0 ; 6 (4 cycles to the read, 2 more cycles to shift and write)
ror ; 2
bcc read_bit ; 3
; Above overhead, not counting read_bit2 loop, is 2+6+2+3=13 clocks
; We need the wait to take 93 clocks (so 93+13=106), but we actually wait 94.
; We'll just slip one cycle, it's fine
; Delay from wait_for_start to lda PB0 in read_bit2 is: 4+2+2+2+4+144=158
; If we get here, we have the byte in acc. We are 2+2+2+2 cycles past
; the middle of the last bit, and we want to wait about 90 clocks total
; before returning.
ldy #13 ;2
wait_stop:
dey
bne wait_stop ; 64: Total = Y*5 - 1
rts
; from last lda PB0 to the fastest call back (count just the jsr):
; 2+2+2 + 2+64+6+6=84. We are solidly in the middle of the stop bit.
; Code should call back get_byte within 60 clocks.
Here is where I'm going to claim victory with Kent Dickey and Peter Ferrie's gracious help - less than 50% of my original code size, and doesn't require any pacing from the host to run at full 9600 baud.
PB0 = $c061
.org $300
Entry:
jsr $fc58 ; HOME
ldx #$3C ; $3B00 bytes of payload (with $30 byte header); X drops once per page wrap
wait_byte:
jsr get_byte
cmp #$54
bne wait_byte
read_payload:
jsr get_byte
; from last lda PB0 to the fastest call back (count just the jsr):
; 2+2+2 + 2+64+6+6=84. We are solidly in the middle of the stop bit.
; Code should call back get_byte within 60 clocks.
data_patch:
sta $7d0
sta $427
inc data_patch+1
bne read_payload
inc data_patch+2
dex
bne read_payload
jmp $800
get_byte:
; The serial line must be idle (PB0 must have its high bit set)
; We simply must wait for the beginning of the start bit (PB0 high bit clear)
wait_for_start:
lda PB0
bmi wait_for_start ; 2
; We got the start sometime in the last 3-10 clocks. Wait 1.5 bit times so we
; grab the bits in the middle of the bit times. There are 106 CPU cycles in
; one bit time. So 1.5 bit times = 159 cycles
ldy #29 ; 2
lda #$80 ; 2
read_bit:
read_bit2:
dey
bne read_bit2 ; Total cycles: Y*5 - 1 = 94
ldy #19 ; 2
asl PB0 ; 6 (4 cycles to the read, 2 more cycles to shift and write)
ror ; 2
bcc read_bit ; 3
; Above overhead, not counting read_bit2 loop, is 2+6+2+3=13 clocks
; We need the wait to take 93 clocks (so 93+13=106), but we actually wait 94.
; We'll just slip one cycle, it's fine
; Delay from wait_for_start to asl PB0 after read_bit2 is: 4+2+2+2+144+2=156
; If we get here, we have the byte in acc. We are 2+2+2+2 cycles past
; the middle of the last bit, and we want to wait about 90 clocks total
; before returning.
ldy #13 ;2
wait_stop:
dey
bne wait_stop ; 64: Total = Y*5 - 1
rts
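The final loader drops the length header and instead counts page wraps of the self-modified store address. The Python sketch below models only that bookkeeping, to show how the value loaded into X translates into a byte count. The function and parameter names are hypothetical; the start address $7d0 and the instruction sequence are from the listing.

```python
def simulate_store_loop(start=0x07D0, pages=0x3C):
    """Model the data_patch loop and return (bytes_stored, last_address).

    `start` is the initial operand of `sta $7d0`; `pages` is the value
    loaded into X, which is decremented each time the low address byte wraps.
    """
    lo, hi, x = start & 0xFF, start >> 8, pages
    stored = 0
    while True:
        last = (hi << 8) | lo      # sta <patched address>
        stored += 1
        lo = (lo + 1) & 0xFF       # inc data_patch+1
        if lo != 0:
            continue               # bne read_payload
        hi = (hi + 1) & 0xFF       # inc data_patch+2
        x = (x - 1) & 0xFF         # dex
        if x == 0:
            return stored, last    # fall through to jmp $800


for pages in (0x3B, 0x3C):
    stored, last = simulate_store_loop(pages=pages)
    print(f"X=${pages:02X}: ${stored:04X} bytes stored, last write at ${last:04X}")
```

Under this model, X=$3C (the `A2 3C` in the hex dump) stores $3B30 bytes, which is a $30-byte header plus $3B00 bytes of payload, ending at $42FF. X=$3B would stop one page earlier, after $3A30 bytes.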
300: 20 58 FC A2 3C 20 23 03
308: C9 54 D0 F9 20 23 03 8D
310: D0 07 8D 27 04 EE 10 03
318: D0 F2 EE 11 03 CA D0 EC
320: 4C 00 08 AD 61 C0 30 FB
328: A0 1D A9 80 88 D0 FD A0
330: 13 0E 61 C0 6A 90 F5 A0
338: 0D 88 D0 FD 60
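For anyone who wants to check a typed-in listing against the byte counts quoted in this thread, here is a small Python sketch that parses monitor-style dump lines and counts the bytes. The function name is hypothetical; the dump text is copied from the lines above.

```python
def parse_monitor_dump(text):
    """Turn 'ADDR: XX XX ...' monitor lines into bytes.

    Raises ValueError if a line's address does not follow on directly
    from the bytes collected so far.
    """
    data = bytearray()
    base = None
    for line in text.strip().splitlines():
        addr_text, _, rest = line.partition(":")
        addr = int(addr_text, 16)
        if base is None:
            base = addr
        if addr != base + len(data):
            raise ValueError(f"gap or overlap at ${addr:X}")
        data.extend(int(token, 16) for token in rest.split())
    return bytes(data)


FINAL_DUMP = """
300: 20 58 FC A2 3C 20 23 03
308: C9 54 D0 F9 20 23 03 8D
310: D0 07 8D 27 04 EE 10 03
318: D0 F2 EE 11 03 CA D0 EC
320: 4C 00 08 AD 61 C0 30 FB
328: A0 1D A9 80 88 D0 FD A0
330: 13 0E 61 C0 6A 90 F5 A0
338: 0D 88 D0 FD 60
"""

code = parse_monitor_dump(FINAL_DUMP)
print(len(code), "bytes to type in")
```

The count comes out at 61 bytes, which is under half of the original 128.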
The hybrid joystick/audio ADTPro bootstrap loader is 128 bytes long - the poor soul that wants to do joystick bitbanging bootstrapping for themselves needs to type it all into the TLC monitor with a truly awful keyboard. Can you get my monstrosity down to 50% of its initial size? I bet so...
It has the following constraints:
The "protocol" on the wire: the loader waits for a "T" byte ($54); once it sees one, it receives two bytes as a length (MSB, LSB), then keeps reading bytes and saving them to memory starting at $800 until the length is satisfied. When done, it jumps to $800.
My code makes use of uninitialized variable space that doesn't need to be included in the payload the user types in. Those variables all live in 16-bit address space, so you can already save space by using zero page instead! First optimization is on me. :-)
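From the sending side, the framing described here could be sketched as below. This is a hypothetical Python helper, not code from ADTPro: the function name and the 1..$FFFF length check are assumptions, and it covers only this original header format, not the later variants in the thread that reorder or drop the length bytes.

```python
def frame_payload(payload):
    """Wrap a payload in the header the original loader waits for.

    Layout: 'T' ($54), length MSB, length LSB, then the payload bytes.
    """
    if not 0 < len(payload) <= 0xFFFF:
        raise ValueError("payload length must fit in two bytes and be non-zero")
    return b"T" + len(payload).to_bytes(2, "big") + bytes(payload)


frame = frame_payload(b"\x01\x02\x03")
print(frame.hex(" "))
```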
My initial shot at this is over in the ADTPro repo here, and currently satisfies all of the constraints (and goes the extra mile to check for some RS-232 framing errors): https://github.com/ADTPro/adtpro/blob/tlc-grub/src/client/prodos/serial/grub2/grub2joy.asm Comment density is about 50%, so hopefully that will help with context.