Multibyte write and copy

visual-trials commented 1 year ago

-- This feature has been made possible due to the help of many others on the X16 discord channel, most notably: MooingLemur, Xark, Wavicle (jburks) and Yazarchy. Many thanks to them! --

Writing and copying multiple bytes at once

This change allows for multibyte writing and copying of VRAM data. Its main purpose is to increase performance.

Register usage

This feature uses two previously unused bits: bits 1 and 2 of the ADDRx_H register, called Write pattern.

Addr	Name	Bit 7	Bit 6	Bit 5	Bit 4	Bit 3	Bit 2	Bit 1	Bit 0
$02	ADDRx_H (x=ADDRSEL)	Address Increment				DECR	Write pattern		VRAM Address (16)
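As a rough illustration of the layout above, the byte written to ADDRx_H can be composed from its fields. This is a Python sketch, not part of the PR: the helper name `addrx_h` and its parameter names are made up, the field positions are read off the register table, and the increment is passed as the raw 4-bit field value (the example code in this description uses %0011 for an increment of 4).

```python
def addrx_h(increment_field: int, decr: int, write_pattern: int, addr_bit16: int) -> int:
    """Pack the ADDRx_H ($9F22) fields into one byte (hypothetical helper)."""
    if not 0 <= increment_field <= 0b1111:
        raise ValueError("increment field is 4 bits wide")
    if not 0 <= write_pattern <= 0b11:
        raise ValueError("write pattern is 2 bits wide")
    if decr not in (0, 1) or addr_bit16 not in (0, 1):
        raise ValueError("DECR and address bit 16 are single bits")
    # Bits 7-4: increment, bit 3: DECR, bits 2-1: Write pattern, bit 0: address bit 16
    return (increment_field << 4) | (decr << 3) | (write_pattern << 1) | addr_bit16


# The two register values that appear in the example code of this description:
fill_value = addrx_h(0b0011, 0, 0b10, 0)
blit_value = addrx_h(0b0011, 0, 0b11, 0)
print(f"{fill_value:08b} {blit_value:08b}")
```

With a Write pattern of 00 the result is the same byte existing software already writes, which is why the default mode stays compatible.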

How it works

The number of bytes written to VRAM at once is selected by the two Write pattern bits. In the default mode (both bits set to 0) only one VRAM byte changes when a byte is written to DATA0 or DATA1: the byte at exactly the address written to. With any other Write pattern, multiple bytes (normally with the same value) are written to VRAM at the same time.

Up to 4 bytes can be written to VRAM at once, aligned to a 32-bit address. Which of the 4 bytes are overwritten can be controlled: all possible patterns of bytes written to VRAM can be set by combining the 2 bits of Write pattern and the lower 2 bits of the address that is written to. Only the pattern that writes nothing to VRAM cannot be set.

Below is the mapping:

Write pattern ($9F22 bits 1 and 2)	address % 4 = 0	address % 4 = 1	address % 4 = 2	address % 4 = 3
00	+---	-+--	--+-	---+
01	++++	-+-+	+-+-	++--
10	tr. blit	++-+	+++-	-++-
11	blit	+-++	-+++	--++

FIXME Update the text below to the new pattern layout!

Note: Each byte pattern consists of four + and - characters. Each + or - represents a byte in VRAM. A + means the byte is written to and a - means the byte in VRAM is left untouched.
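The mapping table can be read as a 4x4 lookup. The sketch below is a Python model of that lookup, not the Verilog in this PR. The names `WRITE_PATTERNS` and `multibyte_write` are invented for illustration, the entries are copied from the table as it currently stands, and the two blit entries are kept as labels rather than modeled.

```python
# Rows: Write pattern (bits 1 and 2 of $9F22). Columns: address % 4.
# '+' = byte is written, '-' = byte is left untouched.
WRITE_PATTERNS = {
    0b00: ["+---", "-+--", "--+-", "---+"],
    0b01: ["++++", "-+-+", "+-+-", "++--"],
    0b10: ["tr. blit", "++-+", "+++-", "-++-"],
    0b11: ["blit", "+-++", "-+++", "--++"],
}


def multibyte_write(vram: bytearray, address: int, write_pattern: int, value: int) -> None:
    """Write `value` to every '+' byte of the 32-bit aligned group (model only)."""
    entry = WRITE_PATTERNS[write_pattern][address % 4]
    if entry in ("blit", "tr. blit"):
        raise NotImplementedError("blit modes are not modeled in this sketch")
    base = address & ~3  # start of the 32-bit aligned group
    for offset, flag in enumerate(entry):
        if flag == "+":
            vram[base + offset] = value


vram = bytearray(8)
multibyte_write(vram, 0, 0b01, 42)  # '++++' at address % 4 == 0
multibyte_write(vram, 5, 0b01, 7)   # '-+-+' at address % 4 == 1
print(list(vram))
```

In this model the second call leaves bytes 4 and 6 untouched and sets bytes 5 and 7, which is the spacing the tile-attribute use case relies on.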

Blitting

A special combination of the lower 2 bits of the address and the 2 bits in Write pattern is used to signify copying of (up to) 4 bytes at the same time (aka "blitting"). This is when address % 4 == 0 and the 2-bit pattern is 11b. This works as follows:

1. Whenever there is a read from DATA0 or DATA1 (while this "blit-setting" is active), an internal 32-bit cache is filled with the 32-bit value at VRAM address ADDR0 or ADDR1 (aligned to 32 bits).
2. Whenever a value of 0 is written to DATA0 or DATA1 (while this "blit-setting" is active), the cached 32-bit value is written to VRAM address ADDR0 or ADDR1 (aligned to 32 bits).

This effectively allows for copying 4 bytes at the same time from VRAM to VRAM.
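The two steps can be mimicked with a small Python model. It only illustrates the behaviour described here: the class and method names are hypothetical, address auto-increment is left out, and the initial cache contents and the value returned by the read are assumptions.

```python
class BlitModel:
    """Toy model of the 32-bit blit cache described above (not VERA's actual logic)."""

    def __init__(self, vram_size: int) -> None:
        self.vram = bytearray(vram_size)
        self.cache = bytes(4)  # assumption: the cache starts out as zeros

    def read_data(self, address: int) -> int:
        """A read from DATAx fills the cache from the 32-bit aligned group."""
        base = address & ~3
        self.cache = bytes(self.vram[base:base + 4])
        return self.vram[address]  # assumption: the CPU still sees the addressed byte

    def write_data_zero(self, address: int) -> None:
        """A write of value 0 to DATAx stores the whole cache at the aligned group."""
        base = address & ~3
        self.vram[base:base + 4] = self.cache


model = BlitModel(16)
model.vram[0:4] = b"\x11\x22\x33\x44"
model.read_data(0)        # corresponds to: lda DATA1 with ADDR1 = 0
model.write_data_zero(8)  # corresponds to: stz DATA0 with ADDR0 = 8
print(model.vram[8:12].hex())
```

One read followed by one write moves four bytes, which is where the "4 times as fast" figures in the use cases come from.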

Masked blitting

It is also possible to blit only parts of the 32-bit cache. The value written to VERA (using a sta DATA0 or sta DATA1 in step 2 above) is used as an inverted nibble mask, with one mask bit for each of the 8 nibbles. For example: writing the value 11000011b to DATA0 in blit mode means that only the two middle bytes of the 32-bit value are copied.

Setting a bit to 0 in this inverted mask copies the corresponding nibble. Setting a bit to 1 leaves that nibble untouched during the blit. The least significant bit (LSB) affects the nibble with the lowest VRAM address (the leftmost pixel) and the MSB affects the nibble with the highest VRAM address (the rightmost pixel).
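To make the inverted mask concrete, the sketch below treats the 32-bit cache and the destination as eight nibbles, listed from the lowest VRAM address (leftmost pixel) to the highest. The function name `masked_blit` is made up for this illustration, and how nibbles are packed into bytes is deliberately left out of the model.

```python
def masked_blit(cache: list[int], dest: list[int], mask: int) -> list[int]:
    """Return dest after a masked blit; a 0 in bit i of mask copies nibble i."""
    if len(cache) != 8 or len(dest) != 8:
        raise ValueError("expected eight nibbles each")
    result = list(dest)
    for i in range(8):  # bit 0 (LSB) controls the nibble at the lowest address
        if (mask >> i) & 1 == 0:
            result[i] = cache[i]
    return result


cache = [1, 2, 3, 4, 5, 6, 7, 8]
dest = [0] * 8
print(masked_blit(cache, dest, 0b11000011))  # only nibbles 2..5 (the two middle bytes)
print(masked_blit(cache, dest, 0))           # full blit, all 8 nibbles
```

A mask of 0 is the plain blit from the previous section, so stz DATA0 is simply the special case where nothing is masked out.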

Note: the above feature (multibyte writing and copying) is limited to main Video RAM: $00000-$1F9BF

Example code

In effect you can write 4 bytes of VRAM at the same time using a single 6502 sta instruction (after you set up the registers correctly):

lda #0          ; Set address to % 4 == 0
sta $9F20
lda #0
sta $9F21

lda #%00110100  ; Set increment to 4 and bits 1 and 2 to 10b
sta $9F22

lda #42
sta DATA0       ; Write color value 4 bytes at a time
sta DATA0
sta DATA0
sta DATA0
sta DATA0
sta DATA0
...

To copy VRAM to VRAM quickly you can use an lda and sta pair (with a "blit-setting" configured):

lda #0          ; Set address to % 4 == 0
sta $9F20
sta $9F21

lda #%00110110  ; Set increment to 4 and bits 1 and 2 to 11b
sta $9F22

lda DATA1       ; Copy 4 bytes at a time
stz DATA0       ; Writing a 0 during the blit will do a full blit (all 8 nibbles)
lda DATA1
stz DATA0
lda DATA1
stz DATA0
lda DATA1
stz DATA0
lda DATA1
...

Use cases

Here are a few use cases:

Writing 4 bytes at a time (pattern: ++++)
- Clearing a bitmap screen 4 times as fast
- Drawing (large) polygons much faster
Writing 2-3 consecutive bytes at the same time (patterns: ++--, -++-, --++, +++-, -+++)
- Drawing multiple (textured) pixel columns at the same time
- In essence: just like deferred column rendering in DOOM and Wolfenstein 3D worked.
Writing 2 non-consecutive bytes at the same time (patterns: +-+-, -+-+)
- Updating tile attributes (which are spaced 2 bytes apart) more quickly
Copying 4 bytes of data from VRAM to VRAM (blit)
- Copying tiledata 4 times as fast (for tile animations)
- Copying bitmap data 4 times as fast
Partial copying of up to 4 bytes from VRAM to VRAM (masked blit)
- Copying an image with a transparency mask map
- Drawing pseudo random dithering pixels

This shows the speed difference (on real HW) of filling/clearing the screen (source):

[Images: Clearing screen 1 byte per write; Clearing screen 4 bytes per write]

This shows the speed difference (on real HW) of copying bitmap data (source):

[Images: Copy bitmap 1 byte per copy; Copy bitmap 4 bytes per copy]

Backwards compatibility and HW testing

This feature uses 2 bits that were previously unused. If they are set to 00b the feature is completely backwards compatible, since VERA then does basic 1-byte writes to VRAM. VERA's behaviour changes only if the bits are set to a different value.

This feature has been tested for backwards compatibility with the following X16 applications/demos/games:

Application/Demo/Game	Tested by
X16 Hardware Tester	JeffreyH
X16 Kernal / Basic	JeffreyH
Super Mario Bros (X16 port)	JeffreyH
STNICCC demo	JeffreyH
Wolf3D demo	JeffreyH
...	...
...	...
...	...

These all work as expected. This is a good indication that this feature is indeed backwards compatible and that most if not all software / demos / games leave bits 1 and 2 (of $9F22) set to 0.

Note: this change to VERA has also been implemented for the X16 emulator (see PR by MooingLemur: https://github.com/commanderx16/x16-emulator/pull/470) and has shown the same behavior.

Analysis and research into VERA inner workings

Analysis of (parts of) VERA's inner workings has been performed in order to determine what would be needed to create this feature and what the effects would be on the other parts of VERA. The diagrams below show the (visual) results of the research:

[Images: VERA_diagrams-vram_if v; VERA_diagrams top v_with_multibyte_write_and_copy]

jburks commented 1 year ago

Need some more detail regarding the write patterns: while some of them are obvious, e.g. 1111 for contiguous fill and 1010 and 0101 for tile updates, some patterns such as 1101 are not. Would it be better to reserve some of the patterns for future use?

jburks commented 1 year ago

What is the resource impact (e.g. additional LUT4s used) for adding this capability?

jburks commented 1 year ago

It feels like there needs to be a formal review process for features like this to avoid unnecessary feature creep and ensure we aren't getting away from the original "retro" vision by adding too many advanced capabilities. This particular addition doesn't bother me since it feels like what advanced 8-bit hardware might have had as a steppingstone between the CPU doing everything and a full-blown blitter like the Amiga had. That said, we need people in the room who aren't microarchitecture experts and therefore are not likely to review this as part of the decision-making process.

visual-trials commented 1 year ago

Need some more detail regarding the write patterns: while some of them are obvious, e.g. 1111 for contiguous fill and 1010 and 0101 for tile updates, some patterns such as 1101 are not. Would it be better to reserve some of the patterns for future use?

I have extended the use cases (see above) to explain and show 9 useful multibyte patterns. Adding the 4 default single-byte patterns (and the blit use case) means there are 14 (out of 16) useful patterns. The two others are not easily used for anything else, so it makes sense to just complete the set of possible byte patterns.

blinkdog commented 1 year ago

In the description of the issue, you've got this written as the 4-bytes-at-once example:

lda #1          ; Set address to % 4 == 1
sta $9F20
lda #0
sta $9F21

lda #%00110110  ; Set increment to 4 and bits 1 and 2 to 11b
sta $9F22

lda #42
sta DATA0       ; Write color value 4 bytes at a time
sta DATA0
sta DATA0
sta DATA0
sta DATA0
sta DATA0

This example uses the (ADDRESS % 4 == 1) and Register Bits 11, which according to the table is the -+++ pattern.

It seems the pattern you'd want to use for this example is ++++, which corresponds to (ADDRESS % 4 == 0) and Register Bits 10 from the table.

So is the table wrong, or the example wrong, or have I misunderstood both really badly?

mooinglemur commented 1 year ago

@blinkdog good catch.

The example is wrong. The four-byte write used to be at wrpattern ($9F22 bits 1 and 2) 11b, address % 4 == 1, but today @visual-trials moved it to wrpattern 10b, address % 4 == 0. It seems he might have missed updating the example.

mooinglemur commented 1 year ago

I've updated my fork/branch of the emulator to match the current behavior of this PR. Builds can be found here https://github.com/mooinglemur/x16-emulator/actions/runs/4029536711

visual-trials commented 1 year ago

Thanks! I fixed this in the example code.

visual-trials commented 1 year ago

Here is the reasoning behind the choices made in the mapping:

This has been discussed on Discord, but I thought it might be a good idea to post it here as well.

visual-trials commented 1 year ago

There has been a complete rewrite of this, so this is old and can be removed.

visual-trials commented 1 year ago

Closing

fvdhoef / vera-module