Closed visual-trials closed 1 year ago
Need some more detail regarding the write patterns: while some of them are obvious, e.g. `1111` for contiguous fill and `1010` and `0101` for tile updates, some patterns such as `1101` are not. Would it be better to reserve some of the patterns for future use?
What is the resource impact (e.g. additional LUT4s used) for adding this capability?
It feels like there needs to be a formal review process for features like this to avoid unnecessary feature creep and ensure we aren't getting away from the original "retro" vision by adding too many advanced capabilities. This particular addition doesn't bother me since it feels like what advanced 8-bit hardware might have had as a steppingstone between the CPU doing everything and a full-blown blitter like the Amiga had. That said, we need people in the room who aren't microarchitecture experts and therefore are not likely to review this as part of the decision-making process.
> Need some more detail regarding the write patterns: while some of them are obvious, e.g. `1111` for contiguous fill and `1010` and `0101` for tile updates, some patterns such as `1101` are not. Would it be better to reserve some of the patterns for future use?
I have extended the use cases (see above), explaining and showing 9 useful multibyte patterns. Adding the 4 default single-byte patterns (and the blit use case) to that means 14 of the 16 patterns are useful. The other two are not easily used for anything else, so it makes sense to simply complete the set of possible byte patterns.
In the description of the issue, you've got this written as the 4-bytes-at-once example:

```asm
lda #1          ; Set address to % 4 == 1
sta $9F20
lda #0
sta $9F21
lda #%00110110  ; Set increment to 4 and bits 1 and 2 to 11b
sta $9F22
lda #42
sta DATA0       ; Write color value 4 bytes at a time
sta DATA0
sta DATA0
sta DATA0
sta DATA0
sta DATA0
```

This example uses `(ADDRESS % 4 == 1)` and Register Bits `11`, which according to the table is the `-+++` pattern.

It seems the pattern you'd want to use for this example is `++++`, which corresponds to `(ADDRESS % 4 == 0)` and Register Bits `10` from the table.
So is the table wrong, or the example wrong, or have I misunderstood both really badly?
@blinkdog good catch.
The example is wrong. The four-byte write used to be at wrpattern ($9F22 bits 1 and 2) 11b, address % 4 == 1, but today @visual-trials moved it to wrpattern 10b, address % 4 == 0. It seems he might have missed updating the example.
I've updated my fork/branch of the emulator to match the current behavior of this PR. Builds can be found here https://github.com/mooinglemur/x16-emulator/actions/runs/4029536711
Thanks! I fixed this in the example code.
Here is the reasoning behind the choices made in the mapping:
This has been discussed on Discord, but I thought it might be a good idea to post it here as well.
There has been a complete rewrite of this, so this is old and can be removed.
Closing
-- This feature has been made possible due to the help of many others on the X16 discord channel, most notably: MooingLemur, Xark, Wavicle (jburks) and Yazarchy. Many thanks to them! --
Writing and copying multiple bytes at once
This change allows for multibyte writing and copying of VRAM data. Its main purpose is to increase performance.
Register usage
This feature uses two previously unused bits: bits 1 and 2 of the ADDRx_H register ($9F22), together called Write pattern.
How it works
The number of bytes that are written at once to VRAM can be changed by setting the two bits in Write pattern. In the default mode (both bits set to 0), only one VRAM byte is changed when a byte is written to DATA0 or DATA1: the byte at exactly the address written to. When the Write pattern is set differently, multiple bytes (normally with the same value) are written to VRAM at the same time.
Up to 4 bytes can be written to VRAM at once, aligned to a 32-bit address. Which of the 4 bytes are overwritten can be controlled: all possible patterns of bytes written to VRAM can be set by combining the 2 bits of Write pattern and the lower 2 bits of the address that is written to. Only the pattern that writes nothing to VRAM cannot be set.
Below is the mapping:
($9F22)
FIXME Update the text below to the new pattern layout!
Note: Each byte pattern consists of four +'s and -'s. Each - or + represents a byte in VRAM. A + means the byte is written to and a - means the byte in VRAM is left untouched.
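As a rough illustration of the four-byte fill, here is a minimal sketch based on the corrected example from the discussion above (Write pattern 10b with address % 4 == 0 selecting `++++`). It assumes that mapping is still current, that ADDRSEL in CTRL ($9F25) is 0 so $9F20-$9F22 address data port 0, and that DATA0 sits at $9F23. The `VERA_*` label names are illustrative only.

```asm
; Sketch only, under the assumptions stated above.
VERA_ADDR_L = $9F20
VERA_ADDR_M = $9F21
VERA_ADDR_H = $9F22     ; bits 7-4: increment, bits 2-1: Write pattern (this PR)
VERA_DATA0  = $9F23

    lda #0              ; VRAM address $00000, so address % 4 == 0
    sta VERA_ADDR_L
    sta VERA_ADDR_M
    lda #%00110100      ; increment 4, Write pattern 10b, address bit 16 = 0
    sta VERA_ADDR_H
    lda #42             ; colour value to fill with
    sta VERA_DATA0      ; one store writes the value to VRAM bytes 0-3
    sta VERA_DATA0      ; the next store covers VRAM bytes 4-7
```

With the increment set to 4, each further `sta VERA_DATA0` advances to the next aligned group of four bytes.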
Blitting
A special combination of the lower 2 bits of the address and the 2 bits in Write pattern is used to signify copying of (up to) 4 bytes at the same time (aka "blitting"). This is when address % 4 == 0 and the 2-bit pattern is 11b. This works as follows:
This effectively allows for copying 4 bytes at the same time from VRAM to VRAM.
Masked blitting
It is also possible to blit parts of the 32-bit cache. The value written to VERA (using a `sta DATA0` or `sta DATA1` in step 2 above) is used as an inverted nibble mask. For example: writing the value `11000011b` to DATA0 in blit mode means that only the two middle bytes of a 32-bit value are copied.

Setting a bit to 0 in this inverted mask makes sure the corresponding nibble is copied. Setting a bit to 1 leaves the nibble untouched during the blit. The least significant bit (LSB) affects the nibble with the lowest VRAM address (the leftmost pixel) and the most significant bit (MSB) affects the nibble with the highest VRAM address (the rightmost pixel).
Note: the above feature (multibyte writing and copying) is limited to main Video RAM: $00000-$1F9BF
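To make the inverted nibble mask from the masked blitting section more concrete, here is a small sketch of the masked write on its own. It assumes blit mode is already configured and the 32-bit cache already holds the source data (that setup is not shown), and that DATA0 sits at $9F23. The `VERA_DATA0` label is illustrative.

```asm
; Sketch only: blit mode and cache contents are assumed to be set up already.
VERA_DATA0 = $9F23

    lda #%11000011      ; bits 2-5 clear: nibbles 2-5 (the two middle bytes)
                        ; are copied; bits 0-1 and 6-7 set: bytes 0 and 3
                        ; stay untouched
    sta VERA_DATA0

; Other mask values that follow from the same rule:
;   %00000000  copy all four bytes
;   %11111100  copy only the byte at the lowest address
;   %00111111  copy only the byte at the highest address
```

Because the mask is inverted, the plain "copy everything" case is simply a write of 0.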
Example code
In effect, you can write 4 bytes of VRAM at the same time using a single 6502 `sta` instruction (after you set up the registers correctly).

To copy VRAM to VRAM quickly, you can use an `lda` and `sta` (with a "blit-setting" configured).

Use cases
Here are a few use cases:
This shows the speed difference (on real HW) of filling/clearing the screen (source):
This shows the speed difference (on real HW) of copying bitmap data (source):
Backwards compatibility and HW testing
This feature uses 2 bits that were previously unused. If they are set to 00b, it is completely backwards compatible, since VERA will do basic 1-byte writes to VRAM. VERA's behaviour changes only if the bits are set differently.
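Code that has used a multibyte pattern and wants to hand VERA back in the default state could clear the two bits again. This is a minimal sketch, assuming ADDRx_H ($9F22) can be read back and that the currently selected data port is the one to reset. The `VERA_ADDR_H` label is illustrative.

```asm
; Sketch only: restore the default single-byte write behaviour.
VERA_ADDR_H = $9F22

    lda VERA_ADDR_H
    and #%11111001      ; clear bits 2 and 1 (Write pattern = 00b), keep the rest
    sta VERA_ADDR_H
```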
This feature has been tested for backwards compatibility with the following X16 applications/demos/games:
These all work as expected. This is a good indication that this feature is indeed backwards compatible and that most, if not all, software, demos and games have bits 1 and 2 (of $9F22) set to 0.
Note: this change to VERA has also been implemented for the X16 emulator (see PR by MooingLemur: https://github.com/commanderx16/x16-emulator/pull/470) and has shown the same behavior.
Analysis and research into VERA inner workings
Analysis of (parts of) VERA's inner workings has been performed in order to determine what would be needed to create this feature and what the effects would be on the other parts of VERA. The diagrams below show the (visual) results of the research: