X16Community / vera-module

Versatile Embedded Retro Adapter
MIT License
14 stars 5 forks source link

Prepare for FX update: refactor of video_modulator.v for 1 DSP, 30+ LUTs and less timing pressure #2

Closed visual-trials closed 1 year ago

visual-trials commented 1 year ago

Introduction

This is a change to video_modulator.v to free up resources needed for the FX update. Most notably a single DSP.

It has the following effect:

Background

The current version of VERA uses 8 DSPs:

These latter calculations (y_s, i_s and q_s) use multiplications with constants. Since constant-multiplication can be replaced by additions the synthesizer can optimize away many of the 9 multiplications. In fact in the current version of VERA six of these multiplications are implemented using LUTs. Three of them are implemented using one DSP each.

Using both multipliers in a DSP

It is important to note that a DSP contains two independent 8x8 multipliers with independent 16 bit outputs. So at the moment only half of each of the above mentioned three DSPs is utilized.

In order to make use of both multipliers in the DSP the primitive pmi_dsp is used (as this seems to be the only way of letting the synthesizer use both 8x8 multipliers in a DSP). It is used in unsigned 8x8 form, so with these settings:

TOPOUTPUT_SELECT = 2'b10
BOTOUTPUT_SELECT = 2'b10
A_SIGNED = 1'b0
B_SIGNED = 1'b0

For more information see: DSP Function Usage Guide for iCE40 Devices

These are the 4 multiplications we use the two DSPs for:

    // We set up two DSPs for 4 of the 9 multiplications

    wire [15:0] Y_G_times_g_16, I_G_times_g_16;
    wire [15:0] Y_R_times_r_16, I_R_times_r_16;

    // We use one DSP for two 8x8 unsigned multiplications
    video_modulator_mult_u8xu8_pair video_modulator_mult_ig_yg (
        .clk(clk),

        // I_G_times_g = I_G * g
        .input_1a_8(I_G[7:0]),
        .input_1b_8({4'b0000, g}),
        .output_1_16(I_G_times_g_16),

        // Y_G_times_g = Y_G * g
        .input_2a_8(Y_G[7:0]),
        .input_2b_8({4'b0000, g}),
        .output_2_16(Y_G_times_g_16)
    );

    // We use another DSP for two 8x8 unsigned multiplications
    video_modulator_mult_u8xu8_pair video_modulator_mult_yr_ir (
        .clk(clk),

        // Y_R_times_r = Y_R * r
        .input_1a_8(Y_R[7:0]),
        .input_1b_8({4'b0000, r}),
        .output_1_16(Y_R_times_r_16),

        // I_R_times_r = I_R * r
        .input_2a_8(I_R[7:0]),
        .input_2b_8({4'b0000, r}),
        .output_2_16(I_R_times_r_16)
    );

Explicitily adding instead of multiplying

The remaining 5 constant-multiplications (of the 9) are implemented by using additions of several left-shifted values. Here is the code that preps these numbers:

    // We need these five differently shifted values to replace the remaining multiplications by additions

    wire [4:0] g_times_2  = { g, 1'b0 };
    wire [9:0] g_times_64 = { g, 6'b000000 };

    wire [4:0] b_times_2  = { b, 1'b0 };
    wire [6:0] b_times_8  = { b, 3'b000 };
    wire [8:0] b_times_32 = { b, 5'b00000 };

Here is the code that puts all multiplications together:

    // We put together all the 9 multiplication results (all unsigned so far)

    wire [11:0] Y_R_times_r = Y_R_times_r_16[11:0];   // Y_R_times_r = Y_R * r
    wire [11:0] Y_G_times_g = Y_G_times_g_16[11:0];   // Y_G_times_g = Y_G * g
    wire [11:0] Y_B_times_b = b_times_8 + b_times_2;  // Y_B_times_b = Y_B * b and since Y_B is 10 (8+2), Y_B_times_b = 8*b + 2*b

    wire [11:0] Q_R_times_r = Y_R_times_r;            // Q_R_times_r = Q_R * r and since Q_R is equal to Y_R, Q_R_times_r = Y_R_times_r
    wire [11:0] Q_G_times_g = g_times_64 + g_times_2; // Q_G_times_g = Q_G * g and since Q_G is 66 (64+2), Q_G_times_g = 64*g + 2*g
    wire [11:0] Q_B_times_b = b_times_32 + b_times_8; // Q_B_times_b = Q_B * b and since Q_B is 40 (32+8), Q_B_times_b = 32*b + 8*b

    wire [11:0] I_R_times_r = I_R_times_r_16[11:0];   // I_R_times_r = I_R * r
    wire [11:0] I_G_times_g = I_G_times_g_16[11:0];   // I_G_times_g = I_G * g
    wire [11:0] I_B_times_b = Q_B_times_b + b;        // I_B_times_b = I_B * b and since I_B is 41 (32+8+1), I_B_times_b = Q_B_times_b + 1*b

By doing it this way 9 multiplications are implemented using 2 DSPs and several additions (implemented in LUTs). This means a single DSP has become available.

Unsigned vs signed

Using unsigned numbers instead of signed numbers makes it easier to replace multiplications with additions. The multiplications with a constant that contain the least amount of 1's are chosen here, since they require the least amount of additions.

In order to accomplish multiplying with constants that are negative, they are initially negated:

    parameter I_G = 35; // -0.2746 (this should actually be -35, so *after* multiplication the result is negated)
    parameter I_B = 41; // -0.3213 (this should actually be -41, so *after* multiplication the result is negated)
    parameter Q_G = 66; // -0.5227 (this should actually be -66, so *after* multiplication the result is negated)

And when the multiplications are added together they are negated again (meaning: they are subtracted), like so:

                y_s <= Y_R_times_r + Y_G_times_g + Y_B_times_b + (128 + 512);
                i_s <= I_R_times_r - I_G_times_g - I_B_times_b;                 // Effectively negating I_G and I_B here
                q_s <= Q_R_times_r - Q_G_times_g + Q_B_times_b;                 // Effectively negating Q_G here

This allows for the multiplications to be completely equivalent with the original implementation.

Also note that r, g and b are 4-bit numbers (not 8 bit). And all of the constants are effectively 7-bit (unsigned) numbers. This means that after multiplying, the resulting number cannot be more than 11 bits long and bit 11 cannot be set to 1 (which would be interpreted as a negative number).

Timing pressure

Right now the inputs of the r_s, g_s and b_s signals are chosen/muxed before multiplication. In other words: for each of the 9 multiplications there is a mux deciding whether the multiplier should multiply an actual signal (r, g or b) or simply a 0. Here is the code that did that:

    wire signed [4:0] r_s = color_burst ? 5'd9 : {1'b0, r};
    wire signed [4:0] g_s = color_burst ? 5'd9 : {1'b0, g}; 
    wire signed [4:0] b_s = color_burst ? 5'd0 : {1'b0, b};

These muxes have the color_burst signal as their select line. But this means that the multiplier has to 'wait' for the color_burst signal to be ready. This turns out to be putting a lot of timing pressure on the inputs of the flipflops y_s, i_s and q_s. This is because there is quite a long path between hcnt (in video_composite.v) and color_burst. And also because DSPs are usually located further away (they are positioned only at certain places on the chip) and the way back-and-forth takes longer than with (local) LUTs.

This timing pressure can be alleviated by checking for the color_burst signal after/independent of the multiplications: only when y_s, i_s and q_s are assigned is there a decision whether this is done by multiplication and additions of only constants (which is optimized to assigning to a constant) or by the addition of the results of several multiplications. The multiplier does not have to wait for the color_burst signal anymore.

This means the code needed to be refactored a bit. So this code:

    wire signed [4:0] r_s = color_burst ? 5'd9 : {1'b0, r};
    wire signed [4:0] g_s = color_burst ? 5'd9 : {1'b0, g}; 
    wire signed [4:0] b_s = color_burst ? 5'd0 : {1'b0, b};

   ...

    always @(posedge clk) begin
        y_s <= (sync_n_in == 0) ? 12'd0 : 12'd544;
        i_s <= 0;
        q_s <= 0;

        if (active) begin
            y_s <= (Y_R * r_s) + (Y_G * g_s) + (Y_B * b_s) + (128 + 512);
        end

        if (active || color_burst) begin
            i_s <= (I_R * r_s) + (I_G * g_s) + (I_B * b_s);
            q_s <= (Q_R * r_s) + (Q_G * g_s) + (Q_B * b_s);
        end
    end

is replaced by this code:

    always @(posedge clk) begin

        case ({active, color_burst})
            2'b00: begin
                y_s <= (sync_n_in == 0) ? 12'd0 : 12'd544;
                i_s <= 0;
                q_s <= 0;
            end
            2'b01: begin
                y_s <= (sync_n_in == 0) ? 12'd0 : 12'd544;
                i_s <= (I_R * 5'd9) - (I_G * 5'd9) - (I_B * 5'd0);
                q_s <= (Q_R * 5'd9) - (Q_G * 5'd9) + (Q_B * 5'd0);
            end
            2'b10: begin
                y_s <= Y_R_times_r + Y_G_times_g + Y_B_times_b + (128 + 512);
                i_s <= I_R_times_r - I_G_times_g - I_B_times_b;                 // Effectively negating I_G and I_B here
                q_s <= Q_R_times_r - Q_G_times_g + Q_B_times_b;                 // Effectively negating Q_G here
            end
            2'b11: begin
                y_s <= (Y_R * 5'd9) + (Y_G * 5'd9) + (Y_B * 5'd0) + (128 + 512);
                i_s <= (I_R * 5'd9) - (I_G * 5'd9) - (I_B * 5'd0);
                q_s <= (Q_R * 5'd9) - (Q_G * 5'd9) + (Q_B * 5'd0);
            end
        endcase

    end

Checking and testing

It is important to check whether the changes really do not cause any changes in behavior. This can be done analytically and empirically. Reviewing the code methodically will certainly help confidence. Testing in real hardware will do as well.

Note: the author (JeffreyH) does not have the hardware to practically test s-video or composite video on a monitor/tv (or measure the output signals), so it would be very helpful (and prudent) to do this before accepting this pull request and merge into the fx branch.

When testing on real hardware: you can check the version number of VERA to see if you run v0.1.2 instead of v0.1.1.