comse6998 / spring2024

Repository for COMSE6998 in the Spring 2024 term

ZDOTC goes into infinite loop sometimes #163

Open joseemoreira opened 4 months ago

joseemoreira commented 4 months ago

./zdotc 1 1 1 && ./run zdotc.tr 1000 executes correctly, producing

zdotc.tr (# of architected instr = 43, # of speculative instr = 32, # of operations duspatched = 41, # of operations issued = 41, # of operations completed = 41, # of cycles = 90)

but

./zdotc 1 -1 1 && ./run zdotc.tr 1000

produces

zdotc.tr (# of architected instr = 47, # of speculative instr = 24, # of operations duspatched = 44, # of operations issued = 30, # of operations completed = 30, # of cycles = 1000)

It looks like operations are stuck in the issue queue. You can set debugging = true to get some more run-time information.

Anirudh-1149 commented 4 months ago

Thanks for the analysis. I will look further into this issue.

Anirudh-1149 commented 4 months ago

The problem is still that a register is erased before an earlier operation has used it. In zdotc, the for loop starts with 4 operations:

rdjki(9, 1, 5)   // X7 = MEM[X1 (x) + X5 (ix)] (x real)
rdjki(10, 3, 6)  // X8 = MEM[X3 (y) + X6 (iy)] (y real)
isjkj(1, 1)      // X1 (x) = X1 (x) + 1
isjkj(3, 1)      // X3 (y) = X3 (y) + 1

The read operations and the addition operations can execute independently. But if register 1 is erased before rdjki(9, 1, 5) reads it, the read operation will be stuck forever. In my simulations, logical register 1 and physical register 1 are the same.

This is precisely what is happening. The operation rdjki(9, 1, 5) is waiting for logical register 5 to be ready. This takes time because incx = -1 and a few operations are needed to compute logical register 5.

In the meantime, the operation isjkj(1, 1) uses register 1, and register 1 is erased before the operation rdjki(9, 1, 5) gets to use it.
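The hazard described above can be reproduced in a toy model. This is only a sketch, not the simulator's code: the function name run, the two cycle parameters, and the ready table are made up for illustration. It assumes an operation issues only when both source registers are marked ready, and that erasing a register clears its ready bit.

```python
# Toy model (hypothetical, not the course simulator): the read waits on two
# physical registers, P1 (x pointer) and P33 (ix), and issues when both are ready.
def run(erase_cycle, k_ready_cycle, max_cycles=20):
    """Return the cycle at which the read issues, or None if it never does."""
    ready = {1: True, 33: False}   # P1 is ready; P33 is still being computed
    for cycle in range(max_cycles):
        if cycle == k_ready_cycle:
            ready[33] = True       # the producer of ix finally completes
        if cycle == erase_cycle:
            ready[1] = False       # a younger operation commits and P1 is erased
        if ready[1] and ready[33]:
            return cycle           # both sources ready: the read issues
    return None                    # the read is still sitting in the issue queue


print(run(erase_cycle=10, k_ready_cycle=5))  # 5: the read issues before P1 is erased
print(run(erase_cycle=3, k_ready_cycle=5))   # None: P1 was erased first, read is stuck
```

With incx = 1 the index is ready early, which corresponds to the first call. With incx = -1 the index arrives late, which corresponds to the second call and matches the run that hits the 1000-cycle limit.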

Anirudh-1149 commented 4 months ago

I added a new debugging log message for erasing registers. The logs below confirm the behavior described above.

Logical register 5 points to physical register 33, logical register 1 points to physical register 1, and logical register 9 points to physical register 4. 0007.000f.b3.0004.0001.0021.000000 is the agen instruction for the operation rdjki(9, 1, 5).

Testing inputs for operation 0007.000f.b3.0004.0001.0021.000000, F = 179, j = 1 (1), k = 33 (0)
Testing inputs for operation 0007.0010.25.0005.0004.0000.000000, F = 37, j = 4 (0)
Testing inputs for operation 0008.0015.b3.001a.0008.0021.000000, F = 179, j = 8 (1), k = 33 (0)
Testing inputs for operation 0008.0016.25.0013.001a.0000.000000, F = 37, j = 26 (0)
Testing inputs for operation 0008.0017.b3.0024.0023.001b.000000, F = 179, j = 35 (1), k = 27 (1)
Testing inputs for operation 0008.001a.13.0022.0023.0000.000001, F = 19, j = 35 (1)
Testing inputs for operation 0009.001d.a0.0028.0013.001e.000000, F = 160, j = 19 (0), k = 30 (0)
Testing inputs for operation 0009.001e.80.001f.0027.0028.000000, F = 128, j = 39 (0), k = 40 (0)
CQ[1] : completing operation 0007.0014.12.0023.0003.0000.000001
Erasing the register : 1
34 | 0000.00000000.0000.00000000 | 0000.00.0000.0000.0000.000000 | 0000.00.0000.0000.0000.000000 | 0000.0000.00.0000.0000.0000.000000.000a.0020.80.0029.001d.0004.000000 | 0000.0000.00.0000.0000.0000.000000 | 0000.0000.00.0000.0000.0000.000000 | 0000.0000.00.0000.0000.0000.000000 | 0000.0000.00.0000.0000.0000.000000 | 0000.0000.00.0000.0000.0000.000000 | 0000.0000.00.0000.0000.0000.000000 | 0000.0000.00.0000.0000.0000.000000 | 0000.0000.00.0000.0000.0000.000000 | 0000.0000.00.0000.0000.0000.000000 | 0000.0000.00.0000.0000.0000.000000 | 0000.0000.00.0000.0000.0000.000000 | 0007.0014.12.0023.0003.0000.000001
cycle 34 : (# of instr = 47)
Testing inputs for operation 0007.000f.b3.0004.0001.0021.000000, F = 179, j = 1 (0), k = 33 (0)
Testing inputs for operation 0007.0010.25.0005.0004.0000.000000, F = 37, j = 4 (0)
Testing inputs for operation 0008.0015.b3.001a.0008.0021.000000, F = 179, j = 8 (1), k = 33 (0)
Testing inputs for operation 0008.0016.25.0013.001a.0000.000000, F = 37, j = 26 (0)
Testing inputs for operation 0008.0018.25.001e.0024.0000.000000, F = 37, j = 36 (0)
Testing inputs for operation 0009.001b.a0.0026.0005.0007.000000, F = 160, j = 5 (0), k = 7 (0)
Testing inputs for operation 0009.001c.80.0027.001c.0026.000000, F = 128, j = 28 (1), k = 38 (0)
Testing inputs for operation 000a.001f.a0.0004.0005.001e.000000, F = 160, j = 5 (0), k = 30 (0)
Testing inputs for operation 0008.001a.13.0022.0023.0000.000001, F = 19, j = 35 (1)
OI[0] : issuing operation 0008.0017.b3.0024.0023.001b.000000
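For reading these logs, a small decoder for the dotted operation strings may help. This is a sketch under an assumption inferred only from the log lines above: the third hex field is F, and the fourth, fifth, and sixth are the physical target, j, and k registers. For example, b3 is 179 and 0021 is 33, matching "F = 179, j = 1, k = 33". The meaning of the first two fields and the last field is not stated in this thread, so the hypothetical helper parse_op leaves them uninterpreted.

```python
# Hypothetical helper, not part of the simulator. Field positions are an
# assumption inferred from the "Testing inputs" log lines in this thread.
def parse_op(text):
    """Split a dotted hex operation string into named integer fields."""
    fields = text.strip().split(".")
    if len(fields) != 7:
        raise ValueError(f"expected 7 dot-separated fields, got {len(fields)}")
    values = [int(field, 16) for field in fields]
    return {
        "prefix": (values[0], values[1]),  # meaning not given in the thread
        "F": values[2],                    # function code, e.g. 0xb3 = 179
        "i": values[3],                    # physical target register (assumed)
        "j": values[4],                    # first source register
        "k": values[5],                    # second source register
        "last": values[6],                 # meaning not given in the thread
    }


op = parse_op("0007.000f.b3.0004.0001.0021.000000")
print(op["F"], op["i"], op["j"], op["k"])  # 179 4 1 33
```

The decoded values agree with the mapping given above: the target is physical register 4 (logical 9), j is physical register 1, and k is physical register 33 (logical 5).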

Anirudh-1149 commented 4 months ago

I am looking into why register 1 is being erased.

Anirudh-1149 commented 4 months ago

The operation isjkj(1, 1) reaches the CO stage before the operation rdjki(9, 1, 5) has started. This out-of-order commit erases physical register 1 and leads to a deadlock in the code.

I found a similar issue in zdotu, where physical register 1 is erased and the read operation is stalled for a while. There, however, physical register 1 is later reassigned as the target of another operation. That operation sets register 1 to ready, and the read moves forward.

joseemoreira commented 4 months ago

Thanks for the thorough investigation. I changed the CO stage to commit in order. Please verify that it works now. Also, if you align the branches, the number of instructions should match.
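As a rough illustration of the difference between the two commit policies (a sketch only, not the actual change to the CO stage), the hypothetical function commit_order below takes operations in program order with assumed completion cycles and returns the order in which they commit. The completion cycles 8 and 3 are invented for the example.

```python
# Toy model (hypothetical): ops is a list of (name, completion_cycle) pairs
# in program order. Completion cycles are made-up illustrative numbers.
def commit_order(ops, in_order):
    """Return operation names in the order they would commit."""
    pending = list(ops)
    committed = []
    cycle = 0
    while pending:
        if in_order:
            # Only the oldest pending operation may commit, and only once it
            # has completed; younger completed operations must wait behind it.
            while pending and pending[0][1] <= cycle:
                committed.append(pending.pop(0)[0])
        else:
            # Any completed operation commits immediately, regardless of age.
            for op in [p for p in pending if p[1] <= cycle]:
                pending.remove(op)
                committed.append(op[0])
        cycle += 1
    return committed


ops = [("rdjki(9, 1, 5)", 8), ("isjkj(1, 1)", 3)]
print(commit_order(ops, in_order=False))  # ['isjkj(1, 1)', 'rdjki(9, 1, 5)']
print(commit_order(ops, in_order=True))   # ['rdjki(9, 1, 5)', 'isjkj(1, 1)']
```

With in-order commit, isjkj(1, 1) cannot commit, and so cannot trigger the erase of physical register 1, until the older read has finished with it. The model simplifies one thing: in the real failure the read never completes at all, because its source register is erased first.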

Anirudh-1149 commented 4 months ago

The issue has been resolved. Thanks for the fix.