Add the function to generate SASS instruction set sequences with unique kernel_id and PC value

masa-laboratory commented 6 months ago

Add a function that generates the SASS instruction sequence with one entry per unique kernel_id and PC value. Some users may not want to see every instruction executed by every warp, only the sequence of instructions with unique PC values within a single kernel. As long as the DUMP_INTERSET switch is turned on, the instruction set file instrset.csv is written to the traces/ folder.

An example of traces/instrset.csv:

```
# The following sequence is a list of low-level assembly instructions (SASS), which is actually executed by the real GPU.
kernel id, PC, instruction string (opcode dest_num [reg_dests] src_num [reg_srcs] mem_width [adrrescompress?] [mem_addresses] immediate)
1, 0000, MOV 1 R1 0 0 0
1, 0010, SHFL.IDX 0 4 R255 R255 R255 R255 0 0
1, 0020, S2R 1 R0 0 0 0
1, 0030, MOV 1 R5 0 0 4
1, 0040, MOV 1 R6 0 0 0
1, 0050, S2R 1 R3 0 0 0
1, 0060, MOV 1 R7 0 0 0
1, 0070, LEA 1 R0 2 R0 R3 0 2
1, 0080, SHF.L.U32 1 R4 2 R0 R255 0 4
1, 0090, MOV 1 R0 1 R255 0 0
1, 00a0, IMAD.WIDE 1 R2 2 R4 R5 0 0
1, 00b0, IMAD.WIDE 1 R4 2 R4 R5 0 0
1, 00c0, MOV 1 R8 1 R2 0 0
1, 00d0, MOV 1 R11 1 R3 0 0
1, 00e0, MOV 1 R2 1 R8 0 0
1, 00f0, LDG.E.SYS 1 R9 1 R4 4 1 0x51700200 64 0
1, 0100, MOV 1 R3 1 R11 0 0
1, 0110, LDG.E.SYS 1 R8 1 R6 4 1 0x51700400 0 0
1, 0120, LDG.E.SYS 1 R10 1 R2 4 1 0x51700a00 64 0
1, 0130, FFMA 1 R9 3 R8 R9 R10 0 0
1, 0140, STG.E.SYS 0 2 R2 R9 4 1 0x51700a00 64 0
1, 0150, LDG.E.SYS 1 R8 1 R6 4 1 0x51700440 0 0
1, 0160, LDG.E.SYS 1 R10 1 R4 4 1 0x51700204 64 0
1, 0170, FFMA 1 R11 3 R8 R10 R9 0 0
1, 0180, STG.E.SYS 0 2 R2 R11 4 1 0x51700a00 64 0
1, 0190, LDG.E.SYS 1 R8 1 R6 4 1 0x51700480 0 0
1, 01a0, LDG.E.SYS 1 R10 1 R4 4 1 0x51700208 64 0
1, 01b0, FFMA 1 R13 3 R8 R10 R11 0 0
1, 01c0, STG.E.SYS 0 2 R2 R13 4 1 0x51700a00 64 0
1, 01d0, LDG.E.SYS 1 R8 1 R6 4 1 0x517004c0 0 0
1, 01e0, LDG.E.SYS 1 R10 1 R4 4 1 0x5170020c 64 0
1, 01f0, FFMA 1 R9 3 R8 R10 R13 0 0
1, 0200, STG.E.SYS 0 2 R2 R9 4 1 0x51700a00 64 0
1, 0210, LDG.E.SYS 1 R8 1 R6 4 1 0x51700500 0 0
1, 0220, LDG.E.SYS 1 R10 1 R4 4 1 0x51700210 64 0
1, 0230, FFMA 1 R11 3 R8 R10 R9 0 0
1, 0240, STG.E.SYS 0 2 R2 R11 4 1 0x51700a00 64 0
1, 0250, LDG.E.SYS 1 R8 1 R6 4 1 0x51700540 0 0
1, 0260, LDG.E.SYS 1 R10 1 R4 4 1 0x51700214 64 0
1, 0270, FFMA 1 R13 3 R8 R10 R11 0 0
1, 0280, STG.E.SYS 0 2 R2 R13 4 1 0x51700a00 64 0
1, 0290, LDG.E.SYS 1 R8 1 R6 4 1 0x51700580 0 0
1, 02a0, LDG.E.SYS 1 R10 1 R4 4 1 0x51700218 64 0
1, 02b0, FFMA 1 R9 3 R8 R10 R13 0 0
1, 02c0, STG.E.SYS 0 2 R2 R9 4 1 0x51700a00 64 0
1, 02d0, LDG.E.SYS 1 R8 1 R6 4 1 0x517005c0 0 0
1, 02e0, LDG.E.SYS 1 R10 1 R4 4 1 0x5170021c 64 0
1, 02f0, FFMA 1 R11 3 R8 R10 R9 0 0
1, 0300, STG.E.SYS 0 2 R2 R11 4 1 0x51700a00 64 0
1, 0310, LDG.E.SYS 1 R8 1 R6 4 1 0x51700600 0 0
1, 0320, LDG.E.SYS 1 R10 1 R4 4 1 0x51700220 64 0
1, 0330, FFMA 1 R13 3 R8 R10 R11 0 0
1, 0340, STG.E.SYS 0 2 R2 R13 4 1 0x51700a00 64 0
1, 0350, LDG.E.SYS 1 R8 1 R6 4 1 0x51700640 0 0
1, 0360, LDG.E.SYS 1 R10 1 R4 4 1 0x51700224 64 0
1, 0370, FFMA 1 R9 3 R8 R10 R13 0 0
1, 0380, STG.E.SYS 0 2 R2 R9 4 1 0x51700a00 64 0
1, 0390, LDG.E.SYS 1 R8 1 R6 4 1 0x51700680 0 0
1, 03a0, LDG.E.SYS 1 R10 1 R4 4 1 0x51700228 64 0
1, 03b0, FFMA 1 R11 3 R8 R10 R9 0 0
1, 03c0, STG.E.SYS 0 2 R2 R11 4 1 0x51700a00 64 0
1, 03d0, LDG.E.SYS 1 R8 1 R6 4 1 0x517006c0 0 0
1, 03e0, LDG.E.SYS 1 R10 1 R4 4 1 0x5170022c 64 0
1, 03f0, FFMA 1 R13 3 R8 R10 R11 0 0
1, 0400, STG.E.SYS 0 2 R2 R13 4 1 0x51700a00 64 0
1, 0410, LDG.E.SYS 1 R8 1 R6 4 1 0x51700700 0 0
1, 0420, LDG.E.SYS 1 R10 1 R4 4 1 0x51700230 64 0
1, 0430, FFMA 1 R9 3 R8 R10 R13 0 0
1, 0440, STG.E.SYS 0 2 R2 R9 4 1 0x51700a00 64 0
1, 0450, LDG.E.SYS 1 R8 1 R6 4 1 0x51700740 0 0
1, 0460, LDG.E.SYS 1 R10 1 R4 4 1 0x51700234 64 0
1, 0470, FFMA 1 R11 3 R8 R10 R9 0 0
1, 0480, STG.E.SYS 0 2 R2 R11 4 1 0x51700a00 64 0
1, 0490, LDG.E.SYS 1 R8 1 R6 4 1 0x51700780 0 0
1, 04a0, LDG.E.SYS 1 R10 1 R4 4 1 0x51700238 64 0
1, 04b0, FFMA 1 R13 3 R8 R10 R11 0 0
1, 04c0, STG.E.SYS 0 2 R2 R13 4 1 0x51700a00 64 0
1, 04d0, LDG.E.SYS 1 R8 1 R6 4 1 0x517007c0 0 0
1, 04e0, LDG.E.SYS 1 R10 1 R4 4 1 0x5170023c 64 0
1, 04f0, IADD3 1 R0 2 R0 R255 0 1
1, 0500, ISETP.NE.AND 0 1 R0 0 16
1, 0510, IADD3 1 R6 2 R6 R255 0 4
1, 0520, IADD3.X 1 R7 3 R255 R7 R255 0 0
1, 0530, FFMA 1 R9 3 R8 R10 R13 0 0
1, 0540, STG.E.SYS 0 2 R2 R9 4 1 0x51700a00 64 0
1, 0550, IADD3 1 R8 2 R2 R255 0 4
1, 0560, IADD3.X 1 R11 3 R255 R3 R255 0 0
1, 0570, BRA 0 0 0 224
1, 0580, EXIT 0 0 0 0
```
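For readers who want to consume this file from a script, here is a minimal Python sketch of a loader. It assumes the layout shown in the example above (a `#` comment line, a header row, then `kernel id, PC, instruction string` records with a hexadecimal PC). The function name `load_instrset` and the shortened sample text are illustrative and not part of the PR.

```python
import io


def load_instrset(lines):
    """Build {(kernel_id, pc): instruction_string} from instrset.csv lines."""
    table = {}
    for raw in lines:
        line = raw.strip()
        # Skip blanks, '#' comments, and the header row (it does not start with a digit).
        if not line or line.startswith("#") or not line[0].isdigit():
            continue
        # Split on the first two commas only; the instruction string is kept whole.
        kernel_id, pc, instr = (field.strip() for field in line.split(",", 2))
        table[(int(kernel_id), int(pc, 16))] = instr
    return table


# A shortened copy of the example above, used only for illustration.
SAMPLE = """\
# The following sequence is a list of low-level assembly instructions (SASS).
kernel id, PC, instruction string (opcode dest_num [reg_dests] src_num [reg_srcs] ...)
1, 0000, MOV 1 R1 0 0 0
1, 0070, LEA 1 R0 2 R0 R3 0 2
1, 00f0, LDG.E.SYS 1 R9 1 R4 4 1 0x51700200 64 0
"""

table = load_instrset(io.StringIO(SAMPLE))
print(table[(1, 0x00F0)])
```

Keying the dictionary on the `(kernel_id, pc)` pair mirrors the uniqueness rule the PR describes; passing an open file handle instead of `io.StringIO` should work the same way.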

JRPan commented 6 months ago

nice feature

lol.

1. Why not just grep the first warp from the trace file? Is there any difference?
2. The address looks weird. Shouldn't it be 64-bit?
3. Is there only 1 warp in the example you show? What happens when there are multiple warps? I don't see warp id referenced in your code.
4. There is too much duplicate code. Please reuse the original code and create functions if possible.

masa-laboratory commented 6 months ago

> nice feature
>
> lol.
>
> 1. Why not just grep the first warp from the trace file? Is there any difference?
> 2. The address looks weird. Shouldn't it be 64-bit?
> 3. Is there only 1 warp in the example you show? What happens when there are multiple warps? I don't see warp id referenced in your code.
> 4. There is too much duplicate code. Please reuse the original code and create functions if possible.

Using grep of the first warp will generate some duplicate instructions. An example:

kernel-1.trace
```shell
-kernel name = _Z7mat_mulPfS_S_
-kernel id = 1
-grid dim = (4,1,1)
-block dim = (4,1,1)
-shmem = 0
-nregs = 16
-binary version = 70
-cuda stream id = 0
-shmem base_addr = 0x00007f9310000000
-local mem base_addr = 0x00007f930e000000
-nvbit version = 1.5.5
-accelsim tracer version = 4
-enable lineinfo = 0

#traces format = [line_num] PC mask dest_num [reg_dests] opcode src_num [reg_srcs] mem_width [adrrescompress?] [mem_addresses] immediate

2 0 0 0 0000 0000000f 1 R1 MOV 0 0 0
0 0 0 0 0000 0000000f 1 R1 MOV 0 0 0
1 0 0 0 0000 0000000f 1 R1 MOV 0 0 0
3 0 0 0 0000 0000000f 1 R1 MOV 0 0 0
......
2 0 0 0 00f0 0000000f 1 R9 LDG.E.SYS 1 R4 4 1 0x7f92d5700200 64 0
0 0 0 0 00f0 0000000f 1 R9 LDG.E.SYS 1 R4 4 1 0x7f92d5700000 64 0
3 0 0 0 00f0 0000000f 1 R9 LDG.E.SYS 1 R4 4 1 0x7f92d5700300 64 0
1 0 0 0 00f0 0000000f 1 R9 LDG.E.SYS 1 R4 4 1 0x7f92d5700100 64 0
......
2 0 0 0 00f0 0000000f 1 R9 LDG.E.SYS 1 R4 4 1 0x7f92d5700200 64 0
1 0 0 0 00f0 0000000f 1 R9 LDG.E.SYS 1 R4 4 1 0x7f92d5700100 64 0
3 0 0 0 00f0 0000000f 1 R9 LDG.E.SYS 1 R4 4 1 0x7f92d5700300 64 0
0 0 0 0 00f0 0000000f 1 R9 LDG.E.SYS 1 R4 4 1 0x7f92d5700000 64 0
```

In this simple case, the kernel _Z7mat_mulPfS_S_ has 4 warps. If we grep the traces of the first warp (ctaid_x=0, ctaid_y=0, ctaid_z=0, warp_id=0), we will get 2 sequences for the PC 0x00f0.

We may want only one instruction per PC, so instrset.csv can be used as a lookup table: it has exactly one entry for each unique PC.
But there is a problem. The addresses in the LD/ST instructions executed by each warp may differ, and the immediate values of other instructions may differ too. My suggestion is that users consult instrset.csv for the instruction opcode, register numbers, etc., and not rely on the address or immediate value. For example, I may want the opcode at a certain PC value to determine which execution unit it should be issued to, or the register numbers of an instruction to calculate the bank IDs.
There are 4 warps in the above case. As described in 2, instrset.csv is only a lookup table, so there are no warp_id references.
Indeed, the code needs to be improved. The address output format has some issues.
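The lookup use case mentioned above (opcode and register numbers for a given PC) might look like the following Python sketch. It assumes the instruction-string layout from the instrset.csv header (`opcode dest_num [reg_dests] src_num [reg_srcs] ...`). The bank mapping `register index modulo NUM_BANKS` and the value `NUM_BANKS = 4` are assumptions chosen for illustration, not something the PR or the simulator specifies.

```python
NUM_BANKS = 4  # assumption: illustrative bank count, not taken from the PR


def parse_instr(instr):
    """Split an instruction string into (opcode, dest_regs, src_regs)."""
    tok = instr.split()
    opcode = tok[0]
    n_dst = int(tok[1])
    dsts = tok[2:2 + n_dst]
    i = 2 + n_dst              # index of the src_num field
    n_src = int(tok[i])
    srcs = tok[i + 1:i + 1 + n_src]
    return opcode, dsts, srcs


def bank_ids(regs):
    """Map register names such as 'R9' to bank IDs (assumed: index mod NUM_BANKS)."""
    return [int(reg[1:]) % NUM_BANKS for reg in regs]


opcode, dsts, srcs = parse_instr("FFMA 1 R9 3 R8 R9 R10 0 0")
print(opcode, dsts, srcs, bank_ids(srcs))

# Stores have no destination register, so dest_num is 0 and the sources follow directly.
st_opcode, st_dsts, st_srcs = parse_instr("STG.E.SYS 0 2 R2 R9 4 1 0x51700a00 64 0")
print(st_opcode.split(".")[0], st_dsts, st_srcs)
```

Only the opcode and register fields are read here; the address and immediate fields are ignored on purpose, in line with the caveat that they vary between warps.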

JRPan commented 6 months ago

okay now I understand what you want to do. For PC you can get it with cuobjdump, but not the register. So I guess this can be helpful even though I would probably just grep the inst.

But I cannot accept it as it is.

If addr does not matter, then don't print it. It's wrong, and for different instances the addr may be different.
(Not 100% sure.) The PC is actually hacky. It's a relative PC, so it is only unique within a function. It is possible that the same PC value points to two different instructions.

rodhuega commented 6 months ago

Hi,

I have a similar system to the one proposed here (but with different purposes), and I think I can provide useful insights.

I think it is better to record the instruction information inside instrument_function_if_needed with instr->getSass() than later in recv_thread_fun. With the approach I propose, more information is recorded (such as immediates or other kinds of registers). Moreover, it will be agnostic to the MREF changes.

Regarding the PC, I agree with JRPan. Inside a kernel (.traceg), there can be equal PCs belonging to different functions. If you want to do it properly, you need to add (int)instr->getOffset() to (uint64_t)nvbit_get_func_addr(f). However, you may end up with some big numbers that look weird. To solve that problem, you can keep maps from addr_func to unique_function_id numbers to make conversions. Later, in the output file, you print that association together with the unique_function_id and vpc.
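The bookkeeping described above can be modelled in plain Python, independent of NVBit. In this sketch the integer arguments stand in for the values that nvbit_get_func_addr(f) and instr->getOffset() would return inside the tracer; the class name `PcMapper`, its methods, and the example addresses are hypothetical.

```python
class PcMapper:
    """Map (function address, offset) to a compact (unique_function_id, vpc) pair."""

    def __init__(self):
        self._func_ids = {}  # addr_func -> unique_function_id

    def key(self, func_addr, offset):
        # Assign the next free ID the first time a function address is seen.
        func_id = self._func_ids.setdefault(func_addr, len(self._func_ids))
        return func_id, offset

    @staticmethod
    def absolute(func_addr, offset):
        # The large but unambiguous address: function base plus instruction offset.
        return func_addr + offset

    def associations(self):
        # (addr_func, unique_function_id) pairs to print alongside the output file.
        return sorted(self._func_ids.items(), key=lambda item: item[1])


mapper = PcMapper()
# Two made-up functions that both contain an instruction at offset 0x00f0.
a = mapper.key(0x7F0000001000, 0x00F0)
b = mapper.key(0x7F0000002000, 0x00F0)
print(a, b, hex(PcMapper.absolute(0x7F0000001000, 0x00F0)))
```

Both instructions share the relative PC `0x00f0` but receive different keys, which is the collision the reviewers point out; the small IDs avoid printing the full absolute addresses.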

By the way, if you also want the ICache to avoid false hits during simulation, you will also have to build the access address as (int)instr->getOffset() + (uint64_t)nvbit_get_func_addr(f).

Hopefully, I will someday finish what I'm working on in my private repo and try to merge it.

accel-sim / accel-sim-framework

Add the function to generate SASS instruction set sequences with unique kernel_id and PC value #299