Reconsider the usage of global temporary RTL registers

FluorineDog commented 5 years ago

My suggestion is that instead of defining global at t0 t1 t2 and praying for no conflicts, we should encourge local definitions, which means any needed intermediate "registers" should be put into the corresponding function scope. e.g.

/// rtlreg_t at; 
// ...
void interpret_rtl_addi(const rtlreg_t* dest, const rtlreg_t* src, int imm){
   /// rtl_li(&at, imm);
   /// rtl_add(dest, src, &at); 
   rtlreg_t imm_at; 
   rtl_li(&imm_at, imm)
   rtl_add(dest, src, &imm_at); 
}

I expect some arguments like "this is simulation of hardware behavior". However, I'm somewhat familiar with Verilog, and I don't think there's a counterpart of RTL temporary registers like at.

Maybe it is a concept borrowed from MIPS, but AFAIK they are not sharing lots of similarities.

As we know, a graph of IC is a combination of synchronized circuits and combinatorial circuits, with the latter stateless and the former stateful.

Obviously, due to C language semantics, a globally defined at rtl register is more like a part of global state. However, the usage of temporary RTL registers(e.g. the "at" in rtl_li(&at, imm)), are basically something like wire in verilog, which is combinatorial, thus stateless.

What's more, when we refer to intel manual, the defnition of temporary registers is quite arbitrary, instead of sticking to some fixed set of temporary registers. Take near relative CALL in 64bit as an example:

tempDEST ← SignExtend(DEST); (* DEST is rel32 *)
tempRIP ← RIP + tempDEST;
IF stack not large enough for a 8-byte return address
THEN #SS(0); FI;
Push(RIP);
RIP ← tempRIP;

Here tempXXXs are local and stateless. Though we can use global registers like at t0 t1 to simulate this behavior, it is not intuisive and invites unexpected bugs.

In conclusion, we should encourge local definitions in function scopes, especially RTL ones. It will benifit code readability and robustness dramatically.

sashimi-yzh commented 5 years ago

Actually, we plan to prepare code from which students can be easy to step to JIT (just-in-time compilation) implementation. But local definitions make JIT hard to work.

Also, the term RTL in NEMU is closer to IR (intermediate representation) in compiler than hardware.

If you are not interested in JIT optimization, it is OK to introduce local definitions as you mention above. But this is not our plan. So please feel free to choose your implementation.

FluorineDog commented 5 years ago

Well, i'm not quite sure about the JIT things, though I do have some experience in LLVM. AFAIK, the LLVM IR sets no limit for the number of registers(or variables), though I can't tell whether it's convenient to use stack addresses as tags for code generation until I get hands on it.

Anyway, thanks for your kind explanation.

jiangyy commented 5 years ago

A few more explanations: local states do complicate the specification of the IR. Existing IRs: • RTL (NEMU): easy to interpret, like assembly, can easily do basic optimizations. • SSA (like LLVM): good for compiler optimizations, difficult to analyze. • Stack (like JVM): compact, easy to interpret, but (slightly more) annoying for compiler optimizations.

So you see the trade-offs we made here: we get something like assembly, which is easy to both implement and do some basic optimizations: • If you’re doing NEMU, RTL makes life (slightly) easier. SSA/Stack looks a good choice. • If you’re doing an out-of-order processor to interpret your IR in hardware circuits, RTL is the only practical choice. Think of doing this on SSA or Stack. • If you’re intended to do serious compiler optimizations, SSA is certainly the choice (OpenJDK did this).

FluorineDog commented 5 years ago

@sashimi-yzh @jiangyy

I DO NOT think the "maybe easiness" of static, global RTL registers deserves the painful debugging efforts, even if JIT IS TO BE IMPLEMENTED. Instead, using the functional-like coding style mentioned above is always benificial in engineering practice.

Talk is cheap. I've got my hands on JIT since I finished PA4 last weekend. Right now, I've successfully implemented an almost-fully-functioning (only the hardware interrupt is on the todo list) NEMU supporting JIT. Considerable speeds up is observed.

Due to personal preference, this project is based on C++17 and LLVM 8 with RTTI enabled, managed by CMake (though the original makefile also works.)

Feel free to visit https://github.com/FluorineDog/hust-pa

sashimi-yzh commented 5 years ago

The original issue of using a local variable is that, we can not access it during the optimization pass in the future.

I glance briefly at your code. The key idea of your implementation is that, with the value_cache, you map every accesses of the local variables to a unique Value object in LLVM. In this way, you avoid the issue mentioned above, since the optimization pass will be based on the Value objects, instead of the address of the local variables. But after this mapping, what you actually get is something like SSA, since every writes to the same rtlreg_t is mapped to different Value objects by the LLVM APIs, such as CreateAnd(), CreateOr().

You implement JIT successfully in one week, and obtain considerable speedup. This is mainly achieved by LLVM. You first translate the RTL to the correct SSA, then let LLVM perform the optimization for you. It is not surprising that you get such a good result.

Now comes our plan. We also want students to implement their own JIT engine to perform the optimization. This can give them a chance to explore the following:

How to perform some basic optimizations?
How to implement a simple JIT engine?

Note that such goals are out of the scope of the original ICS course. In fact, they are somehow related to the compiler course. This JIT task can guide students to think about the questions above before taking the compiler course, and they may come up with their own ideas to these questions. These ideas may be right, but they may also be odd, or hard to implement, or even wrong. But the most important, they do explore. During the compiler course, students will learn the optimization techniques systematically. At that time, they will rethink about "why this is better" and "why the previous ideas are wrong". This is a good way of learning.

If your goal is to obtain the best performance, or to get runnable JIT code as soon as possible, it is recommended to call LLVM for help as your case. But this is not our plan. To let students come up their own ideas easily, we should provide a simple enough IR. Currently your case can not prove the "easiness" we want, since your code does not contain any optimization ideas of yours.

Note that, again, such goals are out of the scope of the original ICS course. This is the reason why the JIT part of NEMU is optional, and the reason why I put the JIT part at a low priority (you can see that the lecture note of this part is still incomplete). An undergraduate student in NJU and I have tried to use the RTL to perform some basic optimizations. It looks good. But to make the JIT part mature, it is still a long way (maybe in five years?). If you are interested in implementing your own JIT engine (either RTL-based or SSA-based is ok), welcome to share your experience.

===============

Now let us go back to the original issue about using local variables in RTL. One advantage of global RTL registers is that they can be accessed by the JIT code directly. By using local variables, you should somehow map them into some addresses which the JIT code can access. But if you want to do this, a better way is to dynamically manage the RTL temporal variables. When you need a new RTL temporal variable, you allocate one from a global array. When you do not need it anymore, just free it. Note that the global array is accessible from the JIT code.

You have emphasized a lot the pain of debugging. I can not imagine how painful you are. According to the feedback from the NJU students, bugs caused by unexpected overwriting of global RTL registers are not too often. But there are still some ways to improve: (1) Dynamic management of the RTL temporal variables mentioned above. (2) Further divide global RTL registers into several hierarchical classes. By obeying these conventions, the probability of unexpected overwriting will decrease.

at - only use for rtl_li
t0, t1... - only use as temporal variables inside pseudo RTL
s0, s1... - only use as temporal variables inside the implementation of guest instructions (the top level which can use RTL instructions)
cpu.eax, cpu.ebx... - use globally and across different guest instructions

(3) Increase the number of global RTL registers. By providing enough RTL registers, the probability of unexpected overwriting will also decrease. (4) If a pseudo RTL needs to use a temporal variable, take it as an extra argument. In this way, the caller of such a pseudo RTL can explicitly decide which RTL register is going to be used as the temporal variable. This can somehow expose the unexpected conflict usage of RTL registers.

Note that

(2) can achieve the effect of (4). Since RTL registers of different classes have different life time, they can be reused again once they are out of their corresponding scopes. Therefore, if we already have (2), then (4) is not necessary, since if we obey (2), any RTL registers in the right class is ok for (4), and the caller do not need to worry about which one is to choose.
(2) + (3) can achieve the effect of (1). On one hand, by dynamic management of the RTL temporal variables, we also consider the temporal variables as a different class from other RTL registers. On the other, providing enough RTL registers can achieve the illusion of unlimited amount of RTL registers. As mentioned above, RTL registers of different classes can be reused again once they are out of their corresponding scopes. Therefore, we do not need to explicitly allocate and free them.
(1) and (3) may affect the performance of JIT code in the future. In the JIT code implementation, we can directly assign some RTL registers to the host CPU registers. In this way, the performance of JIT code will be improved. But if we have more RTL registers, then less of them can be assigned to the host CPU registers, since the number of host CPU registers is limited. But how much impact (1) and (3) will introduce is still not clear.
QEMU uses (1), (3) and (4). The IR which QEMU uses is called TCG. It is a kind of IR like RTL used in NEMU. In fact, to design NEMU, we borrow a lot of ideas from QEMU, and then think about how to simplify them for teaching propose. QEMU does not use local variables as you suggest. I think the discussion above has covered the reason.

Therefore, in the next version of NEMU, I am going to try (2), with the minimal of (3).

Welcome to further discuss the trade-offs among these choices.

jiangyy commented 5 years ago

Is it possible to schedule an offline discussion? Looks like that @FluorineDog is not in Nanjing (or Beijing, which I'm going a few times a year).

FluorineDog commented 5 years ago

As you can see, the primal goal of the JIT here is to speed up the emulator, instead of playing around with the backend of compiler (which, I'm afraid, seems to be what you are doing).

You say I let LLVM help me to do all the optimization. This statement is right and wrong. It is right because the LLVM optimization pass is indeed impressive. It is wrong because the most of the speedup is due to instruction cache itself(here is in the form of compiled code), which saves lots of costly memory loads, especially when VM is enabled.

For the offline discussion, I'm also anticipating it actually. Maybe 15th will be fine for me, if I can persuade my teacher to put off the upcoming deadline of the report, AND have someone to afford the ticket from Nanjing to Wuhan (no need for a reverse one, since I will happen to visit Shanghai at that time...)

Free free to add my Wechat Fluorine_Dog, since it seems to be a better tool to communicate, at least than Github Issuses...

Michael1015198808 commented 5 years ago

As you can see from their name, at, t0, t1, t2, t3 are two types of temporary register. What Mr. Yu means is that

You should only use t[0123] in EHelper
You should use at only in RTL
Once the RTL function is finished, the value of at will be considered uninitialized. (In other words, each RTL function should initialize the value of at and can not suppose that at is calculated by other functions) Since this program is running in single thread, it only runs at most one EHelper and one RTL at one time. So this won't introduce any bug if you truly write code according to the two principles above.

sashimi-yzh commented 5 years ago

@FluorineDog

For speeding up, another way is to use KVM. It will give you native performance, but it may also reduce the fun of building system which works.

You remind me that some people may not be interested in compiler backend. It will be better to provide three choices for optimization:

use KVM to get native performance, but with less fun
let LLVM do the backend to get reasonable performance improvement, with fun but less effort
write the backend by themselves to get reasonable performance improvement, know how things work but with more effort

It is interesting to design a framework to support these choices simultaneous. I will try to do this in the future, but in a long-term plan.

For your suggestion about local RTL registers, I am still not going to try it. If you do not have any further comments, I will close this issue.

FluorineDog commented 5 years ago

@sashimi-yzh But will KVM support running a program on a different architecture(like emulating an x86 program on IBM Power9 chips) and be able to set breakpoints easily? I just believe that virtual machine and emulator is solving different problems.

As for global vs. local RTL registers, even if assembly code is to be emitted by hand, we can still map local RTL registers stack address(which can be seen as a “tag” to distinguish different local variables) to some memory buffer, by using some naive algorithm (e.g., assign a new memory location for every new address encountered). This will free the students from manually doing the “register graph coloring” pass when implementing the RTL pseudo-code.

NJU-ProjectN / nemu

Reconsider the usage of global temporary RTL registers #6