cmu-db / peloton

The Self-Driving Database Management System
http://pelotondb.io
Apache License 2.0
2.03k stars 622 forks source link

Proposal: Change `codegen::lang::If` to hide PHI and force scoping. #917

Open phisiart opened 6 years ago

phisiart commented 6 years ago

I discussed with Tony about this. We think we should post an issue here for further discussion.

TLDR

Provide a different codegen::lang::If that

The interface should look like this:

llvm::Value *codegen::lang::If2(
    CodeGen &cg,
    llvm::Value *condition,
    const std::function<llvm::Value *()> &cgen_then_branch,
    const std::function<llvm::Value *()> &cgen_else_branch);

Background

Since LLVM uses the SSA form, we must explicitly use a phi node to merge 2 different versions of a value. For example, instead of writing

if condition:
  printf("then branch")
  a = 1
else:
  printf("else branch")
  a = 2

We have to write

if condition:
  printf("then branch")
  a1 = 1
else:
  printf("else branch")
  a2 = 2
a = merge(a1, a2) # This is the PHI node in LLVM

More specifically, to actually generate the code above, we need to use codegen::lang::If in this way:

llvm::Value *condition = ...;
codegen::lang::If cond{cg, condition};
cg.CallPrintf("then branch", {});
llvm::Value *a1 = cg.Const32(1);
cond.ElseBlock();
cg.CallPrintf("else branch", {});
llvm::Value *a2 = cg.Const32(2);
cond.EndIf();
llvm::Value *a = cond.BuildPHI(a1, a2);

The above looks a bit messy, so we usually add scoping like this:

llvm::Value *condition = ...;
llvm::Value *a1;
llvm::Value *a2;
codegen::lang::If cond{cg, condition};
{
  cg.CallPrintf("then branch", {});
  a1 = cg.Const32(1);
}
cond.ElseBlock();
{
  codegen::CallPrintf("else branch");
  a2 = cg.Const32(2);
}
cond.EndIf();
llvm::Value *a = cond.BuildPHI(a1, a2);

The problems of the current codegen::lang::If are:

For example, the user might (mistakenly) directly use a2:

llvm::Value *condition = ...;
llvm::Value *a1;
llvm::Value *a2;
codegen::lang::If cond{cg, condition};
{ ... }
cond.ElseBlock();
{ ... }
cond.EndIf();
<directly use a2 here>

Or, the user doesn't understand PHI nodes but coincidentally uses it correctly:

llvm::Value *condition = ...;
llvm::Value *a1;
llvm::Value *a2;
codegen::lang::If cond{cg, condition};
{ ... }
cond.ElseBlock();
{ ... }
cond.EndIf();
a2 = cond.BuildPHI(a1, a2);
<directly use a2 here>

Proposal

What if we have a different codegen::lang::If that has an interface like this:

llvm::Value *codegen::lang::If2(
    CodeGen &cg,
    llvm::Value *condition,
    const std::function<llvm::Value *()> &cgen_then_branch,
    const std::function<llvm::Value *()> &cgen_else_branch);

Then we can use it like this:

llvm::Value *condition = ...;
llvm::Value *a = codegen::lang::If2(cg, condition, [&] {
  cg.CallPrintf("then branch", {});
  return cg.Const32(1);
}, [&] {
  cg.CallPrintf("else branch", {});
  return cg.Const32(2);
});

Also we will need another interface for multiple returns:

std::vector<llvm::Value *> codegen::lang::If2(
    CodeGen &cg,
    llvm::Value *condition,
    const std::function<std::vector<llvm::Value *>()> &cgen_then_branch,
    const std::function<std::vector<llvm::Value *>()> &cgen_else_branch);

Using the same idea for Loops

Suppose we have the following mental model for loops:

loop_vars := init_loop_vars
while (condition(loop_vars)): # the condition depends on the current state
  loop_vars := update(loop_vars)

Then we could have a codegen::lang::Loop that has the following interface:

using Values = std::vector<llvm::Value *>;
std::vector<llvm::Value *> codegen::lang::Loop2(
    CodeGen &cg,
    const Values &init_loop_vars,
    const std::function<llvm::Value *(const Values &)> &cgen_condition,
    const std::function<Values(const Values &)> &cgen_update
);

For example, suppose we want to create the following loop:

int x = 0;
for (int i = 0; i < 10; i++) {
  if (i % 2 == 0) {
    x++;
  }
}
// We want to use x later.
llvm::Value *x = codegen::lang::Loop2(
    cg,

    // init_loop_vars
    {/*i=*/cg.Const32(0), /*x=*/cg.Const32(0)},

    // cgen_condition
    [&](const std::vector<llvm::Value *> &curr_loop_vars) {
      llvm::Value *curr_i = curr_loop_vars[0];
      return cg->CreateICmpULT(curr_i, cg.Const32(10));
    },

    // cgen_update
    [&](const std::vector<llvm::Value *> &curr_loop_vars) {
      llvm::Value *curr_i = curr_loop_vars[0];
      llvm::Value *curr_x = curr_loop_vars[1];

      llvm::Value *next_i = cg->CreateAdd(curr_i, cg.Const32(1));
      llvm::Value *next_x = codegen::lang::If2(
          cg,
          cg->CreateURem(curr_i, cg.Const32(2)),
          [&] { return cg->CreateAdd(curr_x, cg.Const32(1)); },
          [&] { return curr_x; }
      );

      return {next_i, next_x};
    }
)[1];
pmenon commented 6 years ago

Thanks for putting this together and submitting it! Originally, lang::If was meant only for control flow, not to control assignment. Over time, this evolved into what we now.

What you really want is a selection of values based on a condition. LLVM already has this, to a limited extent; to implement your example, you'd write:

llvm::Value *condition = ...
llvm::Value *a = codegen->CreateSelect(condition, codegen.Const32(1), codegen.Const32(2));

However, if the logic in each case is more complex, you don't want to do that. But, your proposal is insufficient for two reasons:

  1. It isn't intuitive. The great thing with lang::If is that it looks like C code, making it familiar. The intention of If2 is not clear from just looking at it.
  2. It suffers from the same potential misuse as lang::If. For multi-valued returns, you rely on the caller placing results in the same order in each branch. How do you deal with values that are lambda-captured outside the if?

I propose that you implement a beefed up version of LLVM's select as lang::Select:

class Select {
 public:
  // Every branch produces a Result - a map of variables to their values
  using Result = std::unordered_map<std::string, llvm::Value*>;

  // The constructor takes in the set of variables that this selection produces.
  // This is used to verify that each branch produces values for every variable
  Select(CodeGen &codegen, std::vector<std::string> vars);

  // Add a single condition to the selection
  void AddCondition(llvm::Value *cond, std::function<Result()> producer);

  // Add default condition
  void AddDefault(std::function<Result()> producer);

  // Evaluate the selection returning the final result
  Result Eval();
};

This hides the PHI-nodes, can be implement to ensure correct use (I think), and is objectively clear in intention. It may be explicit, but with more people working on the system it's better to be verbose and clear than to use clever tricks.

seojungmin commented 6 years ago

I think we should have CreateSelect abstracted, and implemented in our engine. We can see from many IR outputs, such that quite a number of ifs are translated to select. This will help the decision making on how we are going to refactor if.

ruobing-wang commented 6 years ago

It is good to know Select is a better implementation. I can modify my PR to add Select abstraction. I am a student who just work on this project recently and don't have much experience on the engine. Thank you for your ideas!

seojungmin commented 6 years ago

@MalcolmUnWang If you are going to add Select, please make it as a new, separate PR.

seojungmin commented 6 years ago

and the first place you could apply it to is the case operator.

ruobing-wang commented 6 years ago

@seojungmin I am wondering where I can view the IR outputs like you said. I am working on the select but having trouble testing my implementation. Thank you.

seojungmin commented 6 years ago

@MalcolmUnWang No problem. This is the line that you could print out the IRs created.
https://github.com/cmu-db/peloton/blob/a70c7d4cd295243bf7b326b2ef17aa0bcf15e9e5/src/codegen/code_context.cpp#L221 Just change LOG_TRACE to LOG_INFO. You could also use codegen.CallPrintf() to print any values at runtime, for debugging. It is just like a printf.