facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.51k stars 1.15k forks source link

Output column names for Substrait are incorrect #5820

Open westonpace opened 1 year ago

westonpace commented 1 year ago

Bug description

This discussion started in #3543 and, as requested, I am opening a new issue for this. I think @whutjs summarized the problem well here:

@vibhatha We have a Calcite plan, which is

LogicalProject(_cluster=[$0], tag1=[$2], tag2=[$3], dc=[$1], $_field=[$4])
  LogicalAggregate(group=[{0, 1, 2, 3}], $_field=[map_union_sum($4)])
    LogicalFilter(condition=[AND(=($0, 'Tesla'), SEARCH($2, Sarg['0', '1']:CHAR(1)), =($3, '0'), =($1, 'fake_sandbox'))])
      LogicalTableScan(table=[[test, _default]])

And the corresponding Substrait plan is this

Base on the plan, we know that the projection names are:

    names: "_cluster"          // ordinal: 0
    names: "tag1"                //  ordinal: 1
    names: "tag2"               //  ordinal: 2
    names: "dc"                  //   ordinal : 3
    names: "$_field"           //   ordinal: 4

Before this change, the output of this plan would be like:

before

But after this change, the output change to this:

after

System information

I apologize, but I have no good way of running Velox plans on my system at the moment. I am fairly certain that system info is not relevant to this bug. However, if needed, I can dump this information.

Relevant logs

No response

westonpace commented 1 year ago

I have read through the original issue more carefully. I believe that both the old output and the new output have the wrong column names. The old output has the column names n3_0, n3_1, n3_2, n3_3. The new output has the column names n3_5, n3_6, n3_7, n3_8, n3_9. The plan specifies that the output column names should be: _cluster, tag1, tag2, dc, $_field.

I am fairly certain the issue is related to this TODO in the substrait consumer code:

https://github.com/facebookincubator/velox/blob/main/velox/substrait/SubstraitToVeloxPlan.cpp#L549

core::PlanNodePtr SubstraitVeloxPlanConverter::toVeloxPlan(
    const ::substrait::RelRoot& root) {
  // TODO: Use the names as the output names for the whole computing.
  const auto& names = root.names();
  if (root.has_input()) {
    const auto& rel = root.input();
    return toVeloxPlan(rel);
  }
  VELOX_FAIL("Input is expected in RelRoot.");
}

The column names are in the variable names but this variable is not used.

whutjs commented 1 year ago

@westonpace hi, is it possible to add a projectionNames to the ProjectRel in Substrait?

image

It would be very easy to get the output names right in velox if there are projectionNames in ProjectRel. In fact, that's how we implement it now in our internal codes:

core::PlanNodePtr SubstraitVeloxPlanConverter::toVeloxPlan(
    const ::substrait::ProjectRel& projectRel) {
  ...
  // Ignore codes not relevant
  ...

  // The actual projectionNames 
  const auto & actualProjectionNames = projectRel.names();
  // Then, adding project expression related project names and expressions.
  for (const auto& expr : projectExprs) {
    expressions.emplace_back(exprConverter_->toVeloxExpr(expr, inputType));
    if (colIdx < projectRel.names_size()) {
      projectNames.emplace_back(actualProjectionNames.at(colIdx));
    } else {
      projectNames.emplace_back(
          substraitParser_->makeNodeName(planNodeId_, colIdx));
    }
    colIdx += 1;
  }

   ...
  // Ignore codes not relevant
   ...
}

And now we can get the correct output names in velox:

image
westonpace commented 1 year ago

Ok, I think I understand some of the challenges better.

  1. In Velox, there is no way to specify the output column names other than a project.
  2. In Velox, field references must use column names. There is no way to refer to a field by index.

What about adding a dummy project node at the end of the plan to rename the columns?

core::PlanNodePtr SubstraitVeloxPlanConverter::toVeloxPlan(
    const ::substrait::RelRoot& root) {
  // TODO: Use the names as the output names for the whole computing.
  const auto& names = root.names();
  if (!root.has_input()) {
    VELOX_FAIL("Input is expected in RelRoot.");
  }
  const auto& rel = root.input();
  core::PlanNodePtr rootNode = toVeloxPlan(rel);
  auto outputType = rootNode->outputType();
  if (root.names_size() == 0) {
    return rootNode;
  }
  if (root.names_size() != outputType->size()) {
    VELOX_FAIL(
        "Names in plan did not match the number of fields in the output type.");
  }
  std::vector<std::string> columnAliases;
  std::vector<core::TypedExprPtr> columnRefs;
  for (uint32_t fieldIdx = 0; fieldIdx < outputType->size(); fieldIdx++) {
    columnRefs.push_back(std::make_shared<core::FieldAccessTypedExpr>(
        outputType->childAt(fieldIdx)));
    columnAliases.push_back(root.names(fieldIdx));
  }
  return std::make_shared<core::ProjectNode>(
      nextPlanNodeId(),
      std::move(columnAliases),
      std::move(columnRefs),
      rootNode);
}
vibhatha commented 1 year ago

Could this get overridden by another project?

westonpace commented 1 year ago

Could this get overridden by another project?

No, this is at the root node. There are no operators beyond this.

whutjs commented 1 year ago

Can we be sure that the projection names will always show up in the root node?

vibhatha commented 1 year ago

Let me work on a draft and evaluate this further. I think Weston's suggestion is valid and the Substrait plan will most certainly have the output names in the root. If that's holding, this should be a probable way to solve it.

@westonpace What do you think?

westonpace commented 1 year ago

Can we be sure that the projection names will always show up in the root node?

Yes. In Substrait, this is the only place we can be certain that names will show up. From the substrait spec:

In Substrait, all fields are dealt with on a positional basis. Field names are only used at the edge of a plan, for the purposes of naming fields for the outside world.

It is possible that a producer does not include names in the plan. In that case I would suggest that the consumer reject the plan as invalid or use meaningless names and assume the producer does not care what the columns are named.

It is possible to add names to a node / expression using extensions (I believe there is someone attempting to do exactly this to get a lossless roundtrip with Spark plans) however a consumer should not expect these to be present. Otherwise they will only be able to consume plans from that one producer.

It would be possible to alter the official spec (not just an extension) to add names as described but that has already been proposed once or twice before and turned down because names in the middle of a plan are redundant and this adds a fair amount of complexity.

@westonpace What do you think?

One note about my proposed solution is that it will not work if there are nested columns (columns that are themselves row vectors or contain row vectors). In that case the Substrait plan will include names for all of the fields (in DFS order). The approach I sketched above would raise an error in this situation (Names in plan did not match the number of fields in the output type.) which is probably good enough for a start. It is not clear to me if Velox supports naming nested row vector fields or not (from a glance it appears that it does). If Velox does support naming nested row vector fields then there needs to be some way to rename these (e.g. I'm not sure SQL alias syntax is sufficient so I assume there is some other mechanism). If Velox does not support naming nested row vector fields (or chooses not to offer a way to rename them) then the consumer should skip the nested names or ideally verify that they match what Velox has come up with automatically.

Although, I do think just raising an error in this situation (nested columns) is a good first step.