facebookincubator / velox

A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0

Change FunctionSignature returnType to std::optional to support return types that vary with config #10467

Closed: jinchengchenghh closed this 2 weeks ago

jinchengchenghh commented 1 month ago

Description

In Spark functions, there are many functions whose return data type is computed from the input types and varies with config. Currently, we always implement these as SpecialForm, such as decimal_round and make_decimal. But this has several shortcomings:

  1. The SpecialForm name must be unique, so while the function name for other types is add, the decimal variant has to be registered under a different name such as decimal_add;
  2. The implementation is different from, and more complex than, common vector functions;
  3. Aggregate functions do not have a corresponding SpecialForm.

Why should we make returnType optional?

  1. The function identity should contain only the name and input types:
    
    // Function lookup during expression compilation uses only the name and
    // input types:
    if (auto functionWithMetadata = getVectorFunctionWithMetadata(
            call->name(),
            inputTypes,
            getConstantInputs(compiledInputs),
            config)) { ... }

    if (auto simpleFunctionEntry =
            simpleFunctions().resolveFunction(call->name(), inputTypes)) { ... }

2. The return type is only used in test code; we can construct `CallTypedExpr` directly in test code and call `evaluate` to test (see the sketch after this list).
3. For aggregate functions, we can also construct `CallTypedExpr` and set:

core::AggregationNode::Aggregate agg;

agg.call = std::dynamic_pointer_cast<const core::CallTypedExpr>(
    inferTypes(untypedExpr.expr));

4. Functions with a simple result type can still use the original logic.
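
A minimal sketch of point 2, assuming the FunctionBaseTest-style `evaluate` overload that accepts a typed expression (the column names and decimal types here are illustrative):

// 'a' and 'b' are DECIMAL(20, 5) vectors created with makeFlatVector;
// makeRowVector names them c0 and c1 by default.
auto expr = std::make_shared<core::CallTypedExpr>(
    DECIMAL(21, 5), // The test supplies the result type explicitly.
    std::vector<core::TypedExprPtr>{
        std::make_shared<core::FieldAccessTypedExpr>(DECIMAL(20, 5), "c0"),
        std::make_shared<core::FieldAccessTypedExpr>(DECIMAL(20, 5), "c1")},
    "add");
auto result = evaluate(expr, makeRowVector({a, b}));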

https://github.com/facebookincubator/velox/pull/10383
jinchengchenghh commented 1 month ago

What do you think? @mbasmanova @rui-mo @liujiayi771

mbasmanova commented 1 month ago

@jinchengchenghh It is a bit difficult to understand this issue. Would you illustrate the points you are making with detailed examples?

liujiayi771 commented 1 month ago

@jinchengchenghh Your point is that for some functions the output type is determined by the input, and it is not possible to accurately represent the result type in the signature. I also encountered situations in decimal agg where the result type couldn't be accurately expressed, and using UNKNOWN would lead to not finding the corresponding agg function.

jinchengchenghh commented 1 month ago

Which commits are these, and what error do you meet? It looks like aggregate function lookup only considers the function name to find the agg function. @liujiayi771 We could also consider changing it to get the entry by name and input types, like vector functions do:

exec::SignatureBinder binder(*signature, inputTypes);
if (binder.tryBind()) {
  auto inputArgs = toVectorFunctionArgs(inputTypes, constantInputs);

  return {
      {entry.factory(sanitizedName, inputArgs, config), entry.metadata}};
}

const AggregateFunctionEntry* FOLLY_NULLABLE
getAggregateFunctionEntry(const std::string& name);
jinchengchenghh commented 1 month ago

https://stackoverflow.com/questions/2322736/what-is-the-difference-between-function-declaration-and-signature

A function declaration is the prototype for a function (or it can come from the function definition if no prototype has been seen by the compiler at that point) - it includes the return type, the name of the function and the types of the parameters (optionally in C).

A function signature is the parts of the function declaration that the compiler uses to perform overload resolution. Since multiple functions might have the same name (ie., they're overloaded), the compiler needs a way to determine which of several possible functions with a particular name a function call should resolve to. The signature is what the compiler considers in that overload resolution. Specifically, the standard defines 'signature' as:

the information about a function that participates in overload resolution: the types of its parameters and, if the function is a class member, the cv-qualifiers (if any) on the function itself and the class in which the member function is declared.

Note that the return type is not part of the function signature. As the standard says in a footnote, "Function signatures do not include return type, because that does not participate in overload resolution".

jinchengchenghh commented 1 month ago

I drafted a PR to solve this issue: https://github.com/facebookincubator/velox/pull/10487

This PR https://github.com/facebookincubator/velox/pull/10383/ can also benefit from this solution.

Could you help take a look? And let me know if we should do this. @mbasmanova

mbasmanova commented 1 month ago

@jinchengchenghh Before proceeding with the changes, would you please explain the issue with detailed examples? See https://github.com/facebookincubator/velox/issues/10467#issuecomment-2228611788

zhztheplayer commented 1 month ago

If I remember correctly, there were some cases among aggregate functions where the same intermediate argument type produces different return types. See avg as an example.

And given that intermediate types are treated as input types during resolution, this is actually a case in Velox where the return type has to be known to look up the correct function.

I'm not sure, but if aggregate functions are the only reason the return type is needed during resolution, I would suggest refactoring the aggregate function resolution procedure to avoid using the intermediate type as an input type. Speaking of SQL, when a function call is being planned, the query planner should always derive its return type (and intermediate types, if it's an aggregate function) from the data types of its arguments. So in theory Velox could follow the same principle to find the right function implementation without requiring extra information.
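
For illustration, the refactored resolution could be keyed on the raw input types and yield both derived types at once. This is a hypothetical API shape, not existing Velox code:

// Hypothetical: resolve an aggregate by name plus raw argument types and
// derive both the intermediate and final types, the way a SQL planner
// does, so the intermediate type never serves as a lookup key.
struct ResolvedAggregate {
  TypePtr intermediateType;
  TypePtr finalType;
};

ResolvedAggregate resolveAggregate(
    const std::string& name,
    const std::vector<TypePtr>& rawInputTypes);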

liujiayi771 commented 1 month ago

@zhztheplayer I also believe that the agg func should be resolved based on the input type rather than the intermediate type. There is still one issue with Spark decimal avg that remains unresolved, which is that we cannot infer the result type through the intermediate type.

In Spark decimal avg, when the input type is DECIMAL(p, s), the intermediate sum type is DECIMAL(p + 10, s), and the final returned average type is DECIMAL(p + 4, s + 4). For the intermediate type DECIMAL(p + 10, s), when p + 10 exceeds 38 the precision is capped at 38, so we can no longer calculate the resultType from the intermediateType. However, both can be determined from the raw input type.
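
To make this concrete, here is a small self-contained sketch of the type rules above (the cap at 38 is simplified; Spark's exact scale adjustment is omitted):

#include <algorithm>
#include <iostream>

constexpr int kMaxPrecision = 38;

struct Decimal {
  int precision;
  int scale;
};

// Spark decimal avg: sum accumulator type DECIMAL(p + 10, s), capped at 38.
Decimal intermediateType(Decimal in) {
  return {std::min(in.precision + 10, kMaxPrecision), in.scale};
}

// Spark decimal avg: result type DECIMAL(p + 4, s + 4), capped at 38.
Decimal resultType(Decimal in) {
  return {std::min(in.precision + 4, kMaxPrecision), in.scale + 4};
}

int main() {
  // DECIMAL(30, 2) and DECIMAL(35, 2) both cap to the intermediate type
  // DECIMAL(38, 2), yet their result types differ, so the result type
  // cannot be recovered from the intermediate type alone.
  for (Decimal in : {Decimal{30, 2}, Decimal{35, 2}}) {
    auto mid = intermediateType(in);
    auto out = resultType(in);
    std::cout << "DECIMAL(" << in.precision << "," << in.scale << ") -> sum "
              << "DECIMAL(" << mid.precision << "," << mid.scale << ") -> avg "
              << "DECIMAL(" << out.precision << "," << out.scale << ")\n";
  }
  return 0;
}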

jinchengchenghh commented 1 month ago

For some decimal functions like decimal round, the result type is determined by an input argument's value, so it was registered as a SpecialForm before. For decimal arithmetic functions like add, divide, subtract, and multiply, the result type is determined by a runtime config.

The return type is not used to look up vector functions during resolution; it is only used when evaluating a string such as round(decimalVal, scale) to construct a CallTypedExpr in test cases. If we test with a CallTypedExpr directly, we don't need the resultType in the function signature.

And it is confusing that there are two ways to compute the resultType: one by parsing the signature, the other in code. https://github.com/facebookincubator/velox/blob/main/velox/functions/sparksql/DecimalArithmetic.cpp#L634 https://github.com/facebookincubator/velox/blob/main/velox/functions/sparksql/DecimalArithmetic.cpp#L415

https://github.com/facebookincubator/velox/pull/10487#issuecomment-2235206986

mbasmanova commented 1 month ago

@jinchengchenghh

In SQL, there are functions and special forms.

A function has a name, a list of argument types, and a return type. There can be multiple functions with the same name and a different number or types of arguments. The return type of a function can be fully resolved from the list of argument types. It is helpful to support "static" return type resolution and not require instantiating executable functions. For example, one can register a function in Velox, extract its signature from the Velox registry, pass it to a SQL planner / analyzer, and enable queries that use this function without having that function defined within the SQL planner / analyzer itself (e.g. one can have C++-only functions that are not defined in Spark Scala or Presto Java code but can be used with these engines). Type resolution also allows writing standalone tests in Velox that use complex expressions and query plans. See the PlanBuilder APIs that take SQL strings. Constructing typed expression trees by hand doesn't scale. Notice that the Fuzzer also uses the function registry to automatically construct valid expression trees. This would not be possible if the function registry didn't include return types.
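
As a concrete example of signature extraction, a sketch assuming the registry's getFunctionSignatures() accessor (the export loop itself is illustrative):

#include <iostream>
#include "velox/functions/FunctionRegistry.h"

// Enumerate every registered function signature, e.g. to feed a SQL
// planner / analyzer that lives outside Velox.
void exportSignatures() {
  for (const auto& [name, signatures] :
       facebook::velox::getFunctionSignatures()) {
    for (const auto* signature : signatures) {
      std::cout << name << ": " << signature->toString() << std::endl;
    }
  }
}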

To evaluate a function, the engine first evaluates all inputs, then invokes the function.

Special forms are similar to functions, but they allow for more flexibility in type resolution or execution. The planner / analyzer needs to know about all special forms, as these often have different syntax and do not support "static" type resolution.

Return type for a special form may not be resolvable from just the list of argument types. For example, the return type for CAST needs to be specified explicitly. Both CAST(integer as varchar) and CAST(integer as double) take a single argument of type integer, but have different return types: varchar vs. double. Another example is that the return type of the round(expr, targetScale) Spark function depends on the value of the second argument, not just its type.

In addition, special forms may not evaluate all the inputs to produce a result. For example, COALESCE is a special form because it evaluates arguments one at a time, and once it finds an argument that evaluates to non-NULL, it returns without evaluating the rest of the arguments. IF is similar: if 'condition' evaluates to true, only the 'then' argument is evaluated.

Velox supports both functions and special forms. Functions are required to provide a way to statically resolve the return type from the input types. Special forms do not have that requirement; they resolve return types dynamically at runtime.

the result type is determined by a runtime config.

This case means that there are 2 different functions, and config decides which one is used in a given setup. The API to register functions (velox::functions::sparksql::registerFunctions) can take a "config" and use it to decide which function to register. Alternatively, the register API may register 2 functions with different names, and the plan converter can map the function to one or the other based on config.

And it is confused there are two ways to compute resultType, one is signature to parse, one is code.

FYI, we have extended the Simple Function API to support functions with decimal inputs and results and re-implemented the Presto functions to use it. This makes them a lot simpler. It might be helpful to do the same for Spark functions.

https://github.com/facebookincubator/velox/pull/9096

CC: @rui-mo

jinchengchenghh commented 1 month ago

In Gluten, we will not have a QueryConfig when registering functions. Can I add a public function void registerFunctions(const std::string& prefix, core::QueryConfig& config) to register functions according to the config, and register the functions with the default config in registerFunctions(const std::string& prefix)?

void registerFunctions(const std::string& prefix);

void registerFunctions(const std::string& prefix, core::QueryConfig& config);
jinchengchenghh commented 1 month ago

SimpleFunction cannot get resultPrecision and resultScale, which I need in the computation process: https://github.com/facebookincubator/velox/blob/main/velox/functions/sparksql/DecimalArithmetic.cpp#L380 Do you mind changing it to a SpecialForm?

mbasmanova commented 1 month ago

@jinchengchenghh I suggest changing the existing registerFunctions to add a new non-optional parameter that specifies the config for registration. This parameter should be a new struct, not core::QueryConfig, and it should contain only the values / flags that influence registration. Something like this:

void registerFunctions(const std::string& prefix, const SparkFunctionRegistrationConfig& config);

The SparkFunctionRegistrationConfig name isn't great. We can iterate on it during code review.
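
A rough sketch of how this could look (the field name is a placeholder to iterate on as well):

#include <string>

// Registration-time options only; unlike core::QueryConfig, nothing here
// changes per query.
struct SparkFunctionRegistrationConfig {
  // Chooses which decimal arithmetic variants get registered.
  bool allowPrecisionLoss{true};
};

void registerFunctions(
    const std::string& prefix,
    const SparkFunctionRegistrationConfig& config) {
  if (config.allowPrecisionLoss) {
    // Register the precision-loss variants of add/subtract/multiply/divide.
  } else {
    // Register the strict variants.
  }
  // Register the config-independent functions as before.
}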

SimpleFunction cannot get resultPrecision and resultScale, which I need in the computation process.

I'm not sure I understand this question. Please take a look at the Presto functions implementation and provide more details about the problem you are facing if you still have questions.
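
For reference, a hedged sketch of the pattern the Presto implementation follows: the Simple Function API's initialize hook receives the concrete input types, so the result precision and scale can be recomputed there from the same rule the signature constraint expresses. The decimal template plumbing is simplified here and the formula is illustrative, not the exact Presto code:

template <typename TExec>
struct DecimalAddFunction {
  VELOX_DEFINE_FUNCTION_TYPES(TExec);

  // Runs once per expression with the concrete DECIMAL(p, s) input types.
  template <typename A, typename B>
  void initialize(
      const std::vector<TypePtr>& inputTypes,
      const core::QueryConfig& /*config*/,
      A* /*a*/,
      B* /*b*/) {
    auto [aPrecision, aScale] = getDecimalPrecisionScale(*inputTypes[0]);
    auto [bPrecision, bScale] = getDecimalPrecisionScale(*inputTypes[1]);
    // Recompute the result scale with the same rule the signature
    // constraint expresses (illustrative formula).
    rScale_ = std::max(aScale, bScale);
  }

  template <typename R, typename A, typename B>
  void call(R& result, const A& a, const B& b) {
    // ... rescale a and b, add, then reduceScale(...) using rScale_ ...
  }

 private:
  uint8_t rScale_{0};
};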

CC: @rui-mo

jinchengchenghh commented 1 month ago

Thanks for your explanation. SparkFunctionRegistrationConfig is good for me. There is still a concern: core::QueryConfig is used to construct functions like Addition, while SparkFunctionRegistrationConfig is used to register them. The allowPrecisionLoss config has the same semantics in both, so the user has to supply a matching SparkFunctionRegistrationConfig and core::QueryConfig at the same time.

Presto functions are different from Spark functions. I have read the Presto functions; they are implemented as simple functions. But for Spark functions we need the result scale, for example:

r = reduceScale(
    TResult(aRescaled + bRescaled), std::max(aScale, bScale) - rScale);

https://github.com/facebookincubator/velox/blob/main/velox/functions/sparksql/DecimalArithmetic.cpp#L380

jinchengchenghh commented 1 month ago

I can also change it to a simple function and compute the result precision and result scale from the input types, but this problem still exists:

And it is confusing that there are two ways to compute the resultType: one by parsing the signature, the other in code.

If you think it is OK, I will change them to simple functions. Then we can use only SparkFunctionRegistrationConfig to control the behavior.

jinchengchenghh commented 1 month ago

I tried changing it to a simple function and using the config at registration, but the overwrite logic has a problem. It throws:

[ RUN      ] DecimalArithmeticTest.notAllowPrecisionLoss
/mnt/DP_disk1/code/velox/velox/vector/tests/utils/VectorTestBase.cpp:151: Failure
Value of: expected->equalValueAt(actual.get(), i, i)
  Actual: false
Expected: true
at 0: expected 0.1232100, but got 0.0123210
unknown file: Failure
C++ exception with description "Exception: VeloxUserError
Error Source: USER
Error Code: INVALID_ARGUMENT
Reason: Found incompatible return types for 'multiply' (DECIMAL(38, 7) vs. DECIMAL(38, 10)) for input types (DECIMAL(20, 5), DECIMAL(20, 5)).
Retriable: False
Expression: resultType->equivalent(*simpleFunctionEntry->type().get())
Function: compileRewrittenExpression
File: /mnt/DP_disk1/code/velox/velox/expression/ExprCompiler.cpp
Line: 426
Stack trace:
Stack trace has been disabled. Use --velox_exception_user_stacktrace_enabled=true to enable it.
" thrown in the test body.

I register all the functions with allowPrecisionLoss true in SparkFunctionBaseTest, and in a single test I register with allowPrecisionLoss false to try to overwrite the earlier registration.

When I register with allowPrecisionLoss false, the FunctionSignature's variables map (std::unordered_map<std::string, SignatureVariable>) is different, so we cannot overwrite the previously registered functions. The SignatureMap is keyed as in https://github.com/facebookincubator/velox/blob/main/velox/expression/SimpleFunctionRegistry.cpp#L43. I tried to print the function signatures with:

for (auto& [key, value] : signatureMap) {
  std::cout << "signature " << key.toString() << std::endl;
  std::cout << "signature variables size" << key.variables().size()
            << std::endl;
  for (auto& [name, variable] : key.variables()) {
    std::cout << "variable name " << name << " variable constraints"
              << variable.constraint() << std::endl;
  }
}

std::cout << "current signature" << metadata->signature()->toString()
          << std::endl;
for (auto& [name, variable] : metadata->signature()->variables()) {
  std::cout << "variable name " << name << " variable constraints"
            << variable.constraint() << std::endl;
}

The printed result is:

signature variables size6
variable name i7 variable constraints(i1 + i2 + 1) <= 38 ? (i5 + i6) : max((i5 + i6) - (i1 + i2 + 1) + 38, min((i5 + i6), 6))
variable name i6 variable constraints
variable name i5 variable constraints
variable name i3 variable constraintsmin(38, i1 + i2 + 1)
variable name i2 variable constraints
variable name i1 variable constraints
current signature(decimal(i1,i5),decimal(i2,i6)) -> decimal(i3,i7)
variable name i7 variable constraints(i5 + i6) <= 38 ? (i5 + i6) : 38
variable name i6 variable constraints
variable name i5 variable constraints
variable name i3 variable constraintsmin(38, i1 + i2 + 1)
variable name i2 variable constraints
variable name i1 variable constraints

This confirms the signatures differ in the constraint for variable i7, so we have two FunctionEntry instances in the FunctionMap. When we try to compile the expression,

if (auto simpleFunctionEntry =
        simpleFunctions().resolveFunction(call->name(), inputTypes)) {

we will get an arbitrary one of the two FunctionEntry instances (with or without precision loss). Possible solutions:

  1. Don't allow registering decimal functions with the same name and input types but different result types at the session level; the user who registers them should guarantee this.
  2. Change the decimal arithmetic functions to SpecialForm; then we could use QueryConfig to control the behavior.
  3. Make the registration key consider only the input argument types.

The problem occurs because registration uses the result type as part of the FunctionSignature key in the map, but lookup via SignatureBinder binds only the argTypes, and the physical type is always HUGEINT, so both entries match the condition.

I would prefer the second approach. It makes it easy to control which function is actually used, and we wouldn't need to compute the result precision and result scale both in the signature and in code. The allowPrecisionLoss mode is a session config in Spark, so its semantics are similar to QueryConfig.

What do you think? @mbasmanova

CC @rui-mo

mbasmanova commented 1 month ago

The problem occurs because registration uses the result type as part of the FunctionSignature key in the map,

@jinchengchenghh This seems to be a bug in the function registry. It should consider only the name and input types, and it should not allow registering functions that differ only in result type. It would be nice to create a separate GitHub issue about this and fix it. CC: @kagamiori @kgpai @bikramSingh91

mbasmanova commented 1 month ago

There is still a concern: core::QueryConfig is used to construct functions like Addition, while SparkFunctionRegistrationConfig is used to register them. The allowPrecisionLoss config has the same semantics in both, so the user has to supply a matching SparkFunctionRegistrationConfig and core::QueryConfig at the same time.

We should not configure the same behavior in 2 different places. From what I understand so far, allowPrecisionLoss should be defined only in SparkFunctionRegistrationConfig, not in core::QueryConfig.

I register all the functions with allowPrecisionLoss true in SparkFunctionBaseTest, and in a single test I register with allowPrecisionLoss false to try to overwrite the earlier registration.

To work around the bug in the registry, you may register the functions in the test with different prefixes.
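
For example, reusing the registration config sketched earlier (the prefixes are illustrative):

// Two registrations that cannot collide in the signature map:
registerFunctions(
    "spark_", SparkFunctionRegistrationConfig{/*allowPrecisionLoss=*/true});
registerFunctions(
    "spark_strict_",
    SparkFunctionRegistrationConfig{/*allowPrecisionLoss=*/false});

// A test then targets one variant explicitly, e.g. by evaluating
// "spark_strict_multiply(c0, c1)".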