apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.39k stars 1.26k forks source link

[multistage] attach default limit 10 to queries that doesn't have a limit #12304

Open walterddr opened 7 months ago

walterddr commented 7 months ago
dttung2905 commented 7 months ago

Hi @walterddr,

I'm just following the issue and I think applying the limit by default is a really good idea too. Although I'm quite new to the pinot code base, I think the starting point is the setLimit() method https://github.com/apache/pinot/blob/0a4398634be81cdbbe891b3da249134ef98743e7/pinot-common/src/main/java/org/apache/pinot/sql/parsers/CalciteSqlParser.java#L495-L499 I have a question. How do we know if the LIMIT is set in the main query or the sub query as we only want to set default limit to the main query. Could you point me to the right place in the code base ? I think I found the place where we parse the subquery but not sure how they all are connected https://github.com/apache/pinot/blob/0a4398634be81cdbbe891b3da249134ef98743e7/pinot-common/src/main/java/org/apache/pinot/sql/parsers/CalciteSqlParser.java#L510-L519

Would love to create a PR to solve this if you permit :smile:

walterddr commented 7 months ago

Hi @dttung2905 thank you for the interest.

yes this is indeed a relatively good task to get to know the pinot codebase (not necessarily beginner though)

a bit of a explanation: the code you linked are for the v1 engine not for the multi-stage engine. (e.g. the function you linked is to convert SqlNode into PinotQuery. In order to achieve similar effect in the multi-stage engine:

  1. look for where both engine utilize the same logic to parse the SQL string into SqlNodeAndOptions https://github.com/apache/pinot/blob/0a4398634be81cdbbe891b3da249134ef98743e7/pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/MultiStageBrokerRequestHandler.java#L105-L115

  2. look for where the query is being converted into an execution plan (compileQuery method) https://github.com/apache/pinot/blob/0a4398634be81cdbbe891b3da249134ef98743e7/pinot-query-planner/src/main/java/org/apache/pinot/query/QueryEnvironment.java#L144-L148

  3. you can see that the compilation runs several steps including validation, relational planning and optimizer. any one of these steps can be used to check the fetch limit

If this is too complicated, please feel free to check in again (i added the beginner task tag but decided to remove it as it is not as straightforward as I original thought of)

dttung2905 commented 7 months ago

Thanks @walterddr for the detailed pointer. I take a look at the code and I think validate stage might be a good place to implement that check, something like

  private SqlNode validate(SqlNode parsed, PlannerContext plannerContext)
      throws Exception {
    // 2. validator to validate.
    SqlNode validated = plannerContext.getValidator().validate(parsed);
    // Check for LIMIT clause in the main query
    if (validated instanceof SqlSelect) {
      SqlSelect sqlSelect = (SqlSelect) validated;
      if (sqlSelect.getFetch() == null) {
        // Add LIMIT 100 clause 
        SqlParserPos originalPos = sqlSelect.getParserPosition();
        SqlNode limitNode = SqlLiteral.createExactNumeric("100", originalPos);
        ((SqlSelect) validated).setFetch(limitNode);
      }
    }
    if (null == validated || !validated.getKind().belongsTo(SqlKind.QUERY)) {
      throw new IllegalArgumentException(
          String.format("unsupported SQL query, cannot validate out a valid sql from:\n%s", parsed));
    }
    return validated;

What do you think about this? Not sure if there are other places to change ( probably change test cases too )

walterddr commented 7 months ago

this will work only in happy-path. it is not guarantee that the validated top is a always SqlSelect for normal queries. but it is a good starting point. ( for example it could be a SqlSetOperator if the top is a UNION ) i would suggest try out a whole bunch of queries and see if there's any query patterns that could evade these assumption (those could be a good test cases to be added)