Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0

Support more sophisticated table and partition bypassing options #13667

Open dbw9580 opened 3 years ago

dbw9580 commented 3 years ago

During review of #13581, @HelloHorizon suggested it should be made more user-friendly to configure bypassed partitions:

Unlike the number of tables in a database, a single table can have tens of thousands of partitions. Can we also come up with an easy way for users to configure bypassed partitions, such as bypassing the majority of a table's partitions?

@maobaolong offered insights from real use cases:

With the bypass table and partition feature, we can bypass older tables or partitions and mount only a small range of tables or partitions to Alluxio.

It works like a rolling window that covers the range of active, valuable partitions. The window moves forward by a fixed time unit, say a week: we want Alluxio to cache only the current week's partitions and bypass the older ones. If a query needs an older partition, it reads the data directly from the under filesystem rather than from Alluxio.
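The rolling window described above can be sketched in a few lines of Python. This is only an illustration, not part of Alluxio: the function name `split_by_window`, the `dt=YYYY-MM-DD` partition naming, and the seven-day window are all assumptions made for the example.

```python
from datetime import date, timedelta

def split_by_window(partition_dates, today, window_days=7):
    """Split partitions into (cached, bypassed) for a rolling window.

    partition_dates maps a partition name to the date it covers
    (hypothetical layout). Partitions newer than the cutoff stay in
    Alluxio; older ones are bypassed and read from the under filesystem.
    """
    cutoff = today - timedelta(days=window_days)
    cached = sorted(n for n, d in partition_dates.items() if d > cutoff)
    bypassed = sorted(n for n, d in partition_dates.items() if d <= cutoff)
    return cached, bypassed

# Hypothetical partitions named after the day they cover.
partitions = {
    "dt=2021-06-01": date(2021, 6, 1),
    "dt=2021-06-10": date(2021, 6, 10),
    "dt=2021-06-14": date(2021, 6, 14),
}
cached, bypassed = split_by_window(partitions, today=date(2021, 6, 15))
```

Re-running the split with a later `today` moves the window forward, so partitions that fall out of the window end up in the bypassed list.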

Describe the solution you'd like

  1. Allow the use of glob, regex, etc. to match partitions when specifying which partitions to bypass.
  2. Introduce filter functions that are invoked on table sync to determine which tables and partitions should be bypassed.

For example:

{
  "bypass": {
    "tables": [
      {
        "table": "table1",
        "partitions": [
          "table1_part1",
          {"filter": "regex", "regex": "table1_part(2|3|4)", "on": "name"},
          {"filter": "topN", "top_n": 100, "on": "column(id)"}
        ]
      }
    ]
  }
}
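To make the intended evaluation concrete, here is a minimal Python sketch of how the entries in a `partitions` list might be applied to partition names. The helper names `matches_entry` and `bypassed_partitions` are hypothetical, and full-string regex matching is an assumption. The `topN` filter is deliberately left out because its semantics are not specified above.

```python
import re

def matches_entry(entry, partition_name):
    """Return True if one `partitions` entry selects the partition name."""
    if isinstance(entry, str):
        # Plain strings are exact partition names.
        return entry == partition_name
    if entry.get("filter") == "regex" and entry.get("on") == "name":
        # Assumption: the regex must match the whole name.
        return re.fullmatch(entry["regex"], partition_name) is not None
    # Other filters (e.g. topN) are not defined well enough to sketch.
    raise NotImplementedError(f"unsupported filter: {entry!r}")

def bypassed_partitions(entries, partition_names):
    """Keep the partition names selected by at least one entry."""
    return [p for p in partition_names
            if any(matches_entry(e, p) for e in entries)]

entries = [
    "table1_part1",
    {"filter": "regex", "regex": "table1_part(2|3|4)", "on": "name"},
]
names = ["table1_part1", "table1_part2", "table1_part5"]
print(bypassed_partitions(entries, names))  # ['table1_part1', 'table1_part2']
```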

Describe alternatives you've considered

Open to discussion.

Urgency

Normal

dbw9580 commented 3 years ago

@apc999 FYI.

maobaolong commented 3 years ago

@dbw9580 Thanks for proposing this. Please check whether your approach supports specifying that only one table, or a specific set of tables, is not bypassed while all other tables are.

dbw9580 commented 3 years ago

After discussion with @maobaolong and @HelloHorizon, the priority is to support regex and exclusion.

I propose a revised and simplified design for the config file:

  1. tables and partitions can now accept either a list of names or an object with an include and/or an exclude key, listing the table or partition names to include in or exclude from the bypass list. The old syntax is still supported and is treated as an include list. I.e.,

    {
      "bypass": {
          "tables": [
              "table1",
              {"table": "table2", "partitions": ["part1", "part2"]}
          ]
      }
    }

    is equivalent to

    {
      "bypass": {
          "tables": {
              "include": [
                  "table1",
                  {
                      "table": "table2",
                      "partitions": {
                          "include": ["part1", "part2"]
                      }
                  }
              ]
          }
      }
    }

  2. when both include and exclude lists are present, an optional includeFirstOnConflict option can be used to resolve conflicts:

    {
      "bypass": {
          "tables": {
              "exclude": [
                  "table1"
              ],
              "include": [
                  {"regex": "^table\\d"}
              ],
              "includeFirstOnConflict": true
          }
      }
    }

    In the above example, table1 matches both lists. Because includeFirstOnConflict is true, the include list wins: table1 is not excluded from the bypassed tables, i.e. table1 bypasses Alluxio and clients read it directly from the UDB.

    This option defaults to true.

  3. a new way to match table and partition names by regular expression:

    {
      "bypass": {
          "tables": [
              {"regex": "^table\\d"},
              {"table": "table2", "partitions": [{"regex": "^part\\d"}]}
          ]
      }
    }

    This config bypasses all partitions of table0, table1, and table3 through table9. For table2, only partitions part0 through part9 are bypassed; any other partitions are not.
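The rules in this revised design can be sketched in Python as follows. This is a sketch under stated assumptions, not the actual implementation. The function names are hypothetical, and regexes are assumed to match the whole name, similar to Java's `Matcher.matches()`. How a name that matches neither list is treated is not specified above, so the sketch assumes it is bypassed only when the spec has an exclude list and no include list.

```python
import re

def normalize(spec):
    """Convert the old list syntax into the include/exclude object form."""
    if isinstance(spec, list):
        spec = {"include": spec}
    return {
        "include": spec.get("include", []),
        "exclude": spec.get("exclude", []),
        # The option defaults to true, as stated in the proposal.
        "includeFirstOnConflict": spec.get("includeFirstOnConflict", True),
    }

def entry_matches(entry, name):
    """Match one entry (plain name, regex object, or table object)."""
    if isinstance(entry, str):
        return entry == name
    if "regex" in entry:
        # Assumption: full-string match.
        return re.fullmatch(entry["regex"], name) is not None
    # Objects such as {"table": ..., "partitions": ...} match by table
    # name; their partition lists would be resolved the same way.
    return entry.get("table") == name

def is_bypassed(spec, name):
    """Decide whether a table (or partition) name is bypassed."""
    spec = normalize(spec)
    included = any(entry_matches(e, name) for e in spec["include"])
    excluded = any(entry_matches(e, name) for e in spec["exclude"])
    if included and excluded:
        return spec["includeFirstOnConflict"]
    if included:
        return True
    if excluded:
        return False
    # Assumption: with only an exclude list, everything else is bypassed.
    return bool(spec["exclude"]) and not spec["include"]

conflict_spec = {
    "exclude": ["table1"],
    "include": [{"regex": r"^table\d"}],
    "includeFirstOnConflict": True,
}
print(is_bypassed(conflict_spec, "table1"))  # True: include wins
```

Setting `includeFirstOnConflict` to false in `conflict_spec` flips the result for `table1` while leaving `table5` bypassed.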

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.