paul-rogers opened 2 years ago
Let's add additional design details as notes to keep the description light.
The catalog is implemented as another table in Druid's metastore DB. Operations are via the REST API. When we support SQL DDL statements, the implementation of those statements will use the same REST API (with some form of user impersonation).
In other systems, there are separate permissions for table metadata and table data. Druid only has read and write access to each datasource, so we work within those limitations.
The basic security rules for the catalog are:
Druid allows granting permissions via a regular expression. So, the admin could cobble together a form of private/temporary tables by allowing, say, write access to all tables of the form "
A future enhancement would be to invent a more advanced security model, but that is seen as a task separate from the catalog itself.
First, I want to say, this is a great proposal that lays some foundation work to introduce DDL to Druid.
My question is what's the relationship between current INFORMATION_SCHEMA in Druid and the proposed Catalog here?
@FrankChen021, thanks for your encouragement! Yes, I hope that by borrowing a bit of RDBMS functionality we can make Druid a bit easier to use by letting Druid keep track of its own details.
You asked about INFORMATION_SCHEMA, so let's address that.
Revised on June 1.
Druid's `INFORMATION_SCHEMA` implements a feature introduced in SQL-92: it is a set of views which presents the same underlying metadata which Druid's SQL engine uses to plan queries. For example, segments for a datasource might have multiple types: `long` in one segment, `string` in a newer segment. Druid picks one for use in SQL. In this case, it will pick `string` (presented to SQL as `VARCHAR`) since that is what appears in the newest segments.
Since `INFORMATION_SCHEMA` is a "view", it is immutable: users cannot modify the `INFORMATION_SCHEMA` tables directly. Instead, modifications are done via Druid's APIs and, with this proposal, via the catalog.
In this proposal, we modify the `INFORMATION_SCHEMA` results to show the effect of applying catalog information on top of the information gleaned from segments. For example, if the user decided that the above field really should be a `long`, they'd specify it as the SQL `BIGINT` type, which would override Druid's type inference rules. The type for the column would then appear as `BIGINT` in the `INFORMATION_SCHEMA.COLUMNS` record for that column.
We may add a "hide" option to columns to mark a column that exists in segments, but, for whatever reason, is unwanted. A hidden column would not appear in INFORMATION_SCHEMA.COLUMNS
since it is not available to SQL.
INFORMATION_SCHEMA as a Semi-Standard

`INFORMATION_SCHEMA` originally appeared in SQL-92, based on "repositories" common in the late 80's. Hence, it has a rather dated set of concepts and data types. It appears that most vendors (such as MySQL, Postgres, Apache Drill, etc.) keep the parts they can, modify bits as needed, and add extensions in keeping with the 80's vibe of the original design. Postgres goes so far as to define its own logical types that mimic the types used in the SQL-92 spec (such as `yes_or_no`).
To maintain compatibility, we retain the standard aspects of `INFORMATION_SCHEMA` while removing the bits not relevant to Druid.
According to the SQL-92 spec referenced earlier, all users should be able to see `INFORMATION_SCHEMA` entries for all tables, columns and other resources in the database. Druid's security model is more strict: users can only see entries for those tables (or datasources) for which the user has read access. We retain this behavior, and enforce the same restriction in the catalog APIs.
Since Druid inherited `INFORMATION_SCHEMA` from Calcite, it picked up some columns that actually have no meaning in Druid. New users have to learn that, though the columns exist, they don't do anything, which is annoying. So, we propose to do a bit of house-cleaning.
Drop the following columns because Druid always works in UTF-8:

- `DEFAULT_CHARACTER_SET_CATALOG`
- `DEFAULT_CHARACTER_SET_SCHEMA`
- `DEFAULT_CHARACTER_SET_NAME`
The following two columns may be useful as we work out catalog details:

- `SCHEMA_OWNER` - useful if Druid were to support temporary tables: such tables would reside in a schema owned by a specific user. There are no plans to add such a feature at the moment, but such a feature is common in SQL systems. Currently always `NULL`; change it to be `druid`.
- `SQL_PATH` - Not sure what SQL-92 uses this for, but it might be handy to allow aliases: the ability to "rename" a table by creating an alias known to SQL, while the native layer uses the original name. Again, there are no plans to add such a feature now, but it could be a way to overcome the "no rename" limitation in Druid.

COLUMNS Columns

In `COLUMNS`, we propose to remove:

- `CHARACTER_MAXIMUM_LENGTH` - Used for `CHAR(x)` and `VARCHAR(x)`, which Druid does not support.
- `CHARACTER_OCTET_LENGTH` - As above.

For tables, we may want to add additional useful Druid information, for example a value to use in an `INSERT` statement if the user does not specify one. The above are needed so the SQL planner knows how to plan `INSERT` statements against rollup tables. Since the SQL planner will use this information, `INFORMATION_SCHEMA` should present it so the user sees what the planner uses.
To handle aging-based changes in datasources, we could introduce another table that provides these rules, but that is out of scope for this project.
For `COLUMNS` we may add additional columns to express Druid attributes. This is very preliminary: one addition would be the Druid column type (such as `COMPLEX`). This would show the `SUM(BIGINT)` or `LATEST(VARCHAR)` idea discussed above.

The following column changes meaning:

- `ORDINAL_POSITION` - Orders columns in the order they appear in metadata, so that users have control over column order. Any columns that exist in segments but do not appear in the catalog metadata appear after the metadata-defined columns, in "classic" Druid alphabetical order. `SELECT *` statements list columns in the order defined by this field.

drive-by comment re `INFORMATION_SCHEMA`: i think it is a bit of a standard, https://en.wikipedia.org/wiki/Information_schema, so we need to be considerate about how we modify it I think. (I'll try to read and digest the rest of this proposal in the near future)
@clintropolis, thanks for the note on `INFORMATION_SCHEMA`. It appears that it was introduced in SQL-92, and it has a very 1980's feel to it in naming and types. Various DBs support varying aspects of the schema. A quick scan of the table of contents of the SQL 2016 spec suggests that the information schema still exists, though I'd have to buy the spec to find out how much it has evolved since SQL-92. MySQL, for example, adds many, many tables on top of the standard SQL-92 set.
The key fact is that `INFORMATION_SCHEMA` is intended to be a set of views on top of the underlying "definition tables." There are no definition tables in Druid: information is stored in non-table form in multiple places. The proposed catalog adds to those sources of information. Druid's `INFORMATION_SCHEMA` simulates the views by creating table contents on the fly. This aspect will remain unchanged in this proposal.
Updated the `INFORMATION_SCHEMA` section to reflect this information.
Work is far enough along to propose the REST API. This material (with any needed updates) will go into the API docs as part of the PR.
Revised: `PUT` for updates. Added "incremental" table update APIs.

The catalog adds two sets of APIs: one for users, the other internal for use by Brokers. All APIs are based at `/druid/coordinator/v1/catalog`. The user APIs are of two forms: "configuration-as-code" APIs that work with a complete table spec, and incremental APIs that modify an existing spec.
Note that all these APIs are for the metadata catalog entries; not the actual datasource. The table metadata can exist before the underlying datasource. Also, the definition can be deleted without deleting the datasource.
The primary catalog entry is the "table specification" (`TableSpec`), which holds the user-provided information about a table. Internally, Druid adds additional information (schema, name, create date, update date, state, etc.) to create the full "table metadata" (`TableMetadata`) stored in the metadata database. Users update the spec; the system maintains the full metadata.
POST {base}/tables/{dbSchema}/{name}[?action=New|IfNew|Replace|Force][&version={n}]
Configuration-as-code API to create or update a table definition within the indicated schema. Payload is a `TableSpec`, defined below.
The schema must be the name of a valid, writable Druid schema to which the user has write access. The valid schemas at present are `druid` (for datasources) and `input` (for input table definitions). We may also support `view`.
The user must have write access to the underlying datasource, even if the datasource does not yet exist. (That is, the user requires the same permissions as they will require when they first ingest data into the datasource.) For input tables, the user must have the extended `INPUT` permission on the input source name.
Creates or replaces the table spec for the given table, depending on the `action`:

- `New` (default): creates the table if it does not exist; returns an error if the table already exists.
- `IfNew`: as above, but succeeds (and does nothing) if the table already exists. Use this action to get SQL `IF NOT EXISTS` semantics.
- `Replace`: replaces the spec for an existing table; returns an error if the table does not exist.
- `Force`: creates the table if it does not exist, else replaces the existing spec with the one provided.

In all cases, the operation uses the provided spec as is. See the `PUT` operation below for partial updates.
For the `Replace` action, `version` updates the table only if it is at the given version number. See below for details.
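As a sketch (the datasource name and columns here are purely illustrative, and the full `TableSpec` format is defined later in this proposal), a create call might look like:

POST {base}/tables/druid/fiveMinDs?action=New

{
  "type": "datasource",
  "properties": {
    "segmentGranularity": "PT5M"
  },
  "columns": [
    { "type": "column", "name": "__time", "sqlType": "TIMESTAMP" },
    { "type": "column", "name": "page", "sqlType": "VARCHAR" }
  ]
}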
PUT {base}/tables/{dbSchema}/{name}[?version={n}]
Incremental API to update an existing table within the indicated schema, and with the given name. Payload is a `TableSpec`, defined below. The schema must be as described for Create Table above.
The table spec can be partial and is merged with the existing spec. A property can be removed by giving it a `null` value. Columns are merged differently for different table types.
The API supports two "levels" of synchronization. By default, the new entry simply overwrites the existing entry. However, if `version={n}` is included, then the update occurs only if the update timestamp in the current metadata DB record matches that given in the REST call. Using a version provides a form of "optimistic locking": first read the definition, make a change, and send the update using the update time from the read. Doing this prevents accidental overwrites.
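A sketch of the optimistic-locking flow (the table name, timestamp, and property value are illustrative): first read the full metadata to obtain the current update timestamp, then send the update using that value as the version.

GET {base}/schemas/druid/table/fiveMinDs
-- response includes, among other fields: "updateTime": 1654634106432

PUT {base}/tables/druid/fiveMinDs?version=1654634106432

{
  "type": "datasource",
  "properties": {
    "targetSegmentRows": 4000000
  }
}

If another writer changed the table after the read, the stored update timestamp no longer matches and the PUT is rejected instead of silently overwriting the other change.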
GET {base}/tables/{dbSchema}/{name}
Configuration-as-code API to read the table specification for the table given by a schema and table name. The user must have read access to the table. Returns a 404 (NOT FOUND) if the entry does not exist. Remember: the metadata entry exists independently of the datasource itself. The result is the `TableSpec` defined below.
Note that the above are defined so that if one does a `POST` to create a table, followed by a `GET`, one ends up with the same table spec that one started with.
DELETE {base}/tables/{dbSchema}/{name}[?ifExists=true|false]
Drop the catalog entry for the given table and schema. The schema must exist. The table must also exist, and the user must have write access to the underlying datasource.
The optional `ifExists=true` parameter provides SQL `IF EXISTS` semantics: no error is given if the table does not exist.
GET {base}/schemas/{dbSchema}/table/{tableName}
Returns the full metadata for a table, including system-maintained properties such as name, update time, table spec, and more. Use this form to obtain the update timestamp used for the `version` optimistic-locking feature. The pattern is: read the table metadata to get the update timestamp, modify the spec, then `POST` the updated spec, providing the version, to ensure no concurrent writes occurred.

Columns in a `SELECT *` appear in the same order that they appear in the table spec. The `PUT` operation above cannot change the order for datasources. To change column order, use the `moveColumn` API:
POST {base}/tables/{dbSchema}/{name}/moveColumn
The payload is a JSON object of type `ColumnOrder` that is of the form:
{
"column": "<name>",
"where": "first|last|before|after",
"anchor": "<name>"
}
A column can be moved to the start or end of the list, or it can be moved to appear before or after some other column. The `anchor` is ignored for the `first` and `last` options.
The operation fails if either the column or the anchor (if provided) does not exist (which may occur if another writer deleted the column in the meantime). The columns refer to entries in the catalog schema. A column may exist in the datasource but not in the catalog; such columns can't be referenced in this API.
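For example, to move an (illustrative) column `country` so that it appears just after `__time`:

POST {base}/tables/druid/fiveMinDs/moveColumn

{
  "column": "country",
  "where": "after",
  "anchor": "__time"
}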
A client may wish to remove a specific datasource column. The `PUT` operation above can't handle deletions, only additions (because addition is far more common). Though this operation is primarily for datasources, it works for input sources as well.
POST {base}/tables/{dbSchema}/{name}/dropColumn
The payload is a JSON list of strings that identifies the columns to be dropped.
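For example, to drop the metadata entries for two (illustrative) columns:

POST {base}/tables/druid/fiveMinDs/dropColumn

[ "userAgent", "referrer" ]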
Note that deleting a column means to remove the metadata entry for the column. This is not the same as hiding the column. This operation does not physically remove the column: if the column still exists in any segment, then the column will appear in the merged schema. Use this operation for the case that a metadata column entry was added by mistake, or if all instances of a previously-existing physical column have expired out of the segments.
Dropping a column does not drop the column from the hidden columns list. It is expected that, if a column is deleted, it would likely appear in the hidden columns list until all old segments with that column expire out of the system.
Datasources provide a "hide" operation for columns. Segments may contain columns which are no longer needed. To avoid the need to rewrite segments, the catalog can simply "hide" existing columns. The PUT
operation can append new hidden columns. This operation is a bit simpler, and can "unhide" an already-hidden column. The hidden column normally will not appear in the list of columns in the table spec: the name usually references a column that exists in segments, but is not actually needed.
The payload is a `HideColumns` object of the form:
{
"hide": [ "a", "b", "c" ],
"unhide": [ "d", "e" ]
}
GET {base}/list/schemas/names
Retrieves the list of the names of schemas known to the catalog, which includes the same set of schemas as the `INFORMATION_SCHEMA.SCHEMATA` table. Note that, in the present version, the catalog does not support user-defined schemas.
The list is not filtered by permissions as Druid does not have schema-level permissions. All users can see all schemas (but not necessarily the contents of the schemas.)
GET {base}/list/tables/names
Retrieves the list of the names of all tables known to the catalog across all schemas. Only some schemas allow definitions, and only definitions appear. This is not a list of actual datasources or system tables: only a list of definitions.
The list is filtered based on user permissions: the list will omit tables for which the user does not have read access.
GET {base}/schemas/{dbSchema}/names
Retrieves the list of the names of tables within the given schema. This list contains only those tables for which metadata entries appear, and is thus a subset of those returned by `INFORMATION_SCHEMA.TABLES`. The list contains only those tables for which the user has read access.
GET {base}/schemas/{dbSchema}/tables
Returns the list of all tables within the given schema for which the user has access. The return value is a list of the same objects returned from `GET {base}/tables/{dbSchema}/{name}`.
POST {base}/flush
Causes the catalog to invalidate any caches. Available on both the Coordinator and the Broker. This API is required only if the catalog DB changes outside of Druid, and is primarily for testing.
GET {base}/tables/{dbSchema}/{name}/sync
Retrieve the entry for a single table within the given schema as a `TableSpec` object. The user is assumed to be the Druid super-user. This API is primarily for use by the Broker node. Currently does the same as `GET {base}/tables/{dbSchema}/{name}`, but this is subject to change as this is an internal API.
GET {base}/schemas/{dbSchema}/sync

Returns the list of all table metadata, as `TableSpec` objects, within the given schema. The user is assumed to be the Druid super-user. This API is primarily for use by the Broker node. Currently does the same as `GET {base}/schemas/{dbSchema}/tables`, but this is subject to change as this is an internal API.
The present version of Druid uses a Calcite feature to specify an ingest input table:
INSERT INTO dst SELECT *
FROM TABLE(extern(
'{
"type": "inline",
"data": "a,b,1\nc,d,2\n"
}',
'{
"type": "csv",
"columns": ["x","y","z"],
"listDelimiter": null,
"findColumnsFromHeader": false,
"skipHeaderRows": 0
}',
'[
{"name": "x", "type": "STRING"},
{"name": "y", "type": "STRING"},
{"name": "z", "type": "LONG"}
]'
))
PARTITIONED BY ALL TIME
As it turns out, SQL (and Calcite) allows the use of named parameters. We can rewrite the above as follows. Notice the `name => value` syntax:
INSERT INTO dst SELECT *
FROM TABLE(extern(
inputSource => '{
"type": "inline",
"data": "a,b,1\nc,d,2\n"
}',
inputFormat => '{
"type": "csv",
"columns": ["x","y","z"],
"listDelimiter": null,
"findColumnsFromHeader": false,
"skipHeaderRows": 0
}',
signature => '[
{"name": "x", "type": "STRING"},
{"name": "y", "type": "STRING"},
{"name": "z", "type": "LONG"}
]'
))
PARTITIONED BY ALL TIME
The above is great, but can be a bit awkward: we have to encode JSON in SQL (which, when we send via the REST API, we encode again in JSON). Let's see how we can use SQL named parameters to streamline the syntax (and set ourselves up for the catalog). SQL requires that parameter names be "simple identifiers": that is, no dots. So, we can't just say:
"inputSource.type" => "inline"
Instead, we have to "flatten" the names. That is, define SQL names that, internally, we map to the JSON names. The mapping is just code, so we omit the details here. Suppose we do the mapping. We now have a different set of arguments, so we need a different function. For now, let's call it `staged`.
We also need a way to specify the input table schema. Here we borrow another bit of Calcite functionality, the `EXTEND` clause, which was added for Apache Phoenix. We modify the syntax a bit to fit our needs. The result:
INSERT INTO dst SELECT *
FROM TABLE(staged(
source => 'inline',
data => 'a,b,1
c,d,2
',
format => 'csv'
))
(x VARCHAR NOT NULL, y VARCHAR NOT NULL, z BIGINT NOT NULL)
PARTITIONED BY ALL TIME
Notice how the keywords in the `staged` function arguments match the `properties` defined in the REST call in the prior section. That is not an accident: it sets us up to merge the two ideas in the next update.
Using the above, the catalog allows the definition of a template table. To motivate this, let's start with a complete input table:
{
"type": "input",
"properties": {
"source": "local",
"file": "wikipedia.csv",
"format": "csv",
},
"columns": ...
}
The above can be run by referencing the name, say `myWiki`:
SELECT * FROM `input`.`myWiki`
Druid, however, never ingests the same data twice: we want to read different files each time. Say `wiki-2015-06-01.csv` one day, `wiki-2016-06-02.csv` the next day. So, we simply omit the `file` property above, converting the catalog entry to a template table:
{
"type": "input",
"properties": {
"source": "local",
"format": "csv",
},
"columns": ...
}
We have to parameterize the template to run it, using a table function with the same name as the input table:
SELECT * FROM TABLE(`input`.`myWiki`(file => 'wiki-2016-06-02.csv'))
The result of the table function is a complete table, ready to run as if it was fully defined in the catalog.
Notice how three pieces come together: the table function works like `staged`, and simply uses the catalog fields as "default" values for the named SQL parameters.

The catalog allows the user to define the ingest partitioning:
{
"dbSchema": "druid",
"name": "fiveMinDs",
"spec": {
"type": "datasource",
"segmentGranularity": "PT5M"
}
}
By doing so, the user can drop the `PARTITIONED BY` clause in the `INSERT` statement as shown above.
Supported values include:
- `PT5M`, `PT1S`
- `DAY`, `FIVE_MINUTE`
- `5 minutes`, `6 hours`
Only the standard Druid values are supported: providing a non-standard interval will raise an error.
The multi-stage ingest engine allows the user to specify secondary partitioning, expressed as the `CLUSTERED BY` clause. The clause includes a list of cluster keys, each of which is an expression and an optional "sort sense" (`ASC` or `DESC`). The expression is typically just a column name.
The catalog models this with a list of JSON objects:
{
"dbSchema": "druid",
"name": "fiveMinDs",
"spec": {
"type": "datasource",
"segmentGranularity": "PT5M",
"clusterKeys": [
{"column": "x"},
{"column": "y", "desc": true} ]
}
}
At present, the catalog supports only column names: additional work is needed in the SQL query layer to support expressions. (There is also some debate about whether the optimizer can correctly use expression columns, and so whether we actually need them.)
When this information is present in the catalog, the user can omit the `CLUSTERED BY` clause from an `INSERT` statement.
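As a sketch, the `clusterKeys` entry above corresponds to the following clause, which the user can now leave out of the `INSERT` statement because the planner fills it in from the catalog:

CLUSTERED BY x, y DESC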
The multi-stage engine allows the user to specify the desired number of rows per output segment. This is presently done as a context setting. With the catalog, it can be specified in table metadata:
{
"dbSchema": "druid",
"name": "fiveMinDs",
"spec": {
"type": "datasource",
"segmentGranularity": "PT5M",
"targetSegmentRows": 4000000
}
}
A user-provided context setting takes precedence. If unset, the value is the default set by the multi-stage engine, which is currently 3 million rows.
Added the `generic` input source.

This section describes input format properties. (It will make more sense if you read the following "Metadata structure" comment first. The joys of using issue comments for documentation...)
Every input table definition (or `staged(...)` function call) must include the `format` property, if the input source requires a format.
| Property | Type | JSON Name | Description |
|---|---|---|---|
| `format` | `VARCHAR` | `type` | Input format type |
Defines a CSV input format.
| Property | Type | JSON Name | Description |
|---|---|---|---|
| `format` | `VARCHAR` | `type` | Must be `'csv'` |
| `listDelimiter` | `VARCHAR` | `listDelimiter` | Delimiter for list values in a list. Defaults to CTRL-A |
| `skipRows` | `INTEGER` | `skipHeaderRows` | Number of rows to skip |
Only the `format` property is required: Druid provides defaults for the other values.
TODO: Need a way to escape special characters for the `listDelimiter` field.
When used in an input table, Druid does not yet support the ability to infer column names from the input file, so `findColumnsFromHeader` is not supported here.
CSV requires a list of columns. Provide these in the `columns` section of the input table definition in the catalog, or using the extended function notation in SQL:
SELECT *
FROM TABLE(staged(
source => 'inline',
data => 'a,b,1
c,d,2
',
format => 'csv'
))
(x VARCHAR NOT NULL, y VARCHAR NOT NULL, z BIGINT NOT NULL)
Defines a Delimited text input format as a generalization of the CSV format. Properties default to provide a TSV (tab-separated values) format.
| Property | Type | JSON Name | Description |
|---|---|---|---|
| `format` | `VARCHAR` | `type` | Must be `'tsv'` |
| `delimiter` | `VARCHAR` | `delimiter` | A custom delimiter for data values. Defaults to TAB |
| `listDelimiter` | `VARCHAR` | `listDelimiter` | Delimiter for list values in a list. Defaults to CTRL-A |
| `skipRows` | `INTEGER` | `skipHeaderRows` | Number of rows to skip |
Usage and limitations are the same as for the CSV format above.
Defines a JSON input format.
| Property | Type | JSON Name | Description |
|---|---|---|---|
| `format` | `VARCHAR` | `type` | Must be `'json'` |
| `keepNulls` | `BOOLEAN` | `keepNulls` | Optional. |
The catalog does not yet support the `flattenSpec` or `featureSpec` properties.
Although batch ingestion does not require a list of columns, the multi-stage engine does. Provide columns the same way as described for the CSV format above.
The `generic` input source is a direct representation of any arbitrary Druid input source, using the JSON representation of that source. The `inputFormatSpec` property holds the JSON-serialized form of the input spec.
The catalog stores metadata in a generalized format designed to support a number of operations:
The general approach is to divide metadata into top-level objects. At present, only one object is available: tables. Others (connections, secrets, schedules) are envisioned for later. Within each object, there are one or more types. For tables, these are different kinds of datasources, different kinds of input tables, views, and so on. Each object has a set of properties, described as key/value pairs. Tables also have a set of columns.
The term specification (or spec) is used for the JSON object which the application writes into the catalog. The term metadata includes the spec, plus other general information such as a name, timestamp and so on.
The table metadata object holds two kinds of information: the system-defined metadata about the entry, and the user-provided table specification (`TableSpec`). All tables, regardless of kind (datasource, input, view, etc.), use the same table metadata object: only the table specification part varies.
Example:
{
"Id": {
"schema":"druid",
"name":"read"
},
"creationTime":1654634106432,
"updateTime":1654634106432,
"state":"ACTIVE",
"spec": <TableSpec>
}
Fields:

- `schema`: Schema name.
- `name`: Table name.
- `owner`: SQL owner. Not yet supported. Omitted if null (which is the only valid value at present.)
- `creationTime`: UTC timestamp (in milliseconds since the epoch) of the original creation time.
- `updateTime`: UTC timestamp (in milliseconds since the epoch) of the most recent creation or update. This is the record's "version" when using optimistic locking.
- `state`: For datasources. Normally `ACTIVE`. May be `DELETING` if datasource deletion is in progress. (Not yet supported.)
- `spec`: The user-defined table specification as described below.

The user provides the id (schema and name) and the spec; the system maintains the other fields.
Table Spec (`TableSpec`)

The table specification holds the user-provided information for a catalog entry for the table. The structure of the spec is the same for all table types; the specific properties and columns depend on the type of table.
"spec": {
"type": "<type>",
"properties": { ... },
"columns": [ ... ]
}
Fields:

- `type`: Type of the table: one of the types defined below.
- `properties`: A generic key/value list of properties. Properties can be defined by Druid, an extension, or the user. Property values can be simple or structured as defined by the specific table type. Allows the application to attach application-specific data to the table.
- `columns`: An ordered list of columns. Columns have a name, a SQL type and properties. Columns also have a column type as described below. Columns appear in a `SELECT *` list in the same order that they appear here. UIs should also use this order when displaying columns.

Column Spec (`ColumnSpec`)

Columns are of one of three types, depending on the kind of datasource. For a detail (non-rollup) datasource, use the `column` type. For a rollup table, use either `dimension` or `measure`. The general form is:
{
"type": "<type>",
"name": "<name>",
"sqlType": "<type>",
"properties": { ... }
}
Fields:

- `type`: The kind of column. Not the data type (that's `sqlType`) but rather the kind of column: for a rollup datasource, there are dimensions and measures; perhaps in some tables there are computed columns, etc. Required.
- `name`: The name of the column. Required.
- `sqlType`: The data type of the column using SQL (not Druid) types. Required or optional depending on the table and column type.
- `properties`: A set of arbitrary key-value pairs for the column. None are yet defined by Druid. This would be a handy place to store a user-visible description, perhaps hints to an application about how to render the column, and so on. We expect to add properties as time goes on. Optional (i.e. can be null or omitted.)

The datasource table type defines ingestion properties. Many other properties can be added: the current set is the bare minimum to support MSQ ingestion. The catalog describes the columns within the datasource, but not how the columns are used. That is, the idea of "rollup" is something that an application may choose to apply to a datasource: it is not a datasource type, and the catalog maintains no information about dimensions, measures and aggregates.
Example:
{
"type":"datasource",
"properties": {
"description": "<text>",
"segmentGranularity": "<period>",
"targetSegmentRows": <number>,
"clusterKeys": [ { "column": "<name">, "desc": true|false } ... ],
"hiddenColumns: [ "<name>" ... ]
},
"columns": [ ... ],
}
Properties:

- `description`: A human-readable description for use by UIs and users. Not used by Druid itself.
- `segmentGranularity`: The segment size, using one of Druid's supported sizes. Equivalent to the `PARTITIONED BY` clause on `INSERT` statements.
- `targetSegmentRows`: The number of rows to target per segment (when there is enough data for segment size to matter.) Defaults to Druid's usual value of 5 million rows. Change the value only for extreme cases, such as very large rows, or other tuning needs.
- `clusterKeys`: The keys to use for secondary partitioning. Equivalent to the `CLUSTERED BY` clause in an MSQ query.
- `hiddenColumns`: Existing datasource columns to hide from SQL.

Cluster keys are defined as a list of JSON objects:
"clusterKeys": [ {"column": "state"}, {"column": "population", "desc": true} ]
More properties will come as work proceeds.
The list of columns acts as hints:

- Columns appear in a `SELECT *` in the order defined in the catalog. Any other columns that exist in the datasource appear after the catalog-defined columns.
- If a column appears in the `hiddenColumns` list, then that column is unavailable for querying via SQL. Use this list to hide columns which were ingested in error, or are no longer needed. Hiding a column avoids the need to physically rewrite segments to remove the column: the data still exists, but is not visible. Generally, a column should appear in the `columns` list or the `hiddenColumns` list, but not both. If a column is in both, then the catalog definition is simply ignored.

Datasource columns provide a name and an optional SQL data type. If the SQL type is not present, the physical type is used. If there are no segments yet, the SQL type defaults to `VARCHAR`. In general, if adding a column to the catalog before data exists, it is best practice to specify the column type.
If the `__time` column appears in the catalog, it must be a `dimension` in a rollup table. Its type must be null or `TIMESTAMP`. Include the `__time` column to indicate where it should appear in the table's schema.
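A sketch of a single column entry for a detail datasource (the column name is illustrative); if `sqlType` were omitted, the type would come from the segments, or default to `VARCHAR` when no segments exist yet:

{
  "type": "column",
  "name": "userId",
  "sqlType": "BIGINT"
}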
An external table is a table that resides outside of Druid. (That is, the table is not a datasource.) External tables are the sources used for ingestion using `INSERT` and `SELECT` statements. MSQ allows querying of external tables. There is a different external table type for each kind of external table: local file, HTTP, S3 bucket, and so on.
An external table is equivalent to an `EXTERN` definition, but with the definition stored in the catalog. (See a note later about the extended table functions.) Druid defines the input source and format as JSON objects. The existing `EXTERN` function uses JSON strings to set these values. The catalog uses a different approach: a "flattened" list of names which Druid maps internally to the fields in the JSON structure. This approach allows us to mesh the input source specification with table function arguments.
External tables are typically parameterized, as Druid generally does not ingest the same data twice (except when testing.)
Many external tables support multiple formats, which appear as a format specification within the table specification, as described below. The format properties appear within those for the table. The reason for this will become clear when we discuss parameterized tables and SQL integration.
External tables typically require a schema (though upcoming MSQ enhancements may allow MSQ to infer the schema at runtime). The schema is provided via the same column specifications as used for datasources, but with a column type of `extern`.
External tables are an abstraction on top of input sources. However, the catalog tends to represent input source properties somewhat differently than do the JSON objects. See below for details.
Column properties:

- `type`: Must be `extern`.
- `name`: A valid SQL identifier for the column. Case-sensitive.
- `sqlType`: One of Druid's scalar data types: `VARCHAR`, `FLOAT`, `DOUBLE` or `BIGINT`. Case-insensitive.

Columns must appear in the same order that they appear in the input file.
Example for an inline CSV external table:
{
"type": "inline",
"properties": {
"format": "csv",
"data": ["a,b,1", "c,d,2" ]
},
"columns": [
{
"name":"a",
"sqlType":"varchar"
},
...
]
}
The `inline` table type includes the data in the table definition itself and is primarily useful for testing. See the Inline input source. It has just one property in addition to the format:

- `data`: A list of strings that represent the rows of data, typically in CSV format.

The `local` table type represents a file on the local file system, and is useful primarily for single-machine configurations. See the Local input source. Provide a format. The table can be parameterized. Properties:

- `baseDir`: The directory from which to read data. If not given, the base directory is the one from which Druid was started, which is generally only useful for sample data.
- `filePattern`: A pattern to match the files to ingest, such as `"*.csv"`. (TODO: In glob or regex format?)
- `files`: A list of file names, relative to the `baseDir`.

Provide either a pattern or a list of files. If both appear, (what happens?)
A local table has two parameters: `filePattern` and `files`. This means that a SQL statement can supply either the pattern, or a list of files, per query. Parameterization is explained in full elsewhere.
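A sketch of how that might look from SQL, following the template-table pattern shown earlier (the table name `localLogs` is illustrative, and the exact function syntax may still evolve):

SELECT *
FROM TABLE(`input`.`localLogs`(files => 'log-2022-06-01.csv'))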
The `http` table type allows reading data from an HTTP server. See the HTTP input source. Supports formats. The HTTP table can be parameterized. Properties:

- `user`: The name of the user to send to the server using basic authentication.
- `password`: The password to send. Use this for "light" security.
- `passwordEnvVar`: The name of an environment variable on the server that holds the password value. Use this for slightly better, if awkward-to-configure, security.
- `uris`: The set of URIs to read from. If the user/password is set, the URIs must be on the same server (or realm.)
- `template`: Partial URI for a parameterized table. See below.
: Partial URI for a parameterized table. See below.When used from SQL, set the template
to the common part of the URIs: typically the part that includes the server for which the credentials are valid. Then, the parameter, uris
, provides a comma-delimited list of the specific items to read. For example:
template
of http://example.com/{}?format=csv
uris
parameter in the query of file1,file2
Again, parameterization is discussed elsewhere.
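Putting the pieces together, a sketch of the parameterized use (the table name `httpLogs` is illustrative): with the `template` above, the call below would presumably read `http://example.com/file1?format=csv` and `http://example.com/file2?format=csv`.

SELECT *
FROM TABLE(`input`.`httpLogs`(uris => 'file1,file2'))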
As noted above, most external tables allow multiple input formats as described in the Input format docs. The catalog form is, again, a bit different than the JSON form.
The set of formats described here is a "starter" set and will be expanded as the project proceeds.
Indicate the input format using the `format` property:

- `format`: one of the supported input formats described here.

Then, include other properties as needed for the selected format. See the note above about the details of the supported formats.
Thanks for the additional details :+1:
> sqlType: One of Druid's scalar data types: `VARCHAR`, `FLOAT`, `DOUBLE` or `BIGINT`. Case-insensitive.
How do you plan to support Druid's complex typed columns (such as the recently added `COMPLEX<json>` columns)? Complex types are currently case-sensitive since they are registered internally in a map however they are defined (potentially via extensions), so it would take some work (and potentially be backwards incompatible) to make them not case-sensitive.
The reason I'm asking is that I'm still a bit worried about how we are going to cleanly map this to Druid's type system. Is it going to be a strict mapping, like exactly 1 SQL type to 1 Druid type? Or will it be permissive (e.g. `INTEGER`, `BIGINT`, etc. all just map to the most appropriate Druid type, `LONG` in this case)? I guess I wonder if we should also allow using a `RowSignature` or something here which is defined in Druid's native type system, so that these schemas can model all possible schemas that can be created today (and the way schemas are currently built internally via segment metadata) as an alternative to defining types using SQL types, since the native types also serialize into simple strings.
@clint, thanks for the heads-up on the complex types. Can you point me to documentation on the details of the type? To any SQL support we already have?
One question is whether a `COMPLEX<json>` column is logically one opaque "blob" (with whatever data appeared on input), or is a compound type where the user defines the fields.
If a JSON column is a blob, then we could look at the Drill `MAP` type, where a column `foo` is simply declared as type `MAP`, which then enables a set of operations and functions, just like any other type. Presumably we'd implement something like the Postgres JSON functions, which are based on the SQL standard.
If the user must declare the structure of a JSON object, then we do have a compound type. In that case, each column is, itself, a record, and can have a nested schema, to however many levels we choose to support. Experience with Drill showed that users' ability to deal with schema is limited when it is one level deep, and rapidly falls to zero when schemas are nested: most of us just don't want to think that hard! Java (and Go, etc.) have language tools to work with nested records; SQL was designed for an earlier era, and working with nested data is painful.
Regardless of how we proceed, we can use the Postgres JSON operators to access fields within a JSON blob, but that would require Calcite parser changes. On the other hand, Drill did coax Calcite into allowing references to `MAP` columns that look like table record references: `myTable.myMap.myNestedMap.myValue`. Regardless of syntax, these would translate internally into functions, maybe `json_get(col, path)` or some such. Perhaps you've already implemented these functions?
@clintropolis, you also asked about SQL mapping. My suggestion is to enforce a limited set of types: `VARCHAR`, `BIGINT`, `FLOAT` and `DOUBLE`, which directly correspond to the Druid storage types. (Other types can be intermediate values.) This way, if Druid were ever to support a 1-byte integer value, we could use `TINYINT` (or `BOOLEAN`) to label that type. If we mapped `TINYINT` to `long` internally, then we'd have an ambiguous mess later on. We've already got the beginnings of ambiguity with `TIMESTAMP`: it has a meaning in SQL, but we just work with it as a `long`. SQL does require rigor in the type system to keep everything straight.
> thanks for the heads-up on the complex types. Can you point me to documentation on the details of the type? To any SQL support we already have?
(heh, I think you tagged the wrong person in your comments, sorry other @clint 😅). Nested data columns are described in proposal #12695 and PR #12753. They are wired up to SQL, though I'm mostly just using them as an example. Like all complex types, it is currently handled in a more or less opaque manner (functions which know how to deal with `COMPLEX<json>` do things with it; things that aren't aware do not). This was maybe not a great example because I'm considering making this stuff into top-level native Druid types, though it would most likely be in the addition of both `VARIANT` and `STRUCT` (or `MAP` or something similar), since if it were done entirely with native types the current `COMPLEX<json>` is effectively whatever type it encounters (so might be normal scalar types `LONG`, `STRING`, etc.; a `VARIANT` type; a nested type `STRUCT`; arrays of primitives `ARRAY<LONG>`, etc.; arrays of objects `ARRAY<STRUCT>`; nested arrays `ARRAY<ARRAY<STRUCT>>`; and so on).
Complex types can be defined as dimensions or metrics, so we can't count on defining them all in terms of aggregators.
Internally, we currently build the SQL schemas for Calcite with `DruidTable`, which represents the schema with a `RowSignature`, which is defined using Druid native types that it collects from SegmentMetadata queries. Complex types are represented internally in Calcite with `ComplexSqlType` whenever it is necessary to represent them as an actual SQL type, though this is a relatively new construct that isn't used everywhere yet (since many of our functions which have complex inputs and outputs predate this construct at the Calcite level and will use the `ANY` and `OTHER` SQL types, deferring actual validation that it is the correct complex type until translation to a native Druid query, which can check against the native Druid types in the `RowSignature` of the table).
> My suggestion is to enforce a limited set of types: `VARCHAR`, `BIGINT`, `FLOAT` and `DOUBLE`, which directly correspond to the Druid storage types.
This is my main point: these are not the only Druid storage types; the current proposal is only able to model a rather small subset of the types which can appear in Druid segments. The complex type system is extensible, meaning there is potentially a large set of complex types based on which extensions are loaded. Internally these are all basically opaque, which is why we have the generic `COMPLEX<typeName>` JSON representation of the native type, which we use to extract the `typeName` and look up the handlers for that type. Many of these types are tied to aggregators, but multiple aggregators can make the same type, and many aggregators support ingesting pre-existing (usually binary) format values. I think we need something generic like `COMPLEX<typeName>` for the native types so that we can retain the `typeName`, so that functions can perform validation and provide meaningful error messages when using a `COMPLEX<thetaSketch>` input on a function that expects `COMPLEX<HLLSketch>` or whatever, and then in the native layer choose the correct set of handlers for the type. Otherwise every single complex type will need to devise a way for the catalog to recognize it, which sounds like a lot of work for rather low utility.
There will also likely be `ARRAY`-typed columns in the near future, so we'll need to be sure we can model those as well, where I guess if it handles stuff like `VARCHAR ARRAY` it would be fine as currently proposed, though I've seen other ways of defining array types in the wild (looks at BigQuery, though I used the same syntax for the native Druid type representation...) so I'm not sure how hard the standard is here.
Based on advice from others, I've dropped the ideas around rollup tables: there will be no attempt to describe the aggregations for a rollup table. We'll leave that to the user to decide how to structure rollups.
@clintropolis notes:
> these are not the only Druid storage types, the current proposal is only able to model a rather small subset of the types which can appear in Druid segments
The intention is that a combination of the column spec and column type provides a description of all possible column types. Sorry if that was not clear: the focus in the aggregate section was on, well, aggregates. I just hadn't gotten far enough to need to deal with the others yet.
One constraint I want us to keep in mind is that we'd like to eventually allow DDL statements something like:
CREATE ROLLUP TABLE foo (
__time TIMESTAMP,
a IP_ADDRESS,
b ARRAY(STRING),
c SUM(LONG),
d STRUCT(e STRING, f DOUBLE),
g VARCHAR WITH COMPACT INDEX
)
PARTITION BY DAY
CLUSTER BY a, g
So, the type names have to be SQL-like and SQL-parsable.
With a bit more research on complex types, it sounds like we have three categories. Some types are stored as a simple type (such as `long`), but have additional semantics which the user has to provide on ingest, compaction and query. To turn compaction into "auto-compaction", we store the aggregation in segments, but that is invisible to users.

My proposal (since withdrawn until we rethink it) is:
- Complex dimension types get a SQL-like name, such as `IP_ADDRESS`.
- Measure (aggregate) types are written in functional form, such as `SUM(LONG)`.
- Array types are written as, for example, `ARRAY(LONG)`.
There is no good answer for user-visible structures because those are not part of the SQL domain of discourse. There is an ill-fated project, SQL++, that tried to find a solution. It seems it was adopted by Apache AsterixDB and Couchbase.
In Drill, we handled the types outside of SQL by using (an earlier version of) an Arrow-like format. The current thinking is to adapt that pattern to be more SQL- and Druid-like for use in the catalog, and in eventual SQL DDL statements. For example, we could invent syntax such as `STRUCT(a STRUCT(b BIGINT, c VARCHAR), d DOUBLE)`.
Array columns can be represented similarly: `ARRAY(DOUBLE)`, say. FWIW, Arrow uses `list<double>`.
For the first catalog PR, the types are "to be named later": we're just focusing on storing the type names, whatever we decide they are. This gives us time to continue the type name discussion.
The catalog proposes using a different column "kind" for dimensions and measures. (Where "kind" is the Jackson type field in the JSON.) In this way, we know the difference between a complex dimension (such as `IP_ADDRESS`) and a complex measure (`SUM(LONG)`). If there are types that can be both a dimension and a measure (are both aggregates and not), then the column "kind" would disambiguate.
The kind, by the way, allows us to specify other per-column-kind information. For example, if there are multiple choices for an index type for dimensions, that would be a dimension-only attribute of a column as suggested in the DDL sketch above.
Anyway, the point is taken: we do need a full design for all column types. I'll work up something.
Updated the proposal to remove the idea of a rollup table. That idea will come as a separate proposal later. The non-spec comments above preserve the discussion: the "spec" comments describe the updated design.
Since column type is now just the storage type, we can use the Druid names and optional SQL aliases. The type used in the catalog is the Druid type, converted to SQL syntax. That is, `COMPLEX<FOO>` would become `COMPLEX(FOO)`. Complex types are (typically) defined in extensions; each such extension can define a type alias, such as just `FOO` for the above example.
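Under this convention, a column spec might look like the following sketch, where `IP_ADDRESS` is a hypothetical complex type registered by an extension: the full form is `COMPLEX(IP_ADDRESS)`, and the bare alias `IP_ADDRESS` would also work if the extension defines one.

{
  "type": "column",
  "name": "sourceIp",
  "sqlType": "COMPLEX(IP_ADDRESS)"
}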
Druid is a powerful database optimized for time series at extreme scale. Druid provides features beyond those of a typical RDBMS: a flexible schema, ability to ingest from external data sources, support for long-running ingest jobs, and more. Users coming from a traditional database can be overwhelmed by the many choices. Any given application or datasource uses a subset of those features; it would be convenient for Druid, rather than the user, to remember the feature choices which the user made.
For example, Druid provides many different segment granularities, yet any given datasource tends to prefer one of them on ingest. Druid allows each segment to have a distinct schema, but many datasources want to ensure that at least some minimal set of “key” columns exist. Most datasources use the same metric definitions on each ingest. And so on.
Traditional RDBMS systems use a catalog to record the schema of tables, the structure of indexes, entity-relationships between tables and so on. In such systems, the catalog is an essential part of the system: it is the only way to interpret the layout of the binary table data, say, or to know which indexes relate to which tables. Druid is much different: each segment is self-contained: it has its own “mini-catalog.”
Still, as Druid adds more SQL functionality, we believe it will be convenient for users to have an optional catalog of table (datasource) definitions to avoid the need to repeat common table properties. This is especially useful for the proposed multi-stage ingest project.
Proposal Summary
Proposed is an add-on metadata catalog that allows the user to record data shape decisions in Druid and reuse them. The catalog contains:
Technically, the proposal envisions the following:
Motivation
With the catalog, a user can define an ingestion input source separate from a SQL INSERT statement. This is handy as the current EXTERN syntax requires that the user write out the input definition in JSON within a SQL statement.
The user first defines the input table, using the REST API or (eventually) the SQL DDL statements. Then, the user references the input table as if it were a SQL table. An example of one of the CalciteInsertDmlTest cases using an input table definition is sketched below.
Here `input` is a schema that contains input table definitions, while `inline` is a user-defined table that is an in-line CSV input source. Similarly, when using SQL to ingest into a datasource, the user can define things like segment granularity in the catalog rather than manually including it in each SQL statement.
We expect to support additional use cases over time: the above should provide a sense of how the catalog can be used.
Catalog as "Hints"
Druid has gotten by this long without a catalog, so the use of the catalog is entirely optional: use it if it is convenient, specify things explicitly if that is more convenient. For this reason, the catalog can be seen as a set of hints. The "hint" idea contrasts with the traditional RDBMS (or the Hive) model in which the catalog is required.
External Tables
Unlike query tools such as Drill, Impala or Presto, Druid never reads the same input twice: each read ingests a distinct set of input files. The external table definition provides a way to parameterize the actual set of files: perhaps the S3 bucket or HDFS location is the same, the file layout is the same, but the specific files differ on each run.
Resolve the Chicken-and-Egg Dilemma
We noted above that segments are their own "mini-catalogs" and provide the information needed for compaction and native queries to do their job. The problem is, however, creating segments, especially the first ones: there is no "mini-catalog" to consult, so the user has to spell out the details. The catalog resolves this dilemma by allowing the metadata to exist before the first segment. As a bonus, once a table (datasource) is defined in the catalog, it can be queried, though the query will obviously return no rows. A `SELECT *` will return the defined schema. Similarly, if a user adds a column to the table, that column is immediately available for querying, even if it returns all `NULL` values. This makes the Druid experience just a bit simpler, as the user need not remember when a datasource (or column) will appear (after the first ingestion of non-null data).

Query Column Governance
Druid allows columns to contain any kind of data: you might start with a `long` (`BIGINT`) column, later ingest `double` (`DOUBLE`) values, and even later decide to make the column a `string` (`VARCHAR`). The SQL layer uses the latest segment type to define the one type which SQL uses. The catalog lets the user specify this type: if the catalog defines a type for a column, then all values are cast to that type. This means that, even if a column is all-null (or never ingested), SQL still knows the type.

Cold Storage
Druid caches segments locally in Historical nodes. Historicals report the schema of each segment to the Broker, which uses them, as described above, to work out the "SQL schema" for a datasource. But, what if Druid were to provide a "cold tier" mode in which seldom-used data resides only in cold storage? No Historical would load the segment, so the Broker would be unaware of the schema. The catalog resolves this issue by letting the user define the schema separately from the segments that make up the datasource.
Components
The major components of the metadata system follow along the lines of similar mechanisms within Druid: basic authentication, segment publish state, etc. There appears to be no single Druid sync framework to keep nodes synchronized with the Coordinator, so we adopt bits and pieces from each.
Metadata DB Extension
Defines a new table, perversely named "tables", that holds the metadata for a "table." A datasource is a table, but so is a view or an input source. The metadata DB extension is modeled after many others: it provides the basic CRUD semantics. It also maintains a simple version (timestamp) to catch concurrent updates.
REST Endpoint
Provides the usual CRUD operations via REST calls as operations on the Coordinator, proxied through the Router. Security in these endpoints is simple: it is based on security of the underlying object: view, datasource, etc.
DB Synchronization
Keeps Broker nodes updated to the latest state of the catalog DB. Patterned after the mechanism in the basic auth extension, but with a delta update feature borrowed from an extension that has that feature.
Planner Integration
The primary focus of this project is using catalog metadata for SQL statements, and, in particular, INSERT and REPLACE statements. Input tables replace the need for the EXTERN macro; datasource metadata replaces the need to spell out partitioning and clustering.
SQL DDL Statements
As Druid extends its SQL support, an obvious part of this catalog proposal would be DDL statements such as `CREATE/ALTER/DROP TABLE`, etc. This support is considered a lower priority because:

Rollup Datasources
NOTE: This section is now out of scope and is no longer planned. Leaving this here to spur future discussion.
The main challenge is around rollup datasources. In rollup, the datasource performs aggregation. It is easy to think that ingestion does the aggregation, but consider this example: ingest a set of files, each with one row. You'll get a set of, say, dozens of single-row segments, each with the "aggregation" of a single row. The compaction mechanism then combines these segments to produce one with overall totals. This process continues if we add more segments in the same time interval and compact again.
This little example points out that compaction knows how to further aggregate segments: even those with a single row. Of course, ingestion can do the same trick, if there happen to be rows with the same dimensions. But, since compaction can also do it, we know that there is sufficient state in the one-row "seed" aggregate for further compaction to occur. We want to leverage this insight.
The idea is, in SQL INSERT-style ingestion, the work happens in three parts:
This means that we can convert the following example:
To this form:
Here:

- For the `__time` column, the metadata says the rollup grain, so that the user can omit the `TIME_FLOOR` in the SQL: the metadata will cause the planner to insert the proper `TIME_FLOOR` function.
- The `COUNT` column is not specified: it is implicitly 1 for every row, so there is no need to have the user tell us that. Later stages use a "sum" to accumulate the counts, as today.
- The `APPROX_COUNT_DISTINCT_DS_HLL` function takes a single argument, so the planner can infer to use that function to convert from a scalar to a "seed" aggregate.

The column-level rules operate much like the built-in type coercion rules which SQL provides. Instead of simply converting an `INT` to a `BIGINT`, Druid adds rules to implicitly convert a scalar `BIGINT` to a `SUM(BIGINT)` column.

Extensions
A possible feature is to allow an external service to provide the catalog via an extension. An example of this is to use the Confluent schema registry, the Hive Metastore, etc. We'll flesh out this option a bit more as we get further along.
Alternatives
The three main alternatives are:
Since the catalog will become core to Druid, we tend to favor creating one focused on the rather unique needs which Druid has.
Development Phases
This project has multiple parts. A basic plan is:

- SQL DDL statements (`CREATE/ALTER/DROP TABLE`, etc.)

Backward Compatibility
Existing Druid installations will create the new catalog table upon upgrade. It will start empty. If a datasource has no metadata then Druid will behave exactly as it did before the upgrade.
If a version with the catalog is downgraded, the old Druid version will simply ignore the catalog and the user must explicitly provide the properties formerly provided by the catalog.