apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

[Draft] 0.21.0 Release Notes #10752

Closed jihoonson closed 3 years ago

jihoonson commented 3 years ago

Apache Druid 0.21.0 contains around 120 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 36 contributors. Refer to the complete list of changes and everything tagged to the milestone for further details.

# New features

# Operation

# Service discovery and leader election based on Kubernetes

The new Kubernetes extension supports service discovery and leader election based on Kubernetes. This extension works in conjunction with the HTTP-based server view (druid.serverview.type=http) and task management (druid.indexer.runner.type=httpRemote) to allow you to run a Druid cluster with zero ZooKeeper dependencies. This extension is still experimental. See Kubernetes extension for more details.
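As a rough illustration, the two settings named above could be written into a runtime.properties fragment like this. The `render_properties` helper is hypothetical, and the fragment is deliberately incomplete: the extension load list and Kubernetes discovery settings are not covered here, so refer to the Kubernetes extension documentation for those.

```python
def render_properties(props):
    """Render a dict as Java-style .properties lines, one key=value per line."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


# Both settings are taken from the release note above; nothing else is assumed.
zk_free_settings = {
    "druid.serverview.type": "http",
    "druid.indexer.runner.type": "httpRemote",
}

print(render_properties(zk_free_settings))
```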

https://github.com/apache/druid/pull/10544 https://github.com/apache/druid/pull/9507 https://github.com/apache/druid/pull/10537

# New dynamic coordinator configuration to limit the number of segments when finding a candidate segment for segment balancing

You can set the percentOfSegmentsToConsiderPerMove to limit the number of segments considered when picking a candidate segment to move. The candidates are searched up to maxSegmentsToMove * 2 times. This new configuration prevents Druid from iterating through all available segments to speed up the segment balancing process, especially if you have lots of available segments in your cluster. See Coordinator dynamic configuration for more details.
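To get a feel for the effect, here is a minimal sketch of how a percentage might translate into a per-move search bound. The `segments_to_consider` helper and its ceiling rounding are assumptions made for illustration, not the Coordinator's actual code; only the two configuration names come from the text above.

```python
import json
import math


def segments_to_consider(total_segments, percent):
    """Upper bound on segments examined per move (ceiling rounding is assumed)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return math.ceil(total_segments * percent / 100)


# Example dynamic configuration payload; the values are illustrative only.
dynamic_config = {
    "maxSegmentsToMove": 5,
    "percentOfSegmentsToConsiderPerMove": 25,
}

print(json.dumps(dynamic_config, indent=2))
print(segments_to_consider(200_000, dynamic_config["percentOfSegmentsToConsiderPerMove"]))
```

With 200,000 available segments and a setting of 25, this model would examine at most 50,000 segments per move instead of all of them.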

https://github.com/apache/druid/pull/10284

# status and selfDiscovered endpoints for Indexers

The Indexer now supports status and selfDiscovered endpoints. See Processor information APIs for details.

https://github.com/apache/druid/pull/10679

# Querying

# New grouping aggregator function

You can use the new grouping aggregator SQL function with GROUPING SETS or CUBE to indicate which grouping dimensions are included in the current grouping set. See Aggregation functions for more details.
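The value such a function returns is easiest to read as a bitmask. The sketch below models the standard SQL GROUPING convention, where a bit is 1 when the corresponding dimension is not part of the current grouping set and the first argument is the most significant bit. It is a Python model for intuition, assuming Druid follows that convention; check Aggregation functions for the authoritative definition.

```python
def grouping_value(args, grouping_set):
    """Bitmask over args: a bit is 1 if that dimension is absent from grouping_set."""
    value = 0
    for dim in args:
        value = (value << 1) | (0 if dim in grouping_set else 1)
    return value


# Hypothetical dimensions, as in GROUPING(country, city) over four grouping sets.
args = ["country", "city"]
for grouping_set in [{"country", "city"}, {"country"}, {"city"}, set()]:
    print(sorted(grouping_set), grouping_value(args, grouping_set))
```

Under this convention the four grouping sets yield 0, 1, 2, and 3 respectively, so a result row with value 1 is a subtotal over `city`.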

https://github.com/apache/druid/pull/10518

# Improved missing argument handling in expressions and functions

Expression processing can now be vectorized when inputs are missing, for example when an expression refers to a non-existent column. When an argument is missing in an expression, Druid can now infer the proper result type from the non-null arguments. For instance, for longColumn + nonExistentColumn, nonExistentColumn is treated as (long) 0 instead of (double) 0.0. Finally, in default null handling mode, math functions now produce proper output by treating missing arguments as zeros.
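The type inference can be pictured with a small Python model. This is not Druid's expression engine; `zero_like` and `plus` are hypothetical names, with Python's `int` and `float` standing in for Druid's long and double.

```python
def zero_like(value):
    """Return a zero of the same numeric type as value (int -> 0, float -> 0.0)."""
    return type(value)(0)


def plus(left, right):
    """Add two values, replacing a missing (None) side with a zero of the other side's type."""
    if left is None and right is None:
        raise ValueError("no non-null argument to infer a type from")
    if left is None:
        left = zero_like(right)
    if right is None:
        right = zero_like(left)
    return left + right


print(plus(7, None))    # the missing side becomes an integer zero, so the result stays an int
print(plus(2.5, None))  # the missing side becomes 0.0, so the result stays a float
```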

https://github.com/apache/druid/pull/10499

# Allow zero period for TIMESTAMPADD

The TIMESTAMPADD function now allows a zero period. This functionality is required by some BI tools, such as Tableau.

https://github.com/apache/druid/pull/10550

# Ingestion

# Native parallel ingestion no longer requires explicit intervals

The Parallel task no longer requires you to set explicit intervals in granularitySpec. If intervals are missing, the Parallel task executes an extra input sampling step that collects the intervals to index.

https://github.com/apache/druid/pull/10592 https://github.com/apache/druid/pull/10647

# Old Kafka version support

Druid now supports Apache Kafka versions older than 0.11. To read from an old version of Kafka, set isolation.level to read_uncommitted in consumerProperties. Only version 0.10.2.1 has been tested as of this release. See Kafka supervisor configurations for details.
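A minimal sketch of the relevant part of a supervisor spec might look as follows. The topic name and broker address are placeholders, and the rest of the spec is omitted; only consumerProperties and isolation.level come from the note above.

```python
import json

# Fragment of a Kafka supervisor ioConfig; "example-topic" and the broker
# address are hypothetical placeholders for your own values.
io_config = {
    "topic": "example-topic",
    "consumerProperties": {
        "bootstrap.servers": "kafka-broker:9092",
        "isolation.level": "read_uncommitted",
    },
}

print(json.dumps(io_config, indent=2))
```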

https://github.com/apache/druid/pull/10551

# Multi-phase segment merge for native batch ingestion

A new tuningConfig, maxColumnsToMerge, controls how many segments can be merged at the same time in the task. This configuration can be useful to avoid high memory pressure during the merge. See tuningConfig for native batch ingestion for more details.

https://github.com/apache/druid/pull/10689

# Native re-ingestion is less memory intensive

Parallel tasks now sort segments by ID before assigning them to subtasks. This sorting minimizes the number of time chunks for each subtask to handle. As a result, each subtask is expected to use less memory, especially when a single Parallel task is issued to re-ingest segments covering a long time period.
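The effect of sorting can be seen in a toy model. Here each segment is represented as a (time chunk, partition number) pair and subtasks receive consecutive batches; both the representation and the batching are simplifications assumed for illustration, not the Parallel task's actual assignment logic.

```python
def distinct_chunks_per_subtask(segment_ids, batch_size):
    """Split segments into consecutive batches and count the time chunks each batch touches."""
    batches = [segment_ids[i:i + batch_size] for i in range(0, len(segment_ids), batch_size)]
    return [len({chunk for chunk, _ in batch}) for batch in batches]


# (time chunk, partition number) pairs in an arbitrary arrival order.
unsorted_ids = [
    ("2021-01-03", 0), ("2021-01-01", 1), ("2021-01-02", 0),
    ("2021-01-01", 0), ("2021-01-03", 1), ("2021-01-02", 1),
]

print(distinct_chunks_per_subtask(unsorted_ids, 2))          # [2, 2, 2]
print(distinct_chunks_per_subtask(sorted(unsorted_ids), 2))  # [1, 1, 1]
```

After sorting, every subtask in this example handles a single time chunk, which is the property that keeps per-subtask memory use down.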

https://github.com/apache/druid/pull/10646

# Web console

# Updated and improved web console styles

The new web console styles make better use of the Druid brand colors and standardize paddings and margins throughout. The icon and background colors are now derived from the Druid logo.

https://github.com/apache/druid/pull/10515

# Partitioning information is available in the web console

The web console now shows datasource partitioning information in the new Segment granularity and Partitioning columns.

Segment granularity column in the Datasources tab

Partitioning column in the Segments tab

https://github.com/apache/druid/pull/10533

# The column order in the Schema table matches the dimensionsSpec

The Schema table now reflects the dimension ordering in the dimensionsSpec.

https://github.com/apache/druid/pull/10588

# Metrics

# Coordinator duty runtime metrics

The Coordinator performs several 'duty' tasks, such as segment balancing and loading new segments. Two new metrics help you analyze how fast the Coordinator executes these duties.

https://github.com/apache/druid/pull/10603

# Query timeout metric

A new metric provides the number of timed-out queries. Previously, timed-out queries were treated as interrupted and included in query/interrupted/count (see Improved HTTP status codes for query errors for more details).

query/timeout/count: the number of timed out queries during the emission period

https://github.com/apache/druid/pull/10567

# Shuffle metrics for batch ingestion

Two new metrics provide shuffle statistics for MiddleManagers and Indexers. These metrics have the supervisorTaskId as their dimension.

To enable the shuffle metrics, add org.apache.druid.indexing.worker.shuffle.ShuffleMonitor in druid.monitoring.monitors. See Shuffle metrics for more details.
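As a sketch, the property line could be generated like this, assuming druid.monitoring.monitors takes a JSON array of class names. The `with_monitor` helper is hypothetical; it only appends the class when it is not already listed, so any monitors you have configured are kept.

```python
import json

SHUFFLE_MONITOR = "org.apache.druid.indexing.worker.shuffle.ShuffleMonitor"


def with_monitor(existing_monitors, monitor):
    """Return the monitor list with the given class appended if it is not already present."""
    if monitor in existing_monitors:
        return list(existing_monitors)
    return list(existing_monitors) + [monitor]


monitors = with_monitor([], SHUFFLE_MONITOR)
print("druid.monitoring.monitors=" + json.dumps(monitors))
```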

https://github.com/apache/druid/pull/10359

# New clock-drift safe metrics monitor scheduler

The default metrics monitor scheduler is implemented based on ScheduledThreadPoolExecutor which is prone to unbounded clock drift. A new monitor scheduler, ClockDriftSafeMonitorScheduler, overcomes this limitation. To use the new scheduler, set druid.monitoring.schedulerClassName to org.apache.druid.java.util.metrics.ClockDriftSafeMonitorScheduler in the runtime.properties file.

https://github.com/apache/druid/pull/10448 https://github.com/apache/druid/pull/10732

# Others

# New extension for a password provider based on AWS RDS token

A new PasswordProvider type allows access to AWS RDS DB instances using temporary AWS tokens. This extension can be useful when an RDS is used as Druid's metadata store. See AWS RDS extension for more details.

https://github.com/apache/druid/pull/9518

# The sys.servers table shows leaders

A new long-typed column is_leader in the sys.servers table indicates whether or not the server is the leader.
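For example, a query for the current leaders could be packaged for the Druid SQL HTTP API as below. This sketch assumes that a value of 1 marks a leader and only builds the request body; sending it is left to whatever HTTP client you already use.

```python
import json


def leader_query_payload():
    """Build a JSON request body that asks sys.servers for the current leaders."""
    sql = (
        "SELECT server, server_type, is_leader "
        "FROM sys.servers "
        "WHERE is_leader = 1"  # assumption: 1 marks a leader
    )
    return json.dumps({"query": sql})


print(leader_query_payload())
```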

https://github.com/apache/druid/pull/10680

# druid-influxdb-emitter extension supports the HTTPS protocol

See Influxdb emitter extension for new configurations.

https://github.com/apache/druid/pull/9938

# Docker

# Small docker image

The docker image size is reduced by half by eliminating unnecessary duplication.

https://github.com/apache/druid/pull/10506

# Development

# Extensible Kafka consumer properties via a new DynamicConfigProvider

A new class, DynamicConfigProvider, enables fetching consumer properties at runtime. For instance, you can use DynamicConfigProvider to fetch bootstrap.servers from a location such as a local environment variable if it is not static. Currently, only a map-based config provider is supported by default. See DynamicConfigProvider for how to implement a custom config provider.

https://github.com/apache/druid/pull/10309

# Bug fixes

Druid 0.21.0 contains 30 bug fixes. You can see the complete list here.

# Post-aggregator computation with subtotals

Before 0.21.0, queries failed with an error when you used post-aggregators with subtotals. This bug is now fixed, and you can use post-aggregators with subtotals.

https://github.com/apache/druid/pull/10653

# Indexers announce themselves as segment servers

In 0.19.0 and 0.20.0, Indexers could not process queries against streaming data because they did not announce themselves as segment servers. In 0.21.0, they announce themselves properly.

https://github.com/apache/druid/pull/10631

# Validity check for segment files in historicals

Historicals now perform a validity check after they download segment files and automatically re-download those files if they are corrupted.

https://github.com/apache/druid/pull/10650

# StorageLocationSelectorStrategy injection failure is fixed

The injection failure while reading the configurations of StorageLocationSelectorStrategy is fixed.

https://github.com/apache/druid/pull/10363

# Upgrading to 0.21.0

Consider the following changes and updates when upgrading from Druid 0.20.0 to 0.21.0. If you're updating from an earlier version than 0.20.0, see the release notes of the relevant intermediate versions.

# Improved HTTP status codes for query errors

Before this release, Druid returned "internal error (500)" for most query errors. Now Druid returns different error codes based on their cause. The following table lists the errors whose codes have changed:

| Exception | Description | Old code | New code |
| --- | --- | --- | --- |
| SqlParseException and ValidationException from Calcite | Query planning failed | 500 | 400 |
| QueryTimeoutException | Query execution didn't finish in timeout | 500 | 504 |
| ResourceLimitExceededException | Query asked more resources than configured threshold | 500 | 400 |
| InsufficientResourceException | Query failed to schedule because of lack of merge buffers available at the time when it was submitted | 500 | 429, merged to QueryCapacityExceededException |
| QueryUnsupportedException | Unsupported functionality | 400 | 501 |
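Client code that previously treated every failure as a generic 500 may want to branch on the new codes. The sketch below records the new codes listed above and applies one possible retry policy; treating 429 and 504 as retryable is an assumption about what a client might choose, not a recommendation from this release.

```python
# New status codes per exception, as listed in the release notes.
QUERY_ERROR_STATUS = {
    "SqlParseException": 400,
    "ValidationException": 400,
    "QueryTimeoutException": 504,
    "ResourceLimitExceededException": 400,
    "QueryCapacityExceededException": 429,
    "QueryUnsupportedException": 501,
}

# Assumption: capacity and timeout errors may succeed later; the others will not.
RETRYABLE_STATUS = {429, 504}


def should_retry(status_code):
    """Return True if this hypothetical client policy would retry the query."""
    return status_code in RETRYABLE_STATUS


for name, code in QUERY_ERROR_STATUS.items():
    print(name, code, "retry" if should_retry(code) else "fail fast")
```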

There is also a new query metric for query timeout errors. See Query timeout metric for more details.

https://github.com/apache/druid/pull/10464 https://github.com/apache/druid/pull/10746

# Query interrupted metric

query/interrupted/count no longer counts the queries that timed out. These queries are counted by query/timeout/count.

# context dimension in query metrics

context is now a default dimension emitted for all query metrics. context is a JSON-formatted string containing the query context for the query that the emitted metric refers to. The addition of a dimension that was not previously emitted alters some metrics emitted by Druid. You should plan to handle this new context dimension in your metrics pipeline. Since the dimension is a JSON-formatted string, a common solution is to parse the dimension and either flatten it or extract the fields you want and discard the full JSON-formatted string.
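One way to handle the dimension in a metrics pipeline is sketched below: parse the string, copy a few selected keys into flat fields, and drop the original blob. The event shape, the `flatten_context` helper, and the `context_` prefix are all hypothetical; adapt them to whatever your emitter and pipeline actually produce.

```python
import json


def flatten_context(event, keep=("queryId", "timeout")):
    """Copy selected query-context keys into flat fields and drop the JSON string."""
    flat = {key: value for key, value in event.items() if key != "context"}
    context = json.loads(event.get("context") or "{}")
    for key in keep:
        if key in context:
            flat[f"context_{key}"] = context[key]
    return flat


# A made-up metric event carrying the new context dimension.
event = {
    "metric": "query/time",
    "value": 42,
    "context": '{"queryId": "abc-123", "timeout": 30000, "priority": 0}',
}

print(flatten_context(event))
```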

https://github.com/apache/druid/pull/10578

# Deprecated support for Apache ZooKeeper 3.4

As ZooKeeper 3.4 has been end-of-life for a while, support for ZooKeeper 3.4 is deprecated in 0.21.0 and will be removed in the near future.

https://github.com/apache/druid/issues/10780

# Consistent serialization format and column naming convention for the sys.segments table

All columns in the sys.segments table are now serialized in the JSON format to make them consistent with other system tables. Column names now use the same "snake case" convention.

https://github.com/apache/druid/pull/10481

# Known issues

# Known security vulnerability in the Thrift library

The Thrift extension can be useful for ingesting files in the Thrift format into Druid. However, there is a known security vulnerability in the version of the Thrift library that Druid uses. The vulnerability can be exploited by ingesting maliciously crafted Thrift files when you use Indexers. We recommend granting the DATASOURCE WRITE permission only to trusted users.

For a full list of open issues, please see https://github.com/apache/druid/labels/Bug.

# Credits

Thanks to everyone who contributed to this release!

@a2l007 @abhishekagarwal87 @asdf2014 @AshishKapoor @awelsh93 @ayushkul2910 @bananaaggle @capistrant @ccaominh @clintropolis @cloventt @FrankChen021 @gianm @harinirajendran @himanshug @jihoonson @jon-wei @kroeders @liran-funaro @martin-g @maytasm @mghosh4 @michaelschiff @nishantmonu51 @pcarrier @QingdongZeng3 @sthetland @suneet-s @tdt17 @techdocsmith @valdemar-giosg @viatcheslavmogilevsky @viongpanzi @vogievetsky @xvrl @zhangyue19921010

FrankChen021 commented 3 years ago

Hi @jihoonson , should #10363 be mentioned in the release note since it has changed the user-facing configurations ?

jihoonson commented 3 years ago

@FrankChen021 hmm, sure. It seems worth calling it out. I added it under the bug fixes section.

himanshug commented 3 years ago

hey @jihoonson , I am very excited about Service discovery and leader election based on Kubernetes but I am in the process of writing a PR that runs #10680 for k8s based deployment to stress test the k8s based leader election mechanism ... In that process I found two improvements, (1) upgrade to newer version of kubernetes-client library which has some leader election improvements and makes leader changes happen quicker if a node is gracefully stopped (2) a bug which shows up in specific deployments where druid.host gets set to something more than 63 characters which is the max length limit on kubernetes label values

I will address all of the above including introducing the new build in a new PR, those updates would make this feature a lot more robust and production ready. Thanks to @clintropolis for introducing #10680 which really stresses the overlord/coordinator startup, discovery, leader election machinery to help discover above improvements.

that said, I think, 0.21.0 should continue to go on its schedule. For that reason, I think we should remove the mention of this new feature in 0.21.0 release notes and we can document its availability in next release.

jihoonson commented 3 years ago

> that said, I think, 0.21.0 should continue to go on its schedule. For that reason, I think we should remove the mention of this new feature in 0.21.0 release notes and we can document its availability in next release.

@himanshug thanks for taking a look. I understand it's not matured enough to use in production yet, but would like to keep it because it's one of the most exciting new features that 0.21.0 introduces and worth drawing people's attention. From my experience, most people don't want to test new features out in their production right away when they are first introduced. Rather, they would like to wait for a couple of releases until those features become matured and stable. In this case, as it is stated as experimental, people would be especially careful before trying it out. Later, once it becomes production ready, we can promote it again in the release notes by stating like "this cool feature is now production ready. everyone should try it out".

If we didn't mention it in this release, but in some future release when it becomes more stable, I think that people could still underestimate its maturity because it would be a feature that is promoted in the release notes first time. What do you think?

himanshug commented 3 years ago

sure, you are the release manager :)

anyways, now that the conversation is here in this issue, power users would likely note the upcoming improvements. however, by that time new PR would be done and merged for them to cherry pick that specific commit.

jihoonson commented 3 years ago

Thanks. That sounds good :slightly_smiling_face:

Gabriel39 commented 3 years ago

Hi @jihoonson this link https://druid.apache.org/docs/latest/development/extensions-core/kubernetes.html seems like outdated with 404 error. Is there a new site to refer about k8s extension?

jihoonson commented 3 years ago

Hi @Gabriel39, the kubernetes extension is added in 0.21.0 and not available in the latest release (0.20.1) yet. The doc will be updated once 0.21.0 is released.

stroeovidiu commented 3 years ago

Hello, Any update when the 0.21.0 will be released ? Thank you !

jihoonson commented 3 years ago

Hi @stroeovidiu, the vote for 0.21.0 release is in progress. It will pass in a couple of days unless we find some issue in the release.

jp707049 commented 3 years ago

Hello,

Thank you :)

jihoonson commented 3 years ago

Hi @jp707049, no worries and sorry for the delay. The 0.21.0 release vote has just passed! We will announce the release tomorrow.

When you try the k8s extension, please keep in mind that the extension is experimental, so you might see some unexpected bugs.

xvrl commented 3 years ago

@jihoonson do you plan to update the release tag with the release notes and close this?

jihoonson commented 3 years ago

Apologies for the delay. I'm finalizing the release process now.

jihoonson commented 3 years ago

I just announced the 0.21.0 release. Thanks everyone for your patience!