apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Fix backward compatibility issues in WindowOperatorQueryFrameProcessorFactory and WindowOperatorQueryFrameProcessor #17433

Closed Akshat-Jain closed 3 weeks ago

Akshat-Jain commented 3 weeks ago

Description

As part of the GlueingPartitioningOperator changes in #17038, we removed two fields from WindowOperatorQueryFrameProcessorFactory: maxRowsMaterializedInWindow and partitionColumnNames. This breaks backward compatibility when the MSQ controller has the Glueing PR changes but the worker doesn't.

This PR adds those fields back to ensure backward compatibility.
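
To illustrate why the fields need to stay on the wire, here is a minimal sketch, not the actual Druid code. It assumes an older worker rejects a factory payload that lacks either field. `FactoryCompatSketch`, `FactoryPayload`, `toWire`, and `oldWorkerAccepts` are hypothetical names, the field types are assumed, and a plain `Map` stands in for the serialized form.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model; not the real Druid classes or wire format.
final class FactoryCompatSketch {

    // Stand-in for the factory payload a newer controller sends to workers.
    static final class FactoryPayload {
        private final List<String> operatorList;

        // Kept only so that older workers can still read the payload.
        @Deprecated
        private final int maxRowsMaterializedInWindow;
        @Deprecated
        private final List<String> partitionColumnNames;

        FactoryPayload(
            List<String> operatorList,
            int maxRowsMaterializedInWindow,
            List<String> partitionColumnNames
        ) {
            this.operatorList = operatorList;
            this.maxRowsMaterializedInWindow = maxRowsMaterializedInWindow;
            this.partitionColumnNames = partitionColumnNames;
        }

        // Emits every field, including the deprecated ones.
        Map<String, Object> toWire() {
            Map<String, Object> wire = new LinkedHashMap<>();
            wire.put("operatorList", operatorList);
            wire.put("maxRowsMaterializedInWindow", maxRowsMaterializedInWindow);
            wire.put("partitionColumnNames", partitionColumnNames);
            return wire;
        }
    }

    // Models an older worker that (by assumption) requires both fields to be present.
    static boolean oldWorkerAccepts(Map<String, Object> wire) {
        return wire.containsKey("maxRowsMaterializedInWindow")
            && wire.containsKey("partitionColumnNames");
    }
}
```

In this model, a payload produced by `toWire()` is accepted by the older worker, while the same payload with either deprecated key removed is rejected.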

Even with the two fields added back, a second problem appears when the controller has the Glueing PR changes but the workers don't: the controller sends an operatorFactoryList containing the new operators (GlueingPartitioningOperator and PartitionSortOperator), which the older workers don't know about. This causes the following error:

org.apache.druid.rpc.HttpResponseException: Server error [400 Bad Request]; body: {"error":"Please make sure to load all the necessary extensions and jars with type 'glueingPartition' on 'druid/indexer' service. Could not resolve type id 'glueingPartition' as a subtype of `org.apache.druid.query.operator.OperatorFactory` known type ids = [naivePartition, naiveSort, scan, window] (for POJO property 'operatorList')

This PR handles this by moving the operator transformation logic (NaiveSortOperator -> NaivePartitioningOperator -> WindowOperator to GlueingPartitioningOperator -> PartitionSortOperator -> WindowOperator) from the WindowOperatorQueryKit layer to the WindowOperatorQueryFrameProcessor layer. Each worker can then run either the older operator chain (if it is on an older version without the Glueing PR changes) or the new operator chain (if it has them).
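
The idea can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: the `Op` enum stands in for the operator factories, `OperatorChainSketch`, `toNewChain`, and `chainForWorker` are hypothetical names, and the rewrite only handles the exact three-operator pattern described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a worker-side chain rewrite; not the real Druid classes.
final class OperatorChainSketch {

    // Stand-ins for the operator factories named in the description.
    enum Op { NAIVE_SORT, NAIVE_PARTITIONING, WINDOW, GLUEING_PARTITIONING, PARTITION_SORT }

    // Rewrites each NAIVE_SORT -> NAIVE_PARTITIONING -> WINDOW run into
    // GLUEING_PARTITIONING -> PARTITION_SORT -> WINDOW; other operators are copied as-is.
    static List<Op> toNewChain(List<Op> legacy) {
        List<Op> out = new ArrayList<>();
        int i = 0;
        while (i < legacy.size()) {
            boolean legacyTriple = i + 2 < legacy.size()
                && legacy.get(i) == Op.NAIVE_SORT
                && legacy.get(i + 1) == Op.NAIVE_PARTITIONING
                && legacy.get(i + 2) == Op.WINDOW;
            if (legacyTriple) {
                out.add(Op.GLUEING_PARTITIONING);
                out.add(Op.PARTITION_SORT);
                out.add(Op.WINDOW);
                i += 3;
            } else {
                out.add(legacy.get(i));
                i++;
            }
        }
        return out;
    }

    // The controller always sends the legacy chain. A worker that has the Glueing
    // changes rewrites it locally; an older worker runs it unchanged.
    static List<Op> chainForWorker(List<Op> legacy, boolean workerHasGlueingChanges) {
        return workerHasGlueingChanges ? toNewChain(legacy) : legacy;
    }
}
```

Because only operators that every version understands go over the wire in this model, an older worker never has to resolve an operator type it doesn't know.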

Test Plan

To test the compatibility scenarios, I ran two indexers on my local setup and validated queries for the following cases:

  1. Indexer1 (controller) is on older version, indexer2 (some subset of workers) is on newer version.
  2. Indexer1 (controller) is on newer version, indexer2 (some subset of workers) is on older version.

Release note

We are marking two fields as deprecated for window query execution in the MSQ task engine. They will be removed in a future Druid release, so upgrade plans should include an intermediate upgrade to a version that contains these backward-compatibility changes.


This PR has: