
New storage / retention policy API in Druid #14330


abhishekrb19 commented 1 year ago

Motivation

While Druid's rules (load, drop, and broadcast rules) and kill tasks are powerful, they can be complex to use and understand, especially in the context of retention. Druid users need to think about the lifecycle of segments (used/unused), map segments to tiered replicants, and add the appropriate imperative rules, in the correct order, to the rule chain.

Proposed changes

At a high level, users can define a storage policy for the hot tier (i.e., the historical tier) and for deep storage. To that end, we introduce a storage policy API that translates user-defined policies into one or more load and drop rules under the hood.

New API /druid/coordinator/v1/storagePolicy/<dataSource>

The API will accept two parameters in the create payload: hot, which controls how long data is loaded onto the hot tier, and retain, which controls how long data is retained before it is permanently deleted. Both are illustrated in the examples below.
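As an illustrative sketch only (not part of the proposal itself), a policy could be created with a plain HTTP POST against the proposed endpoint; the Coordinator address, dataSource name, and payload here are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateStoragePolicyExample
{
  public static void main(String[] args) throws Exception
  {
    // Hypothetical payload: keep 1 hour on the hot tier, retain 30 days before deletion.
    String policy = "{\"hot\": {\"type\": \"period\", \"period\": \"PT1H\"},"
                    + " \"retain\": {\"type\": \"period\", \"period\": \"P30D\"}}";

    // Placeholder Coordinator address and dataSource name.
    HttpRequest request = HttpRequest.newBuilder(
        URI.create("http://localhost:8081/druid/coordinator/v1/storagePolicy/wikipedia"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(policy))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + ": " + response.body());
  }
}
```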

Translation of storage policy to load & drop rules

A few use cases, along with the storage policy payloads and the corresponding internal load/drop rules, are shown below:

Use case 1. Keep the most recent hour of data in the hot tier and permanently delete all data older than 30 days.

Storage policy:

{
  "hot": {
    "type": "period",
    "period": "PT1H"
  },
  "retain": {
    "type": "period",
    "period": "P30D"
  }
}

Load/drop rules:

[
  {
    "type": "loadByPeriod",
    "period": "PT1H",
    "tieredReplicants": {
      "_default_tier": 1
    }
  },
  {
    "type": "dropBeforeByPeriod",
    "period": "P30D"
  },
  {
    "type": "loadForever",
    "tieredReplicants": {
      "_default_tier": 0
    }
  }
]

Use case 2. Drop all data older than 30 days from the hot tier.

Storage policy:

{
  "hot": {
    "type": "period",
    "period": "P30D"
  }
}

Load/drop rules:

[
  {
    "type": "loadByPeriod",
    "period": "P30D",
    "tieredReplicants": {
      "_default_tier": 1
    }
  },
  {
    "type": "loadForever",
    "tieredReplicants": {
      "_default_tier": 0
    }
  }
]

Use case 3. Delete all data older than 60 days.

Storage policy:

{
  "retain": {
    "type": "period",
    "period": "P60D"
  }
}

Load/drop rules:

[
  {
    "type": "dropBeforeByPeriod",
    "period": "P60D"
  },
  {
    "type": "loadForever"
  }
]
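To make the translation above concrete, here is a rough, hypothetical sketch of how a period-based policy could be expanded into an ordered rule chain. The StoragePolicy record and translate method are illustrative only and do not correspond to Druid's actual rule classes:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical policy representation: each field is an ISO-8601 period, or null if absent.
record StoragePolicy(String hotPeriod, String retainPeriod) {}

class StoragePolicyTranslator
{
  // Expands a period-based policy into an ordered list of rule specs (as JSON-like maps).
  static List<Map<String, Object>> translate(StoragePolicy policy)
  {
    List<Map<String, Object>> rules = new ArrayList<>();

    if (policy.hotPeriod() != null) {
      // Load the most recent hotPeriod of data onto the hot (historical) tier.
      Map<String, Object> load = new LinkedHashMap<>();
      load.put("type", "loadByPeriod");
      load.put("period", policy.hotPeriod());
      load.put("tieredReplicants", Map.of("_default_tier", 1));
      rules.add(load);
    }

    if (policy.retainPeriod() != null) {
      // Permanently drop data older than retainPeriod.
      Map<String, Object> drop = new LinkedHashMap<>();
      drop.put("type", "dropBeforeByPeriod");
      drop.put("period", policy.retainPeriod());
      rules.add(drop);
    }

    // Catch-all rule: keep everything else in deep storage, with zero hot-tier
    // replicas when a hot period was specified.
    Map<String, Object> catchAll = new LinkedHashMap<>();
    catchAll.put("type", "loadForever");
    if (policy.hotPeriod() != null) {
      catchAll.put("tieredReplicants", Map.of("_default_tier", 0));
    }
    rules.add(catchAll);

    return rules;
  }
}
```

For example, translate(new StoragePolicy("PT1H", "P30D")) would yield the three-rule chain shown in use case 1, while translate(new StoragePolicy(null, "P60D")) would yield the two rules in use case 3.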

Extensibility & Maintainability

Similar to the period-based policies above, we can add interval-based and custom-tiered policies for more advanced users. For example:

a. Interval-based policy:

{
  "hot": {
    "type": "intervals",
    "intervals": ["2020-01-01/2022-01-01", "2023-01-01/9999-01-01"]
  }
}

b. Custom-tiered policy:

{
  "hot": {
    "type": "tiered",
    "tiers": {
      "hot1": {"type": "period", "period": "P60D"},
      "hot2": {"type": "period", "period": "P90D"}
    }
  },
  "retain": {"type": "period", "period": "P1Y"}
}

The API will need to translate user-defined storage policies to rules as we extend support to cover more complex use cases.

High-level implementation

The API implementation will support POST, GET, and DELETE operations to create, retrieve, and delete the configured storage policy for a data source. Like the rules endpoint, this new endpoint should be served by the Coordinator and should return appropriate error/status codes to the user. The implementation of the API will:
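For illustration only, here is a minimal sketch of what such a Coordinator resource could look like using JAX-RS. The class name, the in-memory map standing in for persistence, and the omission of rule translation are all assumptions, not part of the proposal:

```java
import javax.ws.rs.Consumes;
import javax.ws.rs.DELETE;
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical resource sketch; a real implementation would validate the policy,
// translate it into load/drop rules, and persist it via the Coordinator.
@Path("/druid/coordinator/v1/storagePolicy")
public class StoragePolicyResource
{
  // Stand-in for persisted policies, keyed by dataSource.
  private final Map<String, Map<String, Object>> policies = new ConcurrentHashMap<>();

  @POST
  @Path("/{dataSource}")
  @Consumes(MediaType.APPLICATION_JSON)
  public Response createPolicy(@PathParam("dataSource") String dataSource, Map<String, Object> policy)
  {
    // Validation and translation to load/drop rules would happen here.
    policies.put(dataSource, policy);
    return Response.ok().build();
  }

  @GET
  @Path("/{dataSource}")
  @Produces(MediaType.APPLICATION_JSON)
  public Response getPolicy(@PathParam("dataSource") String dataSource)
  {
    Map<String, Object> policy = policies.get(dataSource);
    return policy == null
           ? Response.status(Response.Status.NOT_FOUND).build()
           : Response.ok(policy).build();
  }

  @DELETE
  @Path("/{dataSource}")
  public Response deletePolicy(@PathParam("dataSource") String dataSource)
  {
    return policies.remove(dataSource) == null
           ? Response.status(Response.Status.NOT_FOUND).build()
           : Response.ok().build();
  }
}
```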

Rationale

The main benefit of the API is that it abstracts away the complex inner workings of load, drop, and kill rules. It provides a declarative interface for reasoning about retention, similar to what many other systems offer.

Operational impact

Since this is an API-only change that leverages the existing load/drop rule functionality, nothing needs to be deprecated in the short term. If the new API eventually becomes equally powerful, it may make sense to consider deprecating the rules API at that point.

Future work

In environments with multiple hot tiers, users must manually enumerate the tiers in tieredReplicants if they use load rules. We can extend the storage policy API to include all tiers by default when tieredReplicants is not supplied, as sketched below.
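A rough sketch of that defaulting behavior, assuming the set of tier names is obtained from the Coordinator's view of the cluster (the class and method names here are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class TieredReplicantsDefaults
{
  // Builds a default tieredReplicants map covering every known historical tier,
  // so users do not have to enumerate tiers by hand. The tier names would come
  // from the Coordinator's view of the cluster; here they are passed in directly.
  static Map<String, Integer> defaultReplicants(Set<String> knownTiers)
  {
    Map<String, Integer> replicants = new LinkedHashMap<>();
    for (String tier : knownTiers) {
      replicants.put(tier, 1); // one replica per tier by default
    }
    return replicants;
  }
}
```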

gianm commented 1 year ago

Some initial thoughts:

From the "operational impact" section, it sounds like there is no storage as part of this proposal, just syntax sugar on load/drop rules. Is this correct? If so:

Additionally, did you consider options where the storage policy is a real object that is stored, perhaps in the new (& currently not-really-used) catalog? In that case the API would be through CatalogResource. Curious what you see as the pros and cons of these two approaches.

vogievetsky commented 1 year ago

Overall I really like the design.

Some thoughts and questions: