
New storage / retention policy API in Druid #14330


abhishekrb19 commented 1 year ago

Motivation

While Druid's rules (load, drop, and broadcast rules) and kill tasks are powerful, they can be complex to use and understand, especially in the context of retention. Druid users need to think about the lifecycle of segments (used/unused), map segments to tiered replicants, and add the appropriate imperative rules, in the correct order, to the rule chain.

Proposed changes

At a high level, users can define a storage policy for the hot tier (i.e., the historical tier) and for deep storage. To that end, we introduce a storage policy API that translates user-defined policies into one or more load and drop rules under the hood.

New API /druid/coordinator/v1/storagePolicy/<dataSource>

The API will accept two parameters in the create payload: hot, which controls how long data is loaded onto the hot tier, and retain, which controls how long data is retained before it is permanently deleted. Both are illustrated in the examples below.
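As an illustrative sketch only (not part of the proposal itself), a policy could be created with a plain HTTP POST against the proposed endpoint; the Coordinator address, dataSource name, and payload here are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateStoragePolicyExample
{
  public static void main(String[] args) throws Exception
  {
    // Hypothetical payload: keep 1 hour on the hot tier, retain 30 days before deletion.
    String policy = "{\"hot\": {\"type\": \"period\", \"period\": \"PT1H\"},"
                    + " \"retain\": {\"type\": \"period\", \"period\": \"P30D\"}}";

    // Placeholder Coordinator address and dataSource name.
    HttpRequest request = HttpRequest.newBuilder(
        URI.create("http://localhost:8081/druid/coordinator/v1/storagePolicy/wikipedia"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(policy))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + ": " + response.body());
  }
}
```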

Translation of storage policy to load & drop rules

A few use cases, along with the storage policy payloads and the corresponding internal load/drop rules, are shown below:

Use case 1. Keep the most recent hour of data in the hot tier and permanently delete all data older than 30 days.

Storage policy:

{
  "hot": {
    "type": "period",
    "period": "PT1H"
  },
  "retain": {
    "type": "period",
    "period": "P30D"
  }
}

Load/drop rules:

[
  {
    "type": "loadByPeriod",
    "period": "PT1H",
    "tieredReplicants": {
      "_default_tier": 1
    }
  },
  {
    "type": "dropBeforeByPeriod",
    "period": "P30D"
  },
  {
    "type": "loadForever",
    "tieredReplicants": {
      "_default_tier": 0
    }
  }
]

Use case 2. Drop all data older than 30 days from the hot tier.

Storage policy:

{
  "hot": {
    "type": "period",
    "period": "P30D"
  }
}

Load/drop rules:

[
  {
    "type": "loadByPeriod",
    "period": "P30D",
    "tieredReplicants": {
      "_default_tier": 1
    }
  },
  {
    "type": "loadForever",
    "tieredReplicants": {
      "_default_tier": 0
    }
  }
]

Use case 3. Delete all data older than 60 days.

Storage policy:

{
  "retain": {
    "type": "period",
    "period": "P60D"
  }
}

Load/drop rules:

[
  {
    "type": "dropBeforeByPeriod",
    "period": "P60D"
  },
  {
    "type": "loadForever"
  }
]
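To make the translation above concrete, here is a rough, hypothetical sketch of how a period-based policy could be expanded into an ordered rule chain. The StoragePolicy record and translate method are illustrative only and do not correspond to Druid's actual rule classes:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical policy representation: each field is an ISO-8601 period, or null if absent.
record StoragePolicy(String hotPeriod, String retainPeriod) {}

class StoragePolicyTranslator
{
  // Expands a period-based policy into an ordered list of rule specs (as JSON-like maps).
  static List<Map<String, Object>> translate(StoragePolicy policy)
  {
    List<Map<String, Object>> rules = new ArrayList<>();

    if (policy.hotPeriod() != null) {
      // Load the most recent hotPeriod of data onto the hot (historical) tier.
      Map<String, Object> load = new LinkedHashMap<>();
      load.put("type", "loadByPeriod");
      load.put("period", policy.hotPeriod());
      load.put("tieredReplicants", Map.of("_default_tier", 1));
      rules.add(load);
    }

    if (policy.retainPeriod() != null) {
      // Permanently drop data older than retainPeriod.
      Map<String, Object> drop = new LinkedHashMap<>();
      drop.put("type", "dropBeforeByPeriod");
      drop.put("period", policy.retainPeriod());
      rules.add(drop);
    }

    // Catch-all rule: keep everything else in deep storage, with zero hot-tier
    // replicas when a hot period was specified.
    Map<String, Object> catchAll = new LinkedHashMap<>();
    catchAll.put("type", "loadForever");
    if (policy.hotPeriod() != null) {
      catchAll.put("tieredReplicants", Map.of("_default_tier", 0));
    }
    rules.add(catchAll);

    return rules;
  }
}
```

For example, translate(new StoragePolicy("PT1H", "P30D")) would yield the three-rule chain shown in use case 1, while translate(new StoragePolicy(null, "P60D")) would yield the two rules in use case 3.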

Extensibility & Maintainability

Similar to the period-based policies above, we can add interval-based and custom-tiered policies for more advanced users. For example:

a. Interval-based policy:

{
  "hot": {
    "type": "intervals",
    "intervals": ["2020-01-01/2022-01-01", "2023-01-01/9999-01-01"]
  }
}

b. Custom-tiered policy:

{
  "hot": {
    "type": "tiered",
    "tiers": {
      "hot1": {"type": "period", "period": "P60D"},
      "hot2": {"type": "period", "period": "P90D"}
    }
  },
  "retain": {"type": "period", "period": "P1Y"}
}

The API will need to translate user-defined storage policies to rules as we extend support to cover more complex use cases.

High-level implementation

The API implementation will support POST, GET, and DELETE operations to create, retrieve, and delete the configured storage policy for a data source. Like the rules endpoint, this new endpoint should be served by the Coordinator and should return appropriate error/status codes to the user. The implementation of the API will:
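For illustration only, here is a minimal sketch of what such a Coordinator resource could look like using JAX-RS. The class name, the in-memory map standing in for persistence, and the omission of rule translation are all assumptions, not part of the proposal:

```java
import javax.ws.rs.Consumes;
import javax.ws.rs.DELETE;
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical resource sketch; a real implementation would validate the policy,
// translate it into load/drop rules, and persist it via the Coordinator.
@Path("/druid/coordinator/v1/storagePolicy")
public class StoragePolicyResource
{
  // Stand-in for persisted policies, keyed by dataSource.
  private final Map<String, Map<String, Object>> policies = new ConcurrentHashMap<>();

  @POST
  @Path("/{dataSource}")
  @Consumes(MediaType.APPLICATION_JSON)
  public Response createPolicy(@PathParam("dataSource") String dataSource, Map<String, Object> policy)
  {
    // Validation and translation to load/drop rules would happen here.
    policies.put(dataSource, policy);
    return Response.ok().build();
  }

  @GET
  @Path("/{dataSource}")
  @Produces(MediaType.APPLICATION_JSON)
  public Response getPolicy(@PathParam("dataSource") String dataSource)
  {
    Map<String, Object> policy = policies.get(dataSource);
    return policy == null
           ? Response.status(Response.Status.NOT_FOUND).build()
           : Response.ok(policy).build();
  }

  @DELETE
  @Path("/{dataSource}")
  public Response deletePolicy(@PathParam("dataSource") String dataSource)
  {
    return policies.remove(dataSource) == null
           ? Response.status(Response.Status.NOT_FOUND).build()
           : Response.ok().build();
  }
}
```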

Rationale

The main benefit of the API is that it abstracts away the complex inner workings of load, drop, and kill rules. It provides a declarative interface for reasoning about retention, similar to what many other systems offer.

Operational impact

Since this is an API-only change that leverages the existing load/drop rule functionality, nothing needs to be deprecated in the short term. If the new API eventually becomes equally powerful, it may make sense to consider deprecating the rules API at that point.

Future work

In environments with multiple hot tiers, users must manually enumerate the tiers in tieredReplicants if they use load rules. We can extend the storage policy API to include all tiers by default when tieredReplicants is not supplied, as sketched below.
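A rough sketch of that defaulting behavior, assuming the set of tier names is obtained from the Coordinator's view of the cluster (the class and method names here are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class TieredReplicantsDefaults
{
  // Builds a default tieredReplicants map covering every known historical tier,
  // so users do not have to enumerate tiers by hand. The tier names would come
  // from the Coordinator's view of the cluster; here they are passed in directly.
  static Map<String, Integer> defaultReplicants(Set<String> knownTiers)
  {
    Map<String, Integer> replicants = new LinkedHashMap<>();
    for (String tier : knownTiers) {
      replicants.put(tier, 1); // one replica per tier by default
    }
    return replicants;
  }
}
```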

gianm commented 1 year ago

Some initial thoughts:

From the "operational impact" section, it sounds like there is no storage as part of this proposal, just syntax sugar on load/drop rules. Is this correct? If so:

Additionally, did you consider options where the storage policy is a real object that is stored, perhaps in the new (& currently not-really-used) catalog? In that case the API would be through CatalogResource. Curious what you see as the pros and cons of these two approaches.

vogievetsky commented 1 year ago

Overall I really like the design.

Some thoughts and questions: