jkosh44 opened 1 month ago
Here are the logs from the staging env that took the longest to start up, at 38 seconds (old format): staging_logs.txt
Each set of logs appears twice because the first set is from read-only mode and the second set is from read-write mode.
The prod sandbox started up in 6 seconds, so probably not as interesting as the attached staging logs.
The following hacky python snippet will sort the logs by time (you may have to play around with the `sep` value and the arg to `main`):
```Python
import re
from typing import Optional
from dateutil.parser import parse
order = [
r'startup: envd init: beginning',
r'startup: envd init: preamble beginning',
r'startup: envd init: preamble complete in',
r'startup: envd init: serving beginning',
r'startup: envd serve: beginning',
r'startup: envd serve: preamble beginning',
r'startup: envd serve: preamble complete in',
r'startup: envd serve: catalog init beginning',
r'startup: envd serve: catalog init complete in',
r'startup: envd serve: system parameter sync beginning',
r'startup: envd serve: system parameter sync complete in',
r'startup: envd serve: preflight checks beginning',
r'startup: envd serve: preflight checks complete in',
r'startup: envd serve: durable catalog open beginning',
r'startup: envd serve: durable catalog open complete in',
r'startup: envd serve: coordinator init beginning',
r'startup: coordinator init: beginning',
r'startup: coordinator init: preamble beginning',
r'startup: coordinator init: preamble complete in',
r'startup: coordinator init: timestamp oracle init beginning',
r'startup: coordinator init: timestamp oracle init complete in',
r'startup: coordinator init: catalog open beginning',
r'startup: coordinator init: catalog open complete in',
r'startup: coordinator init: coordinator thread start beginning',
r'startup: controller init: beginning',
r'startup: controller init: complete in',
r'startup: coordinator init: bootstrap beginning',
r'startup: coordinator init: bootstrap: preamble beginning',
r'startup: coordinator init: bootstrap: preamble complete in',
r'startup: coordinator init: bootstrap: storage collections init beginning',
r'startup: coordinator init: bootstrap: storage collections init complete in',
r'startup: coordinator init: bootstrap: optimize dataflow plans beginning',
r'startup: coordinator init: bootstrap: optimize dataflow plans complete in',
r'startup: coordinator init: bootstrap: dataflow as-of bootstrapping beginning',
r'startup: coordinator init: bootstrap: dataflow as-of bootstrapping complete in',
r'startup: coordinator init: bootstrap: postamble beginning',
r'startup: coordinator init: bootstrap: postamble complete in',
r'startup: coordinator init: bootstrap: generate builtin updates beginning',
r'startup: coordinator init: bootstrap: generate builtin updates complete in',
r'startup: coordinator init: bootstrap: generate secret cleanup beginning',
r'startup: coordinator init: bootstrap: generate secret cleanup complete in',
r'startup: coordinator init: bootstrap: concurrently update builtin tables and cleanup secrets beginning',
r'startup: coordinator init: bootstrap: concurrently update builtin tables and cleanup secrets complete in',
r'startup: coordinator init: bootstrap complete in',
r'startup: coord serve: storage usage prune beginning',
r'startup: coord serve: storage usage prune complete in',
r'startup: coordinator init: coordinator thread start complete in',
r'startup: coordinator init: complete in',
r'startup: envd serve: coordinator init complete in',
r'startup: envd serve: postamble beginning',
r'startup: envd serve: postamble complete in',
r'startup: envd serve: complete in',
r'startup: envd init: serving complete in',
r'startup: envd init: complete in',
]
def main(filename: str):
    logs = sort_logs(filename)
    logs = '\n'.join(logs)
    with open(filename, "w") as file:
        file.write(logs)

def sort_logs(filename: str):
    with open(filename, "r") as file:
        lines = [
            line.strip()
            for line in file
            if re.match(r'\d+-\d+-\d+ \d+:\d+:\d+.\d+', line) and log_idx(line) is not None
        ]
    lines.sort(key=sort_log)
    return lines

def sort_log(line: str):
    sep = "\t: "
    split_line = line.split(sep, 1)
    d = parse(split_line[0])
    idx = log_idx(line)
    return (d, idx)

def log_idx(line: str) -> Optional[int]:
    for idx, template in enumerate(order):
        pattern = re.compile(template)
        if pattern.search(line):
            return idx
    return None

if __name__ == "__main__":
    main("/home/joe/Downloads/us-prod-startup-0896c.txt")
```
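To make the sort key concrete, here is a tiny self-contained sketch of the `(timestamp, template-index)` ordering the script uses: lines sort primarily by wall-clock time, with ties broken by each message's position in `order`. The log lines below are synthetic examples, not taken from the attached staging_logs.txt.

```python
from datetime import datetime

# A trimmed-down stand-in for the full `order` list above.
order = [
    "startup: envd init: beginning",
    "startup: envd init: preamble beginning",
    "startup: envd init: preamble complete in",
]

def key(line: str):
    ts, msg = line.split("\t: ", 1)
    # Ties at the same timestamp are broken by position in `order`.
    idx = next(i for i, t in enumerate(order) if t in msg)
    return (datetime.fromisoformat(ts), idx)

lines = [
    "2024-07-01 00:00:02.500\t: startup: envd init: preamble complete in 1.5s",
    "2024-07-01 00:00:01.000\t: startup: envd init: preamble beginning",
    "2024-07-01 00:00:01.000\t: startup: envd init: beginning",
]
for line in sorted(lines, key=key):
    print(line.split("\t: ", 1)[1])
```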
The startup process is broken down into the following categories, each with its own timings:
- `mz_environmentd::run()`
  - `mz_environmentd::Listeners::serve()`
    - `mz_adapter::coord::serve()`
      - `mz_adapter::catalog::open::Catalog::initialize_controller()`
      - `mz_adapter::coord::Coordinator::bootstrap()`
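The paired `beginning` / `complete in` log templates suggest a nested timing span around each of these phases. A rough Python analogy of that pattern (purely illustrative; the actual instrumentation is Rust code in the PR noted below):

```python
import time
from contextlib import contextmanager

@contextmanager
def startup_span(name: str):
    # Log a "beginning" line, run the body, then log "complete in" with the
    # elapsed time, mirroring the paired log templates above.
    print(f"startup: {name} beginning")
    start = time.monotonic()
    yield
    print(f"startup: {name} complete in {time.monotonic() - start:.3f}s")

with startup_span("envd init"):
    with startup_span("envd init: serving"):
        pass
```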
Note: these exact sections were introduced in https://github.com/MaterializeInc/materialize/pull/28978
Attached are the startup timings in prod for the customer with the largest catalog (old format).
There are two major differences from staging due to 0dt being enabled in staging but not prod:
`mz_storage_usage_by_shard` is currently a builtin table that is backed by the storage usage collection in the durable catalog. The storage usage collection takes up 99% of the space in the durable catalog. The `mz_storage_usage_by_shard` table has the following columns:

- `id uint8`
- `shard_id text`
- `size_bytes uint8`
- `collection_timestamp timestamptz`
In order to have correct billing logic, we need the following guarantees:

- `collection_timestamp` should be close to the real time that the row was written to the table.
- `collection_timestamp` should be less than or equal to the logical timestamp at which the row was written, i.e. if I run `SELECT * FROM mz_storage_usage_by_shard AS OF t`, then I should see all rows that will ever exist with a `collection_timestamp <= t`.
- `collection_timestamp` should be monotonically increasing.

The way we currently maintain these guarantees is by first generating the `collection_timestamp` via the timestamp oracle, and then generating the logical write timestamp via the timestamp oracle.
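The ordering argument can be sketched with a toy oracle (illustrative only; Materialize's real timestamp oracle is distributed and more involved): because both timestamps come from the same monotonic oracle and `collection_timestamp` is requested first, the second guarantee holds by construction.

```python
import itertools

class ToyTimestampOracle:
    # Toy stand-in for the timestamp oracle: hands out strictly increasing timestamps.
    def __init__(self):
        self._counter = itertools.count(1)

    def read_ts(self) -> int:
        return next(self._counter)

oracle = ToyTimestampOracle()
collection_ts = oracle.read_ts()  # stored in the collection_timestamp column
write_ts = oracle.read_ts()       # logical timestamp of the table write

# A read AS OF write_ts therefore sees this row, and its
# collection_timestamp is <= write_ts.
assert collection_ts <= write_ts
```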
In order to speed up startup times, we'd like to remove the storage usage collection from the durable catalog. Here are some proposals, in order of my preference:

1. Keep `mz_storage_usage_by_shard` as a builtin table, but prevent anyone from ever migrating it and don't truncate it on startup. This seems like the simplest approach; however, we tried to do the same thing with the query activity log and ultimately decided to make it a builtin source.
2. Convert `mz_storage_usage_by_shard` to an append-only builtin source. We'll have to prevent anyone from ever migrating this source to avoid accidentally truncating it. Additionally, we can update the background builtin source task so that it can accept a closure of the form `Fn(Timestamp) -> Vec<(Row, Diff)>` instead of literal writes. The source would pass the logical timestamp of a write into the closure to generate the rows for that write. The caller would pass in a closure that sets the `collection_timestamp` column equal to the logical write timestamp.
3. Remove the `collection_timestamp` column from the table and increase the retention window to 31 days. If anyone wants to know the collection timestamp, they can use `mz_now()`.
4. Separate the persist catalog shard into two shards, one for `storage_usage` and one for everything else. We'll use txn-wal to read/write from/to the shards.

For all approaches we'll have a background task that occasionally removes entries older than 30 days.
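The closure idea in proposal 2 can be sketched in Python (the names and types here are hypothetical illustrations of the `Fn(Timestamp) -> Vec<(Row, Diff)>` shape, not the actual task's API):

```python
from typing import Callable, List, Tuple

# Hypothetical flattened row: (id, shard_id, size_bytes, collection_timestamp).
Row = Tuple[int, str, int, int]
Diff = int
RowGen = Callable[[int], List[Tuple[Row, Diff]]]

def append_at(logical_ts: int, gen: RowGen) -> List[Tuple[Row, Diff]]:
    # The background builtin source task hands the logical write timestamp to
    # the caller's closure instead of accepting literal rows.
    return gen(logical_ts)

# The caller's closure sets collection_timestamp equal to the logical write timestamp.
updates = append_at(42, lambda ts: [((1, "s1234", 1024, ts), 1)])
assert updates == [((1, "s1234", 1024, 42), 1)]
```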
> Keep `mz_storage_usage_by_shard` as a builtin table, but prevent anyone from ever migrating it and don't truncate it on startup. This seems like the simplest approach; however, we tried to do the same thing with the query activity log and ultimately decided to make it a builtin source.

This is my preference as well, especially since our forensic analysis has indicated that making the query history a builtin source was likely simply a result of not realizing that we could have builtin tables that didn't need to be truncated on startup.
> Keep `mz_storage_usage_by_shard` as a builtin table, but prevent anyone from ever migrating it and don't truncate it on startup. This seems like the simplest approach; however, we tried to do the same thing with the query activity log and ultimately decided to make it a builtin source.

This would also be my first preference, yes!
I collected some parsing, planning, and optimization timings from my staging account. I created 1000 tables of the form `CREATE TABLE t{i} (a INT);`, 1000 views of the form `CREATE VIEW v{i} AS SELECT SUM(a) FROM t{i}`, and 1000 indexes of the form `CREATE DEFAULT INDEX ON v{i}`. The total startup time is 17 seconds. Of those 17 seconds, we spend the following amount of time on the following activities:
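The test catalog above can be generated with a loop along these lines (a sketch; I'm assuming the statements were simply issued one at a time):

```python
stmts = []
for i in range(1000):
    # One table, one view over it, and one default index on the view.
    stmts.append(f"CREATE TABLE t{i} (a INT);")
    stmts.append(f"CREATE VIEW v{i} AS SELECT SUM(a) FROM t{i};")
    stmts.append(f"CREATE DEFAULT INDEX ON v{i};")

# 3000 DDL statements total: 1000 tables, 1000 views, 1000 indexes.
print(len(stmts))  # prints 3000
```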
So what is most of the time spent on? Your activities only seem to sum up to ~5 s of the 17 s total.
> So what is most of the time spent on? Your activities only seem to sum up to ~5 s of the 17 s total.
I'm having trouble with logs getting dropped in my staging account, so I may have missed a couple of events, but here are the other large activities:
Feature request
During an upgrade, all time spent on startup contributes directly to downtime. We need to reduce startup times to reduce downtime.