MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com

Can no longer run `ex/chbench` #1589

Closed sploiselle closed 4 years ago

sploiselle commented 4 years ago

Using commit 119833afc92b1b609b9e9616c2f4e920eafb54e7:

cd ex/chbench
./dc.sh up :init:
./dc.sh up :demo:
./dc.sh demo-load

My Docker environment often hangs when running ./dc.sh demo-load

If I make it past this step, I then run:

docker-compose run cli
    CREATE VIEW q07 AS
    SELECT
        su_nationkey AS supp_nation,
        substr(c_state, 1, 1) AS cust_nation,
        extract('year' FROM o_entry_d) AS l_year,
        sum(ol_amount) AS revenue
    FROM
        mysql_tpcch_supplier,
        mysql_tpcch_stock,
        mysql_tpcch_orderline,
        mysql_tpcch_order,
        mysql_tpcch_customer,
        mysql_tpcch_nation AS n1,
        mysql_tpcch_nation AS n2
    WHERE
        ol_supply_w_id = s_w_id
        AND ol_i_id = s_i_id
        AND s_su_suppkey = su_suppkey
        AND ol_w_id = o_w_id
        AND ol_d_id = o_d_id
        AND ol_o_id = o_id
        AND c_id = o_c_id
        AND c_w_id = o_w_id
        AND c_d_id = o_d_id
        AND su_nationkey = n1.n_nationkey
        AND c_n_nationkey = n2.n_nationkey
    GROUP BY
        su_nationkey,
        substr(c_state, 1, 1),
        extract('year' FROM o_entry_d)
    ORDER BY
        su_nationkey, cust_nation, l_year;

If I wait 5 minutes after this, either my Docker environment or materialized crashes.

The only logs I've been able to gather were from materialized, which ended with:

...
materialized_1                         | [2020-01-16T21:07:01.104531Z  INFO catalog/lib:366] new index materialize.public.mz_perf_peek_durations_aggregates_primary_idx (u22)
materialized_1                         | Massive deficit OrdVal::work: -1965142
materialized_1                         | Massive deficit OrdVal::work: -1965142
materialized_1                         | Surprising, but...
materialized_1                         | [2020-01-16T21:07:50.141158Z  WARN materialized/lib:126] unknown protocol connection!
JLDLaughlin commented 4 years ago

@benesch

benesch commented 4 years ago

OK, the initial results are in. I captured this output from top on my Linux VM with 4GB RAM and 1GB swap:

Mem total:4041044 anon:3784716 map:3988 free:87516
 slab:69692 buf:524 cache:5648 dirty:24 write:0
Swap total:1048572 free:0
  PID   VSZ^VSZRW^  RSS (SHR) DIRTY (SHR) STACK COMMAND
66958 5612m 4673m 2825m     4 2824m     0   132 materialized -w1 --bootstrap-sql=/usr/local/etc/materialized/bootstrap.sql
67994 6272m 1467m  354m     4  353m     0   136 /etc/alternatives/jre/bin/java -Xms256M -Xmx2G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Dcom.sun.management.jmxremote -Dc
67580 5309m 1281m 93068   100 92624     0   132 java -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Xloggc:/var/log/kafka/kafkaServer-gc.log -verbose:gc -XX:+Pri
67246 4700m  709m 14644   104 14472     0   132 java -Xmx512M -Xms512M -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Xloggc:/var/log/kafka/zookeeper-gc.log -verbose:gc -XX:+P
66669 1696m  565m  294m     4  293m     0   132 mysqld

Notice that materialized is using about 3GB of memory and the VM is completely maxed out on both physical memory and swap, which causes everything to grind to a halt.

AFAICT no actual queries were run against materialized. This is literally all materialized booting and installing the logging views in bootstrap.sql.

If I bump the VM up to 16GB of memory, materialized grinds for what seems like about 5 minutes, working its way up to about 8GB RSS. During this time I was unable to PEEK/SELECT any of the logging views; afterwards everything seems OK and we hover at about 500MB RSS.

During the overloaded period I caught this output in the logs

materialized_1                         | Massive deficit OrdVal::work: -1965142
materialized_1                         | Massive deficit OrdVal::work: -1965142

which seems to be logged by differential whenever there is a massive amount of work to do. I suspect we are planning something in bootstrap.sql very, very poorly.

cuongdo commented 4 years ago

Does removing the --bootstrap-sql parameter from docker-compose.yml result in decreased memory use?

benesch commented 4 years ago

I can answer your question backwards: running $ cargo run --release -- --bootstrap-sql misc/dist/etc/materialized/bootstrap.sql on my Mac results in 8GB of memory use on boot, whereas not specifying --bootstrap-sql results in 5MB (!!!!) of memory use.
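
For clarity, the two invocations being compared (memory numbers as reported above):

    # with the bootstrap file: ~8GB of memory on boot
    cargo run --release -- --bootstrap-sql misc/dist/etc/materialized/bootstrap.sql

    # without it: ~5MB on boot
    cargo run --release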

Paging @frankmcsherry. Might this be related to the new JOIN code?

benesch commented 4 years ago

Weirdly enough, this only happens on boot. It repros instantly if you run rm -rf mzdata/catalog && cargo run --release -- --bootstrap-sql misc/dist/etc/materialized/bootstrap.sql. But if you then leave that catalog in place and run cargo run --release -- --bootstrap-sql misc/dist/etc/materialized/bootstrap.sql, so that the same views are installed via catalog persistence, there is no runaway memory usage whatsoever. Something strange is afoot, but I'm also well past the point where I can efficiently debug. @frankmcsherry, can I pass the baton to you?
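
Restating the two scenarios as commands (taken verbatim from the above):

    # fresh catalog: runaway memory on boot
    rm -rf mzdata/catalog && cargo run --release -- --bootstrap-sql misc/dist/etc/materialized/bootstrap.sql

    # existing catalog, same views reinstalled via catalog persistence: no runaway memory
    cargo run --release -- --bootstrap-sql misc/dist/etc/materialized/bootstrap.sql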

Also, the memory usage doesn't seem to be attributable to any one view in bootstrap.sql in particular. Removing any one of the views saves anywhere from 500MB to 1-2GB in the fresh-boot scenario.

In the meantime, the workaround is to just ditch bootstrap.sql.

frankmcsherry commented 4 years ago

Pretty neat! I'll take a peek, but there was a recent change (ping @wangandi) with an associated theoretical performance regression (lifting filters when doing so could enable arrangement reuse) that could be in play. The differential warning gets printed when there is a surprisingly large amount of work for a single key, which suggests cross-join-like behavior (in compaction, though; it means a massive key caused compaction to have a big latency spike).
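
To illustrate the trade-off behind that kind of change (a generic sketch using hypothetical relations a and b, not the actual rule or plan at issue here): for a query shaped like the one below, applying the filter before the join keeps the join input small but requires building a new arrangement of the filtered collection, while lifting the filter above the join lets the join reuse an existing arrangement of a, at the cost of pushing all of a's records (and their history) through the join.

    -- generic sketch; a and b are hypothetical relations
    SELECT a.id, a.worker, b.name
    FROM a, b
    WHERE a.id = b.id
      AND a.worker = b.worker
      AND a.slot = 0;
    -- the planner chooses whether the slot = 0 filter runs below the join
    -- (smaller join input, but a new arrangement must be built) or above it
    -- (an existing arrangement of a can be reused, but the join sees all of a)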

I'll walk through the views and EXPLAIN each of them to check for insanity. The "works fine after restart" thing is very weird; I'm not sure what that's about (what's different?).

frankmcsherry commented 4 years ago

    Finished release [optimized + debuginfo] target(s) in 12m 55s
frankmcsherry commented 4 years ago

Manually loading the views from the bootstrap file increased the memory footprint but did not cause any runaway behavior. The most surprising one was:

CREATE VIEW mz_perf_dependency_frontiers AS
SELECT DISTINCT
     coalesce(mcn.name, view_deps.dataflow) as dataflow,
     coalesce(mcn_source.name, frontier_source.global_id) as source,
     frontier_source.time - frontier_df.time as lag_ms
FROM
     mz_materialization_dependencies view_deps
JOIN mz_view_frontiers frontier_source ON view_deps.source = frontier_source.global_id
JOIN mz_view_frontiers frontier_df ON view_deps.dataflow = frontier_df.global_id
LEFT JOIN mz_catalog_names mcn ON mcn.global_id = view_deps.dataflow
LEFT JOIN mz_catalog_names mcn_source ON mcn_source.global_id = frontier_source.global_id;

which caused a 1GB spike. It then settled back down. Similarly,

-- operator operator is due to issue #1217
CREATE VIEW mz_perf_arrangement_records AS
 SELECT mas.worker, name, records, operator operator
 FROM mz_arrangement_sizes mas
 JOIN mz_dataflow_operators mdo ON mdo.id = mas.operator;

spiked up almost a GB and then settled again.

frankmcsherry commented 4 years ago

This might have interesting information:

mcsherry=> peek mz_records_per_dataflow_global;
  id  |                              name                              | records 
------+----------------------------------------------------------------+---------
  167 | Dataflow: materialize.public.mz_addresses_with_unit_length     |    2452
  251 | Dataflow: materialize.public.mz_dataflow_names                 | 1706384
  395 | Dataflow: materialize.public.mz_dataflow_operator_dataflows    | 1706936
  541 | Dataflow: materialize.public.mz_records_per_dataflow_operator  |    1827
  651 | Dataflow: materialize.public.mz_records_per_dataflow           |     521
  777 | Dataflow: materialize.public.mz_records_per_dataflow_global    |     384
  861 | Dataflow: materialize.public.mz_perf_dependency_frontiers      |     833
 1112 | Dataflow: materialize.public.mz_perf_peek_durations_core       |      18
 1215 | Dataflow: materialize.public.mz_perf_peek_durations_bucket     |      37
 1341 | Dataflow: materialize.public.mz_perf_peek_durations_aggregates |       9
(10 rows)

It looks like some of these views are just pretty massive.

My best guess (which I'll try to substantiate) is that this is not exactly a cross join, but something similar: during start-up we flow forward all of the current state (including a one-minute history), which spikes memory use while we do it, but eventually settles back down once everything has moved through the system. Logging happens at a very high rate, and several of the things happening churn arrangement state, which feeds into these views, which churn yet more arrangement state, and so on.
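
As a back-of-envelope way to see why the spike is transient (assuming, hypothetically, a logging rate of N updates per second feeding a view, plus the one-minute retained history mentioned above):

    startup backlog ≈ 60 × N updates,
    all of which must flow through the new dataflow before compaction
    brings its arrangements back down to their steady-state size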

This doesn't explain why the recovered version doesn't have the problem, unless perhaps it gets installed so quickly that there is no history to speak of.

frankmcsherry commented 4 years ago

Let's look at the first big one here:

CREATE VIEW mz_dataflow_names AS
SELECT
    mz_dataflow_operator_addresses.id,
    mz_dataflow_operator_addresses.worker,
    mz_dataflow_operator_addresses.value as local_id,
    mz_dataflow_operators.name
FROM
    mz_dataflow_operator_addresses,
    mz_dataflow_operators,
    mz_addresses_with_unit_length
WHERE
    mz_dataflow_operator_addresses.id = mz_dataflow_operators.id AND
    mz_dataflow_operator_addresses.worker = mz_dataflow_operators.worker AND
    mz_dataflow_operator_addresses.id = mz_addresses_with_unit_length.id AND
    mz_dataflow_operator_addresses.worker = mz_addresses_with_unit_length.worker AND
    mz_dataflow_operator_addresses.slot = 0;

If we explain the plan, we get:

                                 Plan                                 
----------------------------------------------------------------------
 Project {                                                           +
   outputs: [0, 1, 3, 8],                                            +
   Join {                                                            +
     variables: [[(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)]],+
     Filter {                                                        +
       predicates: [#2 = 0],                                         +
       Get { mz_catalog.mz_dataflow_operator_addresses (s5) }        +
     },                                                              +
     ArrangeBy {                                                     +
       keys: [[#0, #1]],                                             +
       Get { materialize.public.mz_addresses_with_unit_length (u1) } +
     },                                                              +
     ArrangeBy {                                                     +
       keys: [[#0, #1]],                                             +
       Get { mz_catalog.mz_dataflow_operators (s1) }                 +
     }                                                               +
   }                                                                 +
 }
(1 row)

So this is a 3-way join, with three relations whose current sizes (at the same time as the 1.7M record query) are:

mcsherry=> select count(*) from mz_catalog.mz_dataflow_operator_addresses where mz_dataflow_operator_addresses.slot = 0;
 count 
-------
  1205
(1 row)

mcsherry=> select count(*) from materialize.public.mz_addresses_with_unit_length;
 count 
-------
    28
(1 row)

mcsherry=> select count(*) from mz_catalog.mz_dataflow_operators;
 count 
-------
   574
(1 row)

Now, these numbers are the current counts rather than the history sizes, and presumably there is some history there. Still, my CPU is at about 3% load, so it's not as if the logging views are churning away. It does seem like a disproportionately large number of records backing this particular dataflow. The join plan does not read as a cross join, but perhaps there is a glitch in rendering (the new delta-query work hasn't landed yet, so it's not that).
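
For scale, a back-of-envelope check using the current counts only (ignoring history, and assuming (id, worker) identifies rows in the two smaller inputs): even a full cross product of the three inputs would be

    1205 × 28 × 574 ≈ 19.4M rows

while the key-constrained join should produce at most on the order of the 1,205 filtered address rows, so 1.7M records backing this dataflow is far more than the join's output alone would explain.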

Investigation ongoing...

frankmcsherry commented 4 years ago

Gah... added a println to rendering, and

    Finished release [optimized + debuginfo] target(s) in 7m 05s
frankmcsherry commented 4 years ago

Looks like we were rendering un-optimized dataflows. The catalog seems to have stashed the optimized dataflow, though, so on recovery we rendered the optimized one.