@benesch
Ok, the initial results are in. I captured this output from top
on my Linux VM w/ 4GB ram and 1GB swap:
Mem total:4041044 anon:3784716 map:3988 free:87516
slab:69692 buf:524 cache:5648 dirty:24 write:0
Swap total:1048572 free:0
PID VSZ^VSZRW^ RSS (SHR) DIRTY (SHR) STACK COMMAND
66958 5612m 4673m 2825m 4 2824m 0 132 materialized -w1 --bootstrap-sql=/usr/local/etc/materialized/bootstrap.sql
67994 6272m 1467m 354m 4 353m 0 136 /etc/alternatives/jre/bin/java -Xms256M -Xmx2G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Dcom.sun.management.jmxremote -Dc
67580 5309m 1281m 93068 100 92624 0 132 java -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Xloggc:/var/log/kafka/kafkaServer-gc.log -verbose:gc -XX:+Pri
67246 4700m 709m 14644 104 14472 0 132 java -Xmx512M -Xms512M -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Xloggc:/var/log/kafka/zookeeper-gc.log -verbose:gc -XX:+P
66669 1696m 565m 294m 4 293m 0 132 mysqld
Notice how materialized is using ~3 GB of memory and the VM is totally maxed out on physical memory and swap. This causes everything to grind to a complete halt.
AFAICT no actual queries were run against materialized. This is literally all materialized booting and installing the logging views in bootstrap.sql.
If I bump the VM up to 16GB of memory, materialized grinds for what seemed like about five minutes, working its way up to about 8GB RSS. During this time I was unable to PEEK/SELECT any of the logging views; afterwards everything seems OK and we're hovering at about 500MB RSS.
During the overloaded period I caught this output in the logs:
materialized_1 | Massive deficit OrdVal::work: -1965142
materialized_1 | Massive deficit OrdVal::work: -1965142
which seems to be logged by differential whenever there is a massive amount of work to do. I suspect we are planning something in bootstrap.sql very, very poorly.
Does removing the --bootstrap-sql parameter from docker-compose.yml result in decreased memory use?
I can answer your question backwards: cargo run --release -- --bootstrap-sql misc/dist/etc/materialized/bootstrap.sql on my Mac results in 8GB of memory use on boot, whereas not specifying --bootstrap-sql results in 5MB (!!!!) of memory use.
Paging @frankmcsherry. Might this be related to the new JOIN code?
Weirdly enough, this only happens on boot. Repros instantly if you run rm -rf mzdata/catalog && cargo run --release -- --bootstrap-sql misc/dist/etc/materialized/bootstrap.sql. But then if you leave that catalog in place and run cargo run --release -- --bootstrap-sql misc/dist/etc/materialized/bootstrap.sql, where the same views are installed via catalog persistence, there is no runaway memory usage whatsoever. Something strange is afoot, but I'm also well past the point where I can efficiently debug. @frankmcsherry can I pass the baton to you?
Also, the memory usage doesn't seem to be related to any particular view in bootstrap.sql. Removing any one of the views saves you anywhere from 500MB to 1-2GB in the fresh-boot scenario.
In the meantime, the workaround is to just ditch bootstrap.sql.
Pretty neat! I'll take a peek, but there was a recent change (ping @wangandi) that has an associated theoretical performance regression (lifting filters when doing so could enable arrangement use) that could be in play. The differential warning gets printed when there is a surprisingly large amount of work for a single key, which suggests cross-join-like behavior (in compaction, though; it means a massive key caused compaction to have a big latency spike).
I'll walk through the views and EXPLAIN each of them to check for insanity. The "works fine after restart" thing is very weird. I'm not sure I get what that is about... (what's different?).
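For concreteness, the check per view would be something along these lines (the exact EXPLAIN spelling here is an assumption and may differ in this build):

-- the EXPLAIN variant is an assumption; adjust to whatever the current build accepts
EXPLAIN PLAN FOR VIEW mz_perf_dependency_frontiers;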
Finished release [optimized + debuginfo] target(s) in 12m 55s
Manually loading the views in the bootstrap increased the memory footprint but did not cause runaway anything. The most surprising one was:
CREATE VIEW mz_perf_dependency_frontiers AS
SELECT DISTINCT
coalesce(mcn.name, view_deps.dataflow) as dataflow,
coalesce(mcn_source.name, frontier_source.global_id) as source,
frontier_source.time - frontier_df.time as lag_ms
FROM
mz_materialization_dependencies view_deps
JOIN mz_view_frontiers frontier_source ON view_deps.source = frontier_source.global_id
JOIN mz_view_frontiers frontier_df ON view_deps.dataflow = frontier_df.global_id
LEFT JOIN mz_catalog_names mcn ON mcn.global_id = view_deps.dataflow
LEFT JOIN mz_catalog_names mcn_source ON mcn_source.global_id = frontier_source.global_id;
which caused a 1GB spike. It then settled back down. Similarly,
-- operator operator is due to issue #1217
CREATE VIEW mz_perf_arrangement_records AS
SELECT mas.worker, name, records, operator operator
FROM mz_arrangement_sizes mas
JOIN mz_dataflow_operators mdo ON mdo.id = mas.operator;
spiked up almost a GB and then settled again.
This might have interesting information:
mcsherry=> peek mz_records_per_dataflow_global;
id | name | records
------+----------------------------------------------------------------+---------
167 | Dataflow: materialize.public.mz_addresses_with_unit_length | 2452
251 | Dataflow: materialize.public.mz_dataflow_names | 1706384
395 | Dataflow: materialize.public.mz_dataflow_operator_dataflows | 1706936
541 | Dataflow: materialize.public.mz_records_per_dataflow_operator | 1827
651 | Dataflow: materialize.public.mz_records_per_dataflow | 521
777 | Dataflow: materialize.public.mz_records_per_dataflow_global | 384
861 | Dataflow: materialize.public.mz_perf_dependency_frontiers | 833
1112 | Dataflow: materialize.public.mz_perf_peek_durations_core | 18
1215 | Dataflow: materialize.public.mz_perf_peek_durations_bucket | 37
1341 | Dataflow: materialize.public.mz_perf_peek_durations_aggregates | 9
(10 rows)
It looks like some of these views are just pretty massive.
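To see which operators inside one of these big dataflows actually hold the records, the per-operator breakdown would be the natural next drill-down, mirroring the peek above (a sketch; the output columns of mz_records_per_dataflow_operator are whatever bootstrap.sql defines for it):

peek mz_records_per_dataflow_operator;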
My best guess (will try to substantiate) is that this is not exactly cross joins, but something similar, where during start-up we flow forward all of the current state (including a 1-minute history), which spikes memory use as we do, but which eventually settles back down once we have moved it all through the system. Logging happens to work at a very high rate, and several of the things happening churn arrangement state, which feeds into these views, which further churn arrangement state, ... right.
This doesn't explain why the recovered version doesn't have the problem, unless perhaps it gets installed so fast that there is no history to speak of, maybe?
Let's look at the first big one here:
CREATE VIEW mz_dataflow_names AS
SELECT
mz_dataflow_operator_addresses.id,
mz_dataflow_operator_addresses.worker,
mz_dataflow_operator_addresses.value as local_id,
mz_dataflow_operators.name
FROM
mz_dataflow_operator_addresses,
mz_dataflow_operators,
mz_addresses_with_unit_length
WHERE
mz_dataflow_operator_addresses.id = mz_dataflow_operators.id AND
mz_dataflow_operator_addresses.worker = mz_dataflow_operators.worker AND
mz_dataflow_operator_addresses.id = mz_addresses_with_unit_length.id AND
mz_dataflow_operator_addresses.worker = mz_addresses_with_unit_length.worker AND
mz_dataflow_operator_addresses.slot = 0;
If we explain the plan, we get:
Plan
----------------------------------------------------------------------
Project { +
outputs: [0, 1, 3, 8], +
Join { +
variables: [[(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)]],+
Filter { +
predicates: [#2 = 0], +
Get { mz_catalog.mz_dataflow_operator_addresses (s5) } +
}, +
ArrangeBy { +
keys: [[#0, #1]], +
Get { materialize.public.mz_addresses_with_unit_length (u1) } +
}, +
ArrangeBy { +
keys: [[#0, #1]], +
Get { mz_catalog.mz_dataflow_operators (s1) } +
} +
} +
}
(1 row)
So this is a 3-way join, with three relations whose current sizes (at the same time as the 1.7M record query) are:
mcsherry=> select count(*) from mz_catalog.mz_dataflow_operator_addresses where mz_dataflow_operator_addresses.slot = 0;
count
-------
1205
(1 row)
mcsherry=> select count(*) from materialize.public.mz_addresses_with_unit_length;
count
-------
28
(1 row)
mcsherry=> select count(*) from mz_catalog.mz_dataflow_operators;
count
-------
574
(1 row)
Now, these numbers are the current counts, rather than the history sizes, and presumably there is some history there. Though my CPU is at about 3% load, so it's not like the logging views are churning away. But it does seem like a disproportionately large number of records backing this particular dataflow. The join plan does not read as a cross-join, but perhaps there is a glitch in rendering (the new delta-query rendering hasn't landed yet, so it's not that).
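As a rough bound: a full cross product of the three filtered inputs would be 1205 * 28 * 574, roughly 19.4M rows, so the 1.7M backing records are well short of an outright cross join, yet far larger than any single input. The bound can be computed directly (a sketch, assuming scalar subqueries work in this build; it only touches the catalogs already queried above):

-- rough upper bound: size of the full cross product of the three join inputs
SELECT
    (SELECT count(*) FROM mz_catalog.mz_dataflow_operator_addresses WHERE slot = 0)
  * (SELECT count(*) FROM materialize.public.mz_addresses_with_unit_length)
  * (SELECT count(*) FROM mz_catalog.mz_dataflow_operators) AS cross_product_bound;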
Investigation ongoing...
Gah... added a println to rendering, and
Finished release [optimized + debuginfo] target(s) in 7m 05s
Looks like we were rendering un-optimized dataflows. The catalog seems to have stashed the optimized dataflow though, so on recovery we rendered the optimized dataflow.
Using commit 119833afc92b1b609b9e9616c2f4e920eafb54e7:

My Docker environment often hangs when running ./dc.sh demo-load. If I make it past this step, I then run:

If I wait 5 minutes after this, my Docker environment crashes or materialized crashes. The only logs I've been able to gather were from materialized, which ended with: