Follow the Planetiler architecture (which has the best performance today for huge inputs, right?): loop over each feature, loop over each zoom level, and write encoded per-tile intermediate output somewhere. The sorting, feature dropping, compression, and final writing then happen in a later pass.
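A minimal sketch of that feature-first, two-pass flow (all names hypothetical; the tile math is simplified to a point-in-tile lookup on an equirectangular grid, not real Web Mercator, and the "encoding" is just the feature id):

```python
from collections import defaultdict

def tile_for(lon, lat, z):
    """Map a lon/lat point to a (z, x, y) tile id (naive grid, not Web Mercator)."""
    n = 2 ** z
    x = int((lon + 180.0) / 360.0 * n)
    y = int((90.0 - lat) / 180.0 * n)
    return (z, min(x, n - 1), min(y, n - 1))

def first_pass(features, max_zoom):
    """Pass 1: loop over features, then zooms; append encoded features to per-tile buckets.
    The dict stands in for on-disk intermediate output."""
    buckets = defaultdict(list)
    for feat in features:
        for z in range(max_zoom + 1):
            buckets[tile_for(feat["lon"], feat["lat"], z)].append(feat["id"])
    return buckets

def second_pass(buckets):
    """Pass 2: visit tiles in archive order and finalize each one.
    Sorting stands in for the dropping/compression/writing step."""
    return {tid: sorted(buckets[tid]) for tid in sorted(buckets)}
```

The point of the split is that pass 1 touches each input feature once, and pass 2 is a sequential sweep over already-bucketed data.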
Currently this iterates over tiles (in parallel), uses an rtree to find all features intersecting each tile, then builds the tile from those hits.
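For contrast, the current tile-first loop looks roughly like this (a brute-force bbox scan stands in for the rtree query; names are illustrative, and the real tile loop runs in parallel):

```python
def features_in_tile(index, tile_bbox):
    """Stand-in for an rtree query: features whose bbox intersects the tile's bbox."""
    tx0, ty0, tx1, ty1 = tile_bbox
    return [f for bbox, f in index
            if not (bbox[2] < tx0 or bbox[0] > tx1 or
                    bbox[3] < ty0 or bbox[1] > ty1)]

def build_tiles(index, tile_bboxes):
    """Outer loop over tiles (serial here), spatial query per tile, build each tile."""
    return {tid: sorted(f["id"] for f in features_in_tile(index, bbox))
            for tid, bbox in tile_bboxes.items()}
```

The cost profile is the inverse of the feature-first flow: one spatial query per tile, so shared features are looked up once per tile that contains them rather than written once per zoom.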
What about: