coiled / benchmarks


[TPC-H] `nation` and `region` partitioning is nonsensical #1380

Closed: hendrikmakait closed this issue 7 months ago

hendrikmakait commented 7 months ago

The `nation` and `region` datasets contain a single value per partition (25 and 5 values, respectively). While dask-expr and others should ideally be smart enough to detect this and act accordingly, this partitioning is nonsensical and not representative of a real workload.
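A quick way to see this, as a minimal sketch; the S3 prefix below is a placeholder for wherever the scale-1000 benchmark data actually lives:

```python
import dask.dataframe as dd

# Placeholder paths; substitute the real benchmark data location.
nation = dd.read_parquet("s3://coiled-benchmark-data/tpch/scale-1000/nation/")
region = dd.read_parquet("s3://coiled-benchmark-data/tpch/scale-1000/region/")

print(nation.npartitions, region.npartitions)         # 25 and 5 partitions
print(nation.map_partitions(len).compute().tolist())  # one row per partition
```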

milesgranger commented 7 months ago

> this partitioning is nonsensical and not representative of a real workload.

I don't know. I've seen lots of very imbalanced partitioning on raw data layers: some vendors dump small updates into a bucket at hourly increments, which then get combined with other vendors' larger weekly deposits; not so dissimilar to what we see here, IMO.

Do we have any idea how much these tables impact performance? At scale 1000 there are 25 `nation` files, while `lineitem` has 3000. Are the 25 and 5 small files really that degrading to performance? If so, it seems like that is an issue by itself, no?


Edit: I see the perceived performance impact in #1381, but I'd still maintain this isn't such a deviation from real-world workloads. Maybe in a transformed layer this would be weird, but certainly not for raw layers.

hendrikmakait commented 7 months ago

> I don't know. I've seen lots of very imbalanced partitioning on raw data layers: some vendors dump small updates into a bucket at hourly increments, which then get combined with other vendors' larger weekly deposits; not so dissimilar to what we see here, IMO.

I see your point about updates, but these are dimension tables that change slowly, if ever. I see a point for imbalanced data on basically all other tables, but for these two, I consider this a bad practice (and I would've had a very stern talk with that engineer back in my days working in business intelligence).

milesgranger commented 7 months ago

> and I would've had a very stern talk with that engineer back in my days working in business intelligence

Sometimes people don't have the luxury of telling vendors what size their data comes in or when it arrives; that is outside the organization's control, and it's common. The job of the organization's data engineers is to transform these "bad", unoptimized data files/sources into something "good" for the organization to use, with engines like Dask.
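For what it's worth, that cleanup step is usually short; a minimal sketch, with made-up paths and target size:

```python
import dask.dataframe as dd

# Many tiny files dropped by vendors at different cadences (placeholder path).
raw = dd.read_parquet("s3://example-bucket/raw/vendor-drops/")

# Coalesce into partitions of a sensible size and write a curated copy.
curated = raw.repartition(partition_size="128MB")
curated.to_parquet("s3://example-bucket/curated/vendor-data/", write_index=False)
```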

phofl commented 7 months ago

Let’s open an issue saying that we should be able to detect those and then use a more sensible partitioning.

The idea of what we are doing at the moment is to identify potential problems; if a very weird partitioning hides other issues, then we don't gain anything and shoot ourselves in the foot.

hendrikmakait commented 7 months ago

As an aside: The nation and region datasets are not partitioned if you generate the data locally.

I had a quick offline discussion with @milesgranger, and we agree on the overall situation.

Where we disagree is on the question of whether to merge the partitions: I'm advocating for it since the current partitioning is unrealistic for this benchmark and does not match the data generated locally. @milesgranger is advocating against it because it highlights things that Dask is bad at and that will come up in other workloads (like incremental updates).

I'd still suggest merging the partitions, additionally adding a ticket as suggested by @phofl, and re-testing this once we have become better at estimating sizes/cost in dask-expr.

phofl commented 7 months ago

> I'd still suggest merging the partitions, additionally adding a ticket as suggested by @phofl, and re-testing this once we have become better at estimating sizes/cost in dask-expr.

FWIW, we could do this today in dask-expr, but we need information from read_parquet to use a more sensible partitioning. It's very similar to what we do with projections. Touching the parquet implementation before we rewrite it doesn't seem smart, though.
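To illustrate the kind of information that would be needed, here is a rough sketch using parquet footer metadata; this is not dask-expr internals, and the path and target size are made up:

```python
import math
import pyarrow.dataset as ds

# Inspect parquet metadata up front and derive a partition count from the
# total byte size, instead of defaulting to one partition per input file.
dataset = ds.dataset(
    "s3://coiled-benchmark-data/tpch/scale-1000/nation/", format="parquet"
)
total_bytes = 0
for frag in dataset.get_fragments():
    meta = frag.metadata  # parquet FileMetaData, read from the file footer
    total_bytes += sum(
        meta.row_group(i).total_byte_size for i in range(meta.num_row_groups)
    )

target = 128 * 2**20                    # aim for roughly 128 MiB per partition
npartitions = max(1, math.ceil(total_bytes / target))
print(npartitions)                      # 1 for a tiny table like nation
```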

hendrikmakait commented 7 months ago

I've created an upstream issue: https://github.com/dask-contrib/dask-expr/issues/869

hendrikmakait commented 7 months ago

For now, I've manually replaced the partitioned datasets with unpartitioned ones to move forward with benchmarking. I have created https://github.com/coiled/benchmarks/issues/1386 to track the work necessary to update the data generation script.
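For reference, the manual replacement is essentially the following, with placeholder paths; the tables are tiny, so pandas is enough:

```python
import pandas as pd

# nation and region fit comfortably in memory, so read each partitioned
# dataset and rewrite it as a single unpartitioned parquet file.
for table in ["nation", "region"]:
    df = pd.read_parquet(f"s3://coiled-benchmark-data/tpch/scale-1000/{table}/")
    df.to_parquet(
        f"s3://coiled-benchmark-data/tpch/scale-1000/{table}.parquet", index=False
    )
```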