Closed milesgranger closed 7 months ago
This refactoring was already discussed multiple times. You can browse the h2oai repo, or possibly also this one. In short, the aim was to make the scripts easily reproducible line-by-line interactively, with the code easily matching between languages and between solutions in different scripts. Wrapping into extra functions makes that impossible. This has very practical aspects: as it turned out, far in the past some solutions were returning different results for different runs of the same query. Having scripts that match exactly between the different tools you compare against is very handy.
Okay, I'll remove that refactoring then.
But just so I understand, this:
question = "sum v1 by id1" # q1
gc.collect()
t_start = timeit.default_timer()
ans = x.groupby('id1', dropna=False, observed=True).agg({'v1':'sum'}).compute()
ans.reset_index(inplace=True) # #68
print(ans.shape, flush=True)
t = timeit.default_timer() - t_start
m = memory_usage()
t_start = timeit.default_timer()
chk = [ans.v1.sum()]
chkt = timeit.default_timer() - t_start
write_log(task=task, data=data_name, in_rows=in_rows, question=question, out_rows=ans.shape[0], out_cols=ans.shape[1], solution=solution, version=ver, git=git, fun=fun, run=1, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
del ans
gc.collect()
t_start = timeit.default_timer()
ans = x.groupby('id1', dropna=False, observed=True).agg({'v1':'sum'}).compute()
ans.reset_index(inplace=True)
print(ans.shape, flush=True)
t = timeit.default_timer() - t_start
m = memory_usage()
t_start = timeit.default_timer()
chk = [ans.v1.sum()]
chkt = timeit.default_timer() - t_start
write_log(task=task, data=data_name, in_rows=in_rows, question=question, out_rows=ans.shape[0], out_cols=ans.shape[1], solution=solution, version=ver, git=git, fun=fun, run=2, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
print(ans.head(3), flush=True)
print(ans.tail(3), flush=True)
del ans
Is preferable to this:
@bench("sum v1 by id1")  # q1
def sum_v1_by_id1(x, client):
    ans = x.groupby("id1", dropna=False, observed=True).agg({"v1": "sum"}).compute()
    ans.reset_index(inplace=True)  # #68
    return ans
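For context, the bench decorator referenced above was only sketched in this discussion; a minimal version that hoists the repeated per-run boilerplate (gc.collect(), timing, logging) out of each query might look like the following. This is an assumption-laden illustration, not the actual PR code: write_log, memory_usage, and the dask computation are stubbed out.

```python
import functools
import gc
import timeit

def bench(question, runs=2):
    # Hypothetical sketch: wrap a single query function and perform the
    # per-run steps that the verbose scripts repeat inline for run 1 and run 2.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            results = []
            for run in range(1, runs + 1):
                gc.collect()
                t_start = timeit.default_timer()
                ans = func(*args, **kwargs)
                t = timeit.default_timer() - t_start
                # A real implementation would call write_log(..., run=run,
                # time_sec=t, ...) here; we just print the timing instead.
                print(f"{question} run={run} time_sec={t:.6f}", flush=True)
                results.append(ans)
            return results
        return wrapper
    return decorator

@bench("sum v1 by id1")  # q1
def sum_v1_by_id1(rows):
    # Stand-in for the real dask groupby/agg/compute pipeline.
    return sum(rows)

answers = sum_v1_by_id1([1, 2, 3])
print(answers)  # [6, 6] -- the same answer from both timed runs
```

The trade-off debated in this thread is visible here: the boilerplate is written once, but a reader can no longer paste the script into a REPL line-by-line and inspect intermediate state between steps.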
Wrapping into extra functions makes it impossible.
From my viewpoint, it would actually improve the objective you describe. It is easier to discern differences between implementations when each one uses small functions to delineate its logic, rather than having that logic surrounded by repeated blocks of boilerplate code.
But alright, in the end it's up to the owners, I'll respect that. :+1:
Yes, the verbose form is preferred. We preferred to be verbose and easily matching between solutions, and to be able to run interactively line-by-line. But... I am not an official maintainer anymore, so I am just explaining why it was made like that, and why it was (and IMO still is) considered good :) My opinion is probably biased due to the amount of work I had to put into debugging issues in almost every tool added, at least in the early days (the project started in 2016).
@Tmonster can I get your review when you have time? Thanks!
Hi @milesgranger ,
Seems like dask still has some issues when running the mini benchmark CI. Could you take a look?
Thanks
Sorry about that; q9 had also been fixed before the refactor, but that fix was lost after reverting. Should be okay now. :)
Hi Miles,
Did a quick run of the benchmarks last week. I will update the report this week, which will include the renaming of arrow to R-arrow and the new ClickHouse results.
Sounds great! I'm assuming it included the fixes in this PR? Is there anything you need from me to merge this?
gentle ping @Tmonster
Hi @milesgranger,
Yes, the updates include the changes from this PR. I don't need anything from you for merging. I should be able to update the report today.
New dask results are up!
Hi,
This takes off from #58; I suppose we can close that one in favor of this.
I thought that with all the repeated code, errors could (and did) occur, so I've refactored it a bit. Each 'question' gets its own function, and the calls for garbage collection and other per-run code have been moved into a decorator, bench.
It turned out that on current master some of the id columns are set to int32 dtypes, which results in the publicized benchmarks reporting dask as having an internal error because of a bad implementation. This is unfortunate.
Additionally, I've changed the hard-coded dtypes to be inferred by dask, this results in no categorical dtypes which further improves performance.
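As a simplified illustration of that dtype change (using pandas, whose read_csv API dask's dd.read_csv mirrors; the column names and data here are made up, not taken from the benchmark):

```python
import io
import pandas as pd

csv_text = "id1,v1\nid001,5\nid002,7\n"

# Old approach in the scripts: hard-coded dtypes, forcing id columns to category.
x_cat = pd.read_csv(io.StringIO(csv_text), dtype={"id1": "category", "v1": "int64"})

# New approach: let the reader infer dtypes, so no categorical columns appear.
x_inf = pd.read_csv(io.StringIO(csv_text))

print(x_cat["id1"].dtype)  # category
print(x_inf["id1"].dtype)  # object (inferred)
```

With inference, the id columns come back as plain string/object columns rather than categoricals, which is the behavior the performance improvement above refers to.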
I've taken the liberty of re-running the benchmarks (0.5GB, 5GB and 50GB) on a c6i.metal instance and can confirm they all run, with the following caveats:
Query 8 "largest two v3 by id6" needs SeriesGroupBy.nlargest, which dask is missing; implementing it would make things much better. I'll link the relevant issue later. The open question here is whether I should leave the query commented out or raise a NotImplementedError.
Query 10 "sum v3 count by id1:id6" did fail on 50GB, which we'll try to find time to address later as well.
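For reference, this is what q8 looks like in pandas, which does implement SeriesGroupBy.nlargest; it is exactly this method that dask's groupby lacks. A toy sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "id6": [1, 1, 1, 2, 2],
    "v3": [3.0, 9.0, 5.0, 1.0, 2.0],
})

# "largest two v3 by id6": the top-2 v3 values within each id6 group.
# nlargest returns a Series with a (id6, original row index) MultiIndex;
# resetting the "id6" level turns the group key back into a column.
ans = df.groupby("id6")["v3"].nlargest(2).reset_index(level="id6")
print(ans)
```

In dask the same expression fails on the nlargest call, hence the choice between commenting the query out and raising NotImplementedError explicitly.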
Thank you!