Closed milesgranger closed 7 months ago
This refactoring was already discussed multiple times. You can browse the h2oai repo, or possibly also this one. In short, the aim was to make the scripts easily reproducible line-by-line interactively, with the code easily matching between languages and between solutions in different scripts. Wrapping into extra functions makes that impossible. This has very practical aspects: as it turned out, far in the past some solutions were returning different results for different runs of the same query. Having scripts that match exactly between the different tools you compare against is very handy.
Okay, I'll remove that refactoring then.
But just so I understand, this:
question = "sum v1 by id1" # q1
gc.collect()
t_start = timeit.default_timer()
ans = x.groupby('id1', dropna=False, observed=True).agg({'v1':'sum'}).compute()
ans.reset_index(inplace=True) # #68
print(ans.shape, flush=True)
t = timeit.default_timer() - t_start
m = memory_usage()
t_start = timeit.default_timer()
chk = [ans.v1.sum()]
chkt = timeit.default_timer() - t_start
write_log(task=task, data=data_name, in_rows=in_rows, question=question, out_rows=ans.shape[0], out_cols=ans.shape[1], solution=solution, version=ver, git=git, fun=fun, run=1, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
del ans
gc.collect()
t_start = timeit.default_timer()
ans = x.groupby('id1', dropna=False, observed=True).agg({'v1':'sum'}).compute()
ans.reset_index(inplace=True)
print(ans.shape, flush=True)
t = timeit.default_timer() - t_start
m = memory_usage()
t_start = timeit.default_timer()
chk = [ans.v1.sum()]
chkt = timeit.default_timer() - t_start
write_log(task=task, data=data_name, in_rows=in_rows, question=question, out_rows=ans.shape[0], out_cols=ans.shape[1], solution=solution, version=ver, git=git, fun=fun, run=2, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
print(ans.head(3), flush=True)
print(ans.tail(3), flush=True)
del ans
Is preferable to this:
@bench("sum v1 by id1")  # q1
def sum_v1_by_id1(x, client):
    ans = x.groupby("id1", dropna=False, observed=True).agg({"v1": "sum"}).compute()
    ans.reset_index(inplace=True)  # #68
    return ans
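For context, the bench decorator referenced above was only sketched in this discussion; a minimal version that hoists the repeated per-run boilerplate (gc.collect(), timing, logging) out of each query might look like the following. This is an assumption-laden illustration, not the actual PR code: write_log, memory_usage, and the dask computation are stubbed out.

```python
import functools
import gc
import timeit

def bench(question, runs=2):
    # Hypothetical sketch: wrap a single query function and perform the
    # per-run steps that the verbose scripts repeat inline for run 1 and run 2.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            results = []
            for run in range(1, runs + 1):
                gc.collect()
                t_start = timeit.default_timer()
                ans = func(*args, **kwargs)
                t = timeit.default_timer() - t_start
                # A real implementation would call write_log(..., run=run,
                # time_sec=t, ...) here; we just print the timing instead.
                print(f"{question} run={run} time_sec={t:.6f}", flush=True)
                results.append(ans)
            return results
        return wrapper
    return decorator

@bench("sum v1 by id1")  # q1
def sum_v1_by_id1(rows):
    # Stand-in for the real dask groupby/agg/compute pipeline.
    return sum(rows)

answers = sum_v1_by_id1([1, 2, 3])
print(answers)  # [6, 6] -- the same answer from both timed runs
```

The trade-off debated in this thread is visible here: the boilerplate is written once, but a reader can no longer paste the script into a REPL line-by-line and inspect intermediate state between steps.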
Wrapping into extra functions makes it impossible.
From my viewpoint, it would actually improve the objective you describe. It is easier to discern differences between implementations when each one uses small functions to delineate its logic, rather than having that logic surrounded by repeated blocks of boilerplate code.
But alright, in the end it's up to the owners, I'll respect that. :+1:
Yes, the verbose form is preferred. We preferred to be verbose and easily matching between solutions, and to be able to run interactively line-by-line. But... I am not an official maintainer anymore, so I am just explaining why it was made like that, and why it was (and IMO still is) considered good :) My opinion is probably biased due to the amount of work I had to put into debugging issues in almost every tool added, at least in the early days (the project started in 2016).
@Tmonster can I get your review when you have time? Thanks!
Hi @milesgranger ,
Seems like dask still has some issues when running the mini benchmark CI. Could you take a look?
Thanks
Sorry about that; q9 had also been fixed before the refactor, but that fix was lost after reverting. Should be okay now. :)
Hi Miles,
Did a quick run of the benchmarks last week. I will update the report this week, which will include the renaming of arrow to R-arrow and the new ClickHouse results.
Sounds great! I'm assuming it included the fixes in this PR? Is there anything you need from me to merge this?
gentle ping @Tmonster
Hi @milesgranger,
Yes, the updates include the changes from this PR. I don't need anything from you for merging. I should be able to update the report today.
New dask results are up!
Hi,
This takes off from #58; I suppose we can close that one in favor of this.
I thought that with all the repeated code, errors could (and did) occur, so I've refactored it a bit. Each 'question' gets its own function, and the calls for garbage collection and other per-run code have been moved into a decorator, bench.
It turned out that on current master some of the id columns are set to int32 dtypes, which results in the publicized benchmarks reporting dask as having an internal error because of a bad implementation. This is unfortunate.
Additionally, I've changed the hard-coded dtypes to be inferred by dask, this results in no categorical dtypes which further improves performance.
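As a simplified illustration of that dtype change (using pandas, whose read_csv API dask's dd.read_csv mirrors; the column names and data here are made up, not taken from the benchmark):

```python
import io
import pandas as pd

csv_text = "id1,v1\nid001,5\nid002,7\n"

# Old approach in the scripts: hard-coded dtypes, forcing id columns to category.
x_cat = pd.read_csv(io.StringIO(csv_text), dtype={"id1": "category", "v1": "int64"})

# New approach: let the reader infer dtypes, so no categorical columns appear.
x_inf = pd.read_csv(io.StringIO(csv_text))

print(x_cat["id1"].dtype)  # category
print(x_inf["id1"].dtype)  # object (inferred)
```

With inference, the id columns come back as plain string/object columns rather than categoricals, which is the behavior the performance improvement above refers to.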
I've taken the liberty of re-running the benchmarks (0.5GB, 5GB and 50GB) on a c6i.metal instance and can confirm they all run, with the following caveats:
Query 8 "largest two v3 by id6" needs SeriesGroupBy.nlargest, which dask is missing; implementing it would make things much better. I'll link the relevant issue later. The open question here is whether I should leave the query commented out or raise a NotImplementedError.
Query 10 "sum v3 count by id1:id6" did fail on 50GB, which we'll try to find time to address later as well.
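For reference, this is what q8 looks like in pandas, which does implement SeriesGroupBy.nlargest; it is exactly this method that dask's groupby lacks. A toy sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "id6": [1, 1, 1, 2, 2],
    "v3": [3.0, 9.0, 5.0, 1.0, 2.0],
})

# "largest two v3 by id6": the top-2 v3 values within each id6 group.
# nlargest returns a Series with a (id6, original row index) MultiIndex;
# resetting the "id6" level turns the group key back into a column.
ans = df.groupby("id6")["v3"].nlargest(2).reset_index(level="id6")
print(ans)
```

In dask the same expression fails on the nlargest call, hence the choice between commenting the query out and raising NotImplementedError explicitly.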
Thank you!