duckdblabs / db-benchmark

reproducible benchmark of database-like ops
https://duckdblabs.github.io/db-benchmark/
Mozilla Public License 2.0
143 stars 27 forks

[WIP] New solution: `r-collapse` #33

Closed vincentarelbundock closed 10 months ago

vincentarelbundock commented 11 months ago

Attempt at a new solution, as requested by @grantmcdermott in Issue #3: the collapse package for R.

Website: https://sebkrantz.github.io/collapse/

Tagging @Tmonster and @MichaelChirico, who have expressed interest on Twitter. Tagging the collapse author @SebKrantz because my solutions may not be the most efficient.

Things seem to work for me locally:

collapse

 ./_launcher/solution.R --solution=collapse --task=groupby --nrow=1e7 --quiet=true

   on_disk                    question run time_sec
1    FALSE               sum v1 by id1   1    0.157
2    FALSE               sum v1 by id1   2    0.146
3    FALSE           sum v1 by id1:id2   1    0.309
4    FALSE           sum v1 by id1:id2   2    0.262
5    FALSE       sum v1 mean v3 by id3   1    0.325
6    FALSE       sum v1 mean v3 by id3   2    0.315
7    FALSE           mean v1:v3 by id4   1    0.175
8    FALSE           mean v1:v3 by id4   2    0.176
9    FALSE            sum v1:v3 by id6   1    0.355
10   FALSE            sum v1:v3 by id6   2    0.333
11   FALSE  median v3 sd v3 by id4 id5   1    0.738
12   FALSE  median v3 sd v3 by id4 id5   2    0.486
13   FALSE      max v1 - min v2 by id3   1    0.303
14   FALSE      max v1 - min v2 by id3   2    0.307
15   FALSE       largest two v3 by id6   1    1.335
16   FALSE       largest two v3 by id6   2    1.265
17   FALSE regression v1 v2 by id2 id4   1    0.783
18   FALSE regression v1 v2 by id2 id4   2    0.775
19   FALSE     sum v3 count by id1:id6   1    1.571
20   FALSE     sum v3 count by id1:id6   2    1.524
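For readers unfamiliar with collapse, here is a hedged sketch of what a question like "sum v1 by id1" might look like using collapse's fast grouping verbs, shown on toy data rather than the benchmark's generated datasets (the actual benchmark script may phrase it differently):

```r
# Minimal sketch, not the benchmark implementation: grouping with collapse's
# fgroup_by()/fsummarise() and the fast statistical function fsum().
library(collapse)

x <- data.frame(id1 = c("a", "a", "b"), v1 = c(1, 2, 3))

res <- x |>
  fgroup_by(id1) |>
  fsummarise(v1 = fsum(v1))
res
```

The fast statistical functions (fsum, fmean, fmedian, etc.) operate directly on grouped data without materializing per-group copies, which is where much of collapse's speed comes from.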

data.table

./_launcher/solution.R --solution=data.table --task=groupby --nrow=1e7 --quiet=true
   on_disk                    question run time_sec
1    FALSE               sum v1 by id1   1    0.157
2    FALSE               sum v1 by id1   2    0.106
3    FALSE           sum v1 by id1:id2   1    0.115
4    FALSE           sum v1 by id1:id2   2    0.148
5    FALSE       sum v1 mean v3 by id3   1    0.181
6    FALSE       sum v1 mean v3 by id3   2    0.188
7    FALSE           mean v1:v3 by id4   1    0.251
8    FALSE           mean v1:v3 by id4   2    0.203
9    FALSE            sum v1:v3 by id6   1    0.241
10   FALSE            sum v1:v3 by id6   2    0.562
11   FALSE  median v3 sd v3 by id4 id5   1    0.640
12   FALSE  median v3 sd v3 by id4 id5   2    0.632
13   FALSE      max v1 - min v2 by id3   1    0.564
14   FALSE      max v1 - min v2 by id3   2    0.542
15   FALSE       largest two v3 by id6   1    0.542
16   FALSE       largest two v3 by id6   2    0.498
17   FALSE regression v1 v2 by id2 id4   1    0.628
18   FALSE regression v1 v2 by id2 id4   2    0.596
19   FALSE     sum v3 count by id1:id6   1    1.464
20   FALSE     sum v3 count by id1:id6   2    1.438
jangorecki commented 11 months ago

Very nice. What is still missing here is the benchplot dictionary, and possibly entries in the report.R script as well. If you are able to generate the final report, then it is good; otherwise, something must still be missing. It would also be nice to have the 'join' task, assuming collapse supports it. I recall rejecting a solution (not sure which one it was; vaex/ray or some other pandas-based one) because it had grouping only. If join is not available in collapse, it could potentially fall back to R's join, the same way the arrow join used to fall back to the dplyr join. Adding a solution only for grouping is not ideal.

vincentarelbundock commented 11 months ago

Thanks for taking a look!

Joins will be available in collapse 2.0.0, to be released in about 1 month.

The author @SebKrantz says that he will submit a complete PR after the new version is released.

Since he wants to do this himself, there is no point in me working on this PR anymore, but I will leave it here in case it is useful to get Seb started.

kadyb commented 10 months ago

FYI: The new version of {collapse} with the join() function is now available!
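As a rough illustration of the new API (a sketch on toy data; the benchmark's join task uses its own generated tables and may call this differently):

```r
# Minimal sketch of collapse 2.0's join(): `on` names the key column(s)
# and `how` selects the join type (the default is a left join).
library(collapse)

x <- data.frame(id = c(1, 2, 3), v = c(10, 20, 30))
y <- data.frame(id = c(2, 3, 4), w = c("a", "b", "c"))

res <- join(x, y, on = "id", how = "inner")
res
```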

vincentarelbundock commented 10 months ago

Great news! As noted above, I will not work on this PR anymore and will wait for @SebKrantz to complete it. (I don't mean to put pressure by tagging; he can take his time. I'm just signifying my intent.)

SebKrantz commented 10 months ago

Thanks! Yeah, I wanted to wait for the release (so we're not working with a development version), but I will now try to do this before the end of the weekend. If you have additional capacity during the week, @vincentarelbundock, I'd actually welcome it if you want to finish the PR from your side. I'll just need to do some final oversight to ensure we're comparing apples with apples and that the implementation is efficient. Again, thanks for the initiative!

vincentarelbundock commented 10 months ago

I believe the last two commits complete the config as suggested by @jangorecki.

I also added a join script but wrote it pretty quickly, and I'm unlikely to have more time for this. @SebKrantz, you probably want to have a careful side-by-side look to make sure the solutions are identical.

SebKrantz commented 10 months ago

I have adjusted the scripts to set global options that appear optimal for performance and presentation (though I was unable to test on 1 billion rows), and adjusted the expressions by @vincentarelbundock a bit toward what I think would be an ideally efficient collapse way of doing this. I am still waiting for @vincentarelbundock to merge my changes to benchplot-dict, but otherwise I think this PR is ready for review.
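For context, collapse exposes package-wide defaults through `set_collapse()`; something along these lines is the kind of global tuning meant here (the specific values are illustrative, not necessarily what the benchmark scripts use):

```r
# Hedged sketch: package-wide collapse options. The exact settings chosen
# for the benchmark may differ from these illustrative values.
library(collapse)

set_collapse(
  nthreads = 4,    # use multiple threads where supported
  na.rm = FALSE,   # skip NA checks when the data has no missing values
  sort = FALSE     # unsorted grouping is typically faster
)
```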

SebKrantz commented 10 months ago

Thanks @vincentarelbundock. @Tmonster the PR is ready for review.

SebKrantz commented 10 months ago

Thanks @Tmonster. @vincentarelbundock to do this in the current PR you need to update your fork, and then I need to send a PR to your fork again… Thanks!

vincentarelbundock commented 10 months ago

@SebKrantz done

SebKrantz commented 10 months ago

@vincentarelbundock thanks. @Tmonster we are ready.

Tmonster commented 10 months ago

@SebKrantz Thanks, seems like there are errors in the out file, and the time.csv / logs.csv files can't be validated. Could you have a look?

Tmonster commented 10 months ago

If you merge with master, then the regression yaml will upload a separate out file for every solution

vincentarelbundock commented 10 months ago

I merged with master.

Tmonster commented 10 months ago

Great! I'm not sure that will fix the issues, though. You should try running regression.yml yourself until you hit the error; then you can see what's wrong. For some reason, the result of one query differs between two runs. Usually this is a syntax error. We do have a threshold built in, so it won't be that.

I would suggest running the benchmark via the steps in the regression.yml script and seeing what's wrong.

SebKrantz commented 10 months ago

@Tmonster my understanding is that this issue with dask is not caused by collapse?

Tmonster commented 10 months ago

Great, thanks! I know about the dask issue and will fix it eventually.