duckdblabs / db-benchmark

reproducible benchmark of database-like ops
https://duckdblabs.github.io/db-benchmark/
Mozilla Public License 2.0
136 stars 27 forks source link

Update name for Arrow R package #66

Closed amoeba closed 7 months ago

amoeba commented 8 months ago

Hi @Tmonster, I contribute to the Arrow R package and noticed Arrow's inclusion on https://duckdblabs.github.io/db-benchmark/ which is great to see.

I think a minor improvement to the way this benchmark is presented would be to make it clearer that a specific implementation of Arrow is what's being benchmarked since the name Arrow by itself generally refers to the columnar format and not any one implementation.

I've updated the README and made my best guess at changing the name in the plots since I didn't run them. Let me know if that looks like it'll do the trick. For the changed name for the plots, I opted to follow the style you're already using with (py)data.table but feel free to suggest something else. We generally refer to the Arrow R package as "R arrow" or "arrow R" so any string which contains those would be good.

Tmonster commented 8 months ago

Hi @amoeba,

I just forked the benchmark and started maintaining it as it was before, so sorry about the mixup. I took a look at your PR and I think a bit more work is required to change the name from arrow to R-arrow.

Most switches are minor, you could almost just do a find replace with "arrow" and "r-arrow". One thing that isn't simple to fix is how to deal with the old results. Results are loaded and divided based on solution name, dataset, question etc. I don't necessarily want to change the name of all of the old results since that can muddy the history of the time.csv and logs.csv.

I'll look into how we can change those values once the results are loaded for report generation. I'll get back to you on this

jangorecki commented 8 months ago

Easiest to update to new name in report.R file

eitsupi commented 8 months ago

I am wondering about the need to mark it as Acero as per the discussion in the old repo. h2oai/db-benchmark#229

jangorecki commented 8 months ago

This is the place for cleaning up time.csv entries if something got renamed or produced different results in the past etc. https://github.com/duckdblabs/db-benchmark/blob/49016238fdd78dcbc4e5119bb0cd34487ca5419c/_report/report.R#L68-L84

dt[solution=="arrow", "solution" := "newname"]
amoeba commented 8 months ago

In the event that updating past results is hard or problematic, I think a change going forward would be just fine. Thanks for being willing to look into this @Tmonster and let me know if I can help.

@eitsupi: Thanks for finding that thread. While Acero makes some sense, or even arrow-dplyr as was also suggested in that thread, I think "R arrow" or similar is still probably the best. While Acero is mostly what's responsible for the features and performance the benchmark is reporting on, (1) the Arrow R package is the installable thing the benchmark runs and (2) the code snippet shown next to each result isn't really Acero in the broad sense but the novel bindings the Arrow R package built on top. I know you're familiar with all of the above so the above so this is just to explain my rationale for the proposed rename. Happy to continue this discussion here or elsewhere though.

eitsupi commented 8 months ago

Thanks for your response.

I think the point here is that in most other cases, the "engine" is used as the name, not the language or API. That said, I think data.table, Spark, DuckDB, and Polars can be manipulated via dplyr, but I don't think we are actually using dplyr here, so something like "dplyr-Acero" could be appropriate to explicitly state that it is dplyr API.

The name Acero is now in dplyr's README, so users should be able to quickly find out what it is.

thisisnic commented 8 months ago

IMO dplyr-Acero is OK as long as both components are mentioned as it's not a good test of just Acero given that the R package is adding a lot more in here. That said, I feel like "arrow R package" or "R arrow" is much more recognisable shorthand for what is effectively multiple things bundled together here as there are more things in the R package which may affect the performance than just Acero and our dplyr bindings - the internal R code which works out how to construct ExecPlans to run in Acero, how we make use of ALTREP, etc.

Tmonster commented 8 months ago

Hi @amoeba

I have a branch up here that should handle all of the logic to rename arrow to R-arrow in the report and also during future logging. It is running CI right now, but feel free to copy the PR, change the name to something else if desired and open it up here against master. If CI is green I can merge and also update the report.

https://github.com/Tmonster/db-benchmark/pull/10

amoeba commented 7 months ago

Thanks @Tmonster, "R-arrow" is great. And thanks for jumping on this so quickly, we appreciate it.

Tmonster commented 7 months ago

PR for this is up here! https://github.com/duckdblabs/db-benchmark/pull/68

amoeba commented 7 months ago

Thanks @Tmonster!