duckdb / duckdb-r

The duckdb R package
https://r.duckdb.org/

Execute queries without registering Data Frames / Arrow Tables #140

Open eitsupi opened 7 months ago

eitsupi commented 7 months ago

From duckdb/duckdb#6771

It is convenient in the Python client to specify the target of a query without having to register pandas.DataFrame, etc., so it would be nice to have the same functionality in R.
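
For context, here is a minimal sketch of the explicit registration step this request would make unnecessary. It uses `duckdb_register()` from the duckdb package together with DBI; the data frame `my_df` and its column `x` are made up for illustration.

```r
library(DBI)

# In-memory DuckDB connection.
con <- dbConnect(duckdb::duckdb())

# Illustrative data frame; the name and contents are assumptions for this sketch.
my_df <- data.frame(x = 1:3)

# Explicitly expose the data frame to DuckDB as a virtual table named "my_df".
duckdb::duckdb_register(con, "my_df", my_df)

# The query can now refer to the registered name.
result <- dbGetQuery(con, "SELECT sum(x) AS total FROM my_df")

dbDisconnect(con, shutdown = TRUE)
```

The request is for the query to work without the `duckdb_register()` call, as it does for pandas data frames in the Python client.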

krlmlr commented 6 months ago

Are you proposing to make available all data-frame-like things for accessing them by name? From what environment -- the current environment plus all parent (enclosing) environments?

How would we control this feature? I suspect just enabling it might lead to surprises for existing code.

What about name clashes? Which object gets priority if a table by that name exists already?

eitsupi commented 6 months ago

> What about name clashes? Which object gets priority if a table by that name exists already?

I forget where I read this, but I believe DuckDB can look elsewhere for a table that does not exist in the database and use whatever it finds.

For example, the behavior already differs depending on whether a table named test.csv exists, as shown below. (Needless to say, if no table named test.csv exists, DuckDB looks for a CSV file named test.csv and uses it as a virtual table.)

data.frame(bar = 2) |>
  write.csv("test.csv", row.names = FALSE)
duckdb:::sql('CREATE SCHEMA "test"; CREATE TABLE "test.csv" AS SELECT 1 AS foo; FROM "test.csv"')
#>   foo
#> 1   1

Created on 2024-04-24 with reprex v2.0.2

data.frame(bar = 2) |>
  write.csv("test.csv", row.names = FALSE)
duckdb:::sql('FROM "test.csv"')
#>   bar
#> 1   2

Created on 2024-04-24 with reprex v2.0.2

> How would we control this feature? I suspect just enabling it might lead to surprises for existing code.

Since the Python API already works this way, this is hardly a problem.

> From what environment -- the current environment plus all parent (enclosing) environments?

Perhaps an additional argument is needed to specify the environment.
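
To make the environment question concrete, here is a rough base R sketch of a lookup that takes the environment as an argument. `find_data_frame()` is a hypothetical helper invented for this illustration and is not part of the duckdb package; the sketch only shows how a name could be resolved starting from a given environment and walking up through its enclosing environments.

```r
# Hypothetical helper, not part of duckdb: find a data frame by name,
# starting in `envir` and continuing through its enclosing environments.
find_data_frame <- function(name, envir = parent.frame()) {
  obj <- get0(name, envir = envir, inherits = TRUE)
  # Anything that is not a data frame (or is not found at all) yields NULL.
  if (is.data.frame(obj)) obj else NULL
}

lookup_demo <- function() {
  local_df <- data.frame(a = 1)
  list(
    found   = find_data_frame("local_df"),       # defined in the calling frame
    missing = find_data_frame("no_such_object"), # no binding anywhere
    not_df  = find_data_frame("mean")            # found, but it is a function
  )
}

res <- lookup_demo()
```

Defaulting `envir` to the caller's frame is only one possible choice. Note also that `get0()` stops at the first binding with that name, so a non-data-frame object in an inner environment would hide a data frame of the same name further up.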

hannes commented 6 months ago

Indeed, this should be straightforward: the R package can define a so-called replacement scan for this.
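
As a conceptual model only, a replacement scan can be pictured as a fallback that is consulted when a table name is not found in the database, which matches the test.csv behavior shown earlier in this thread. The base R sketch below mimics that order of resolution. `resolve_table()` and the named list standing in for the database catalog are assumptions for illustration; this is not how the duckdb package implements it.

```r
# Conceptual model, not duckdb's implementation. `catalog` is a named list
# standing in for the tables that already exist in the database.
resolve_table <- function(name, catalog, envir = parent.frame()) {
  # 1. An existing table always wins.
  if (name %in% names(catalog)) {
    return(list(source = "catalog", data = catalog[[name]]))
  }
  # 2. Fallback ("replacement"): look for a data frame with that name.
  candidate <- get0(name, envir = envir, inherits = TRUE)
  if (is.data.frame(candidate)) {
    return(list(source = "data.frame", data = candidate))
  }
  stop("Table '", name, "' does not exist")
}

catalog <- list(sales = data.frame(foo = 1))
sales   <- data.frame(bar = 2)  # same name as a catalog table
orders  <- data.frame(baz = 3)  # only exists as an R object

from_catalog <- resolve_table("sales", catalog)
from_env     <- resolve_table("orders", catalog)
```

Under this model, existing code that queries real tables keeps its behavior, because the fallback is only reached for names that would otherwise be an error.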

hannes commented 6 months ago

I looked into this; regrettably, it's not that straightforward for Arrow. So for now, let's just do this for data.frames.

hannes commented 6 months ago

First stab is here: https://github.com/duckdb/duckdb-r/pull/164