duckdblabs / duckplyr

A drop-in replacement for dplyr, powered by DuckDB for performance.
https://duckdblabs.github.io/duckplyr/
Other
225 stars 12 forks source link

Need to be stricter about column compatibility in joins #168

Closed krlmlr closed 2 weeks ago

krlmlr commented 2 months ago

Also, use simple identity instead of r_base::== for joins.

dplyr::union_all(data.frame(a = 1), data.frame(a = "2"))
#> Error in `dplyr::union_all()`:
#> ! `x` and `y` are not compatible.
#> ✖ Incompatible types for column `a`: double vs character.
dplyr::setdiff(data.frame(a = 1), data.frame(a = "2"))
#> Error in `dplyr::setdiff()`:
#> ! `x` and `y` are not compatible.
#> ✖ Incompatible types for column `a`: double vs character.
dplyr::left_join(data.frame(a = 1), data.frame(a = "2"), by = "a")
#> Error in `dplyr::left_join()`:
#> ! Can't join `x$a` with `y$a` due to incompatible types.
#> ℹ `x$a` is a <double>.
#> ℹ `y$a` is a <character>.
dplyr::semi_join(data.frame(a = 1), data.frame(a = "2"), by = "a")
#> Error in `dplyr::semi_join()`:
#> ! Can't join `x$a` with `y$a` due to incompatible types.
#> ℹ `x$a` is a <double>.
#> ℹ `y$a` is a <character>.

duckplyr:::duckplyr_union_all(data.frame(a = 1), data.frame(a = "2"))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Union
#>   r_dataframe_scan(0x11d52b258)  r_dataframe_scan(0x11d0c4e38)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (VARCHAR)
#> 
#>     a
#> 1 1.0
#> 2   2
duckplyr:::duckplyr_setdiff(data.frame(a = 1), data.frame(a = "2"))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Except
#>   r_dataframe_scan(0x11d556c70)  r_dataframe_scan(0x11d55aea0)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (VARCHAR)
#> 
#>     a
#> 1 1.0
duckplyr:::duckplyr_left_join(data.frame(a = 1), data.frame(a = "2"), by = "a")
#> Error in `left_join()` at duckplyr/R/left_join.R:19:3:
#> ! Can't join `x$a` with `y$a` due to incompatible types.
#> ℹ `x$a` is a <double>.
#> ℹ `y$a` is a <character>.
duckplyr:::duckplyr_semi_join(data.frame(a = 1), data.frame(a = "2"), by = "a")
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Join REGULAR SEMI ___eq_na_matches_na(lhs.a, rhs.a)
#>   r_dataframe_scan(0x14aace6d0)
#>   r_dataframe_scan(0x13b892678)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (DOUBLE)
#> 
#> [1] a
#> <0 rows> (or 0-length row.names)

Created on 2024-05-16 with reprex v2.1.0

nicki-dese commented 2 months ago

Hi, just to add, I've found an issue that I think is related:

I was finding the maximum value for several variables using the following code:

vector_of_file_paths |>
  map(\(path)
    duckplyr_df_from_csv(path) |>
    select(any_of(vector_of_numeric_col_names)) |>
    rename_with(tolower) |>
    head(5) |>
    summarise(across(everything(), ~ max(.x, na.rm = T))) |>
 bind_rows()

and it was failing at the last step when a column only happened to have NAs in the first 5 rows. If I didn't row bind, I could see the result column had a value of -Inf. I can see from the output during run that these result columns were interpreted by duckplyr as "varchar"

I am unable to share logs from the environment I work in, sorry.

krlmlr commented 2 months ago

Thanks, @nicki-dese, this is a different problem. Can you please open a new issue?

A reprex would be very useful. This would be a self-contained example with toy data, see, e.g., https://reprex.tidyverse.org/articles/reprex-dos-and-donts.html .

krlmlr commented 2 months ago

Set operations are good now, joins need https://github.com/tidyverse/dplyr/pull/7029.

nicki-dese commented 1 month ago

Sorry for the delayed reply. I tried generating a reprex and it didn’t error in the same way. If I can manage to reproduce, I’ll post a new issue.

Nicki Norris Assistant Director - Data Phone (02) 6240 8969

From: Kirill Müller @.> Sent: Tuesday, May 21, 2024 2:48 AM To: duckdblabs/duckplyr @.> Cc: NORRIS,Nicki @.>; Mention @.> Subject: Re: [duckdblabs/duckplyr] Need to be stricter about column compatibility (Issue #168)

CAUTION: This email originated from outside of the organisation. Do not click links or open attachments unless you recognise the sender and know the content is safe.

Thanks, @nicki-desehttps://github.com/nicki-dese, this is a different problem. Can you please open a new issue?

A reprex would be very useful. This would be a self-contained example with toy data, see, e.g., https://reprex.tidyverse.org/articles/reprex-dos-and-donts.html .

— Reply to this email directly, view it on GitHubhttps://github.com/duckdblabs/duckplyr/issues/168#issuecomment-2120822504, or unsubscribehttps://github.com/notifications/unsubscribe-auth/APJUET37UGV7JJ4GWTEVKZLZDISNNAVCNFSM6AAAAABHZJ4TVOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRQHAZDENJQGQ. You are receiving this because you were mentioned.Message ID: @.**@.>> Notice:

The information contained in this email message and any attached files may be confidential information, and may also be the subject of legal professional privilege. If you are not the intended recipient, any use, disclosure or copying of this email is unauthorised. If you received this email in error, please notify the sender by contacting the department's switchboard on 1300 566 046 during business hours (8:30am - 5pm Canberra time) and delete all copies of this transmission together with any attachments.

krlmlr commented 2 weeks ago

Done now.