etiennebacher / tidypolars

Get the power of polars with the syntax of the tidyverse
https://tidypolars.etiennebacher.com

Export functions to read/scan/write data #111

Open etiennebacher opened 2 months ago

etiennebacher commented 2 months ago

So far I only exported sink_* functions because they don't risk namespace collision with other packages, while exporting write_parquet() or read_parquet() would conflict with arrow for example.

However, some users are not aware of pl$read_parquet() and pl$scan_parquet(), so they use arrow::read_parquet() followed by as_polars_df(), which is not efficient at all. The goal of tidypolars is to replace the somewhat confusing (to R users) syntax of polars so that they don't have to deal with pl$, for instance. Therefore, I shouldn't expect them to use pl$scan_parquet().

The easy solution would be to add the "_polars" suffix for read/write functions (and potentially sink and scan for consistency?), so I would export read_parquet_polars() for instance. duckplyr has duckplyr_df_from_parquet(), so one option would be to export polars_df_from_parquet() and polars_lf_from_parquet() instead of read and scan.

Edit: not a big fan of polars_lf_from_parquet() because I like seeing all the options in the autocompletion when I type "write"

eitsupi commented 2 months ago

Related issue: apache/arrow#38456

Personally, I think it would be fine to have more R-like style functions like read_parquet_polars(path, ..., as_data_frame = FALSE) in the polars package.

This would be similar to, for example, Python Polars having something like the polars.DataFrame.pipe method to make method chaining work in Python.

etiennebacher commented 2 months ago

Personally, I think it would be fine to have more R-like style functions like read_parquet_polars(path, ..., as_data_frame = FALSE) in the polars package.

Why should it be in polars? There are already functions to import and export data there so I don't see why we should duplicate those

eitsupi commented 2 months ago

Why should it be in polars? There are already functions to import and export data there so I don't see why we should duplicate those

Of course, it doesn't have to be there, but this kind of pure syntactic sugar does exist in Python Polars.

Also, as for write_*, I think the incompatibility between the pipe |> and the $ operator reinforces the need for it to exist as a function. For example, without such a function we have to write something like pl$DataFrame(...)$some_methods(...) |> some_function(...) |> (\(x) x$write_parquet(...))()
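
To make the pipe issue concrete, here is a minimal base-R sketch. It does not use polars: make_obj() and write_obj() are hypothetical stand-ins for an object with a $ method and for an exported wrapper function, and the file format is plain text rather than Parquet.

```r
# Hypothetical stand-in for an object whose methods are reached with `$`
# (this is NOT a polars DataFrame, only an illustration).
make_obj <- function(data) {
  list(
    data = data,
    write = function(path) {
      writeLines(as.character(data), path)
      invisible(path)
    }
  )
}

# Hypothetical wrapper: exposes the `$` method as a regular function.
write_obj <- function(x, path) x$write(path)

path1 <- tempfile()
path2 <- tempfile()

# Without a wrapper, the method call has to be hidden in an anonymous function.
make_obj(1:3) |> (\(x) x$write(path1))()

# With a wrapper, the call reads like any other pipeline step.
make_obj(1:3) |> write_obj(path2)

# Both approaches write the same content.
same_output <- identical(readLines(path1), readLines(path2))
```

The wrapper adds nothing functionally; it only lets the write step sit at the end of a |> pipeline without the anonymous-function workaround.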

ginolhac commented 2 months ago

the incompatibility of the pipe |> and the $ operator reinforces the need for it to exist as a function.

Of note, in R4.3 (and probably 4.2, I am not sure) the native placeholder _ works with the $

> women |> _$weight
 [1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164

etiennebacher commented 2 months ago

If we introduce these kinds of functions in polars itself, then we'd have two kinds of syntax for the same thing, e.g. pl$read_parquet() and read_parquet_polars(). Wouldn't that lead to confusion, as in the arrow issue you linked above?

eitsupi commented 2 months ago

Of note, in R4.3 (and probably 4.2, I am not sure) the native placeholder _ works with the $

I don't think that applies here: x |> _$foo() is not allowed.

eitsupi commented 2 months ago

Wouldn't that lead to confusion, similarly as in the arrow issue you linked above?

The problem with the arrow package is that the function names are inconsistent. In other words, there are only read_parquet and read_csv_arrow instead of read_csv and read_parquet_arrow.