REditorSupport / vscode-R

R Extension for Visual Studio Code
https://marketplace.visualstudio.com/items?itemName=REditorSupport.r
MIT License

View Apache Arrow Table #944

Closed eitsupi closed 2 years ago

eitsupi commented 2 years ago

If we print an Arrow Table object, we can see its columns and types, as follows.

arrow::arrow_table(mtcars)
#> Table
#> 32 rows x 11 columns
#> $mpg <double>
#> $cyl <double>
#> $disp <double>
#> $hp <double>
#> $drat <double>
#> $wt <double>
#> $qsec <double>
#> $vs <double>
#> $am <double>
#> $gear <double>
#> $carb <double>
#>
#> See $metadata for additional Schema metadata

Created on 2022-01-16 by the reprex package (v2.0.1)

However, if we run View(arrow::arrow_table(mtcars)) in VS Code, the following view opens. The structure of the Table cannot be made out from this view.

image

I think this is common behavior for R6 class objects, and I wonder whether it would be better to show the output of print() when an R6 object is opened in the viewer. Or do you think it would be worth implementing a feature specifically for Arrow Tables?
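
A minimal sketch of the first idea, assuming the viewer could display plain text: capture what print() would show and hand that text over instead of the object's internals. The helper name `printed_text` is hypothetical and not part of vscode-R; only base R functions are used.

```r
# Hypothetical helper (not part of vscode-R): capture what print() would
# show for an object so that it can be displayed as plain text.
printed_text <- function(obj) {
  # capture.output() evaluates print(obj) and returns the printed lines
  # as a character vector instead of writing them to the console.
  lines <- utils::capture.output(print(obj))
  paste(lines, collapse = "\n")
}

# Usage with a base R object. An R6 object such as an Arrow Table would be
# expected to go through its own print method in the same way.
txt <- printed_text(head(mtcars, 3))
cat(txt, "\n")
```

Because this only captures the printed output, it never materializes the rows of a large table; the trade-off is that the result is static text rather than an interactive grid.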

renkun-ken commented 2 years ago

Do you think the behavior should be:

if (inherits(obj, "ArrowTabular")) {
  obj <- as.data.frame(obj)
}
.vsc.view(obj)

eitsupi commented 2 years ago

Thanks for the quick reply. Tables handled by Apache Arrow can have a very large number of rows, so with the current implementation I don't think it is a good idea to convert every row to a data.frame before displaying it.

As an example, I tried reading the Parquet file (6001215 rows x 16 columns) used in the following post and displaying it after converting it to a data.frame. It took so long that I gave up waiting.

https://arrow.apache.org/blog/2021/12/03/arrow-duckdb/
https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet

Ideally, I feel it would be nice to have a display that shows the first few rows and indicates that more rows exist, as dbplyr does.

library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
arrow::arrow_table(mtcars) |> arrow::to_duckdb()
#> # Source:   table<arrow_001> [?? x 11]
#> # Database: duckdb_connection
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # … with more rows

Created on 2022-01-16 by the reprex package (v2.0.1)

The following is faster because it reads only enough data to display one page, but the drawback is that the viewer no longer shows that the table may have more than 100 rows.

if (inherits(obj, "ArrowTabular")) {
  obj <- as.data.frame(utils::head(obj, 100))
}
.vsc.view(obj)

Compared with these approaches, I thought it would be better to simply display the print() output, since it is fast and does not misrepresent the size of the table.
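
One way to keep the head() approach without losing the row count is to carry the total alongside the preview. The following is only a sketch: `preview_rows` is a hypothetical helper, and it assumes that nrow() and utils::head() work on the object (they do for a data.frame, and are assumed here to work for Arrow Tables as well).

```r
# Hypothetical helper: return the first `n` rows as a data.frame together
# with the total row count, so that a viewer could indicate truncation.
preview_rows <- function(obj, n = 100L) {
  total <- nrow(obj)  # assumed to be cheap for Arrow Tables
  shown <- as.data.frame(utils::head(obj, n))
  list(
    data = shown,
    total_rows = total,
    truncated = total > nrow(shown),
    label = sprintf("showing %d of %d rows", nrow(shown), total)
  )
}

# Usage with a plain data.frame standing in for a large table.
p <- preview_rows(mtcars, n = 10L)
p$label
```

The label could then be shown as the viewer title or in a footer, which would address the concern that a truncated preview looks like the whole table.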

eitsupi commented 2 years ago

Since displaying a huge table takes time even for a data.frame, it might be useful to add an option that limits the number of rows shown in the viewer (passed to the n argument of utils::head()).
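
As a rough illustration of such an option, the sketch below reads a row limit from options() and applies it with utils::head(). The option name `example.view_row_limit` and the helper `limit_rows` are made up for this example and are not claimed to exist in vscode-R.

```r
# `example.view_row_limit` is a made-up option name for illustration only.
limit_rows <- function(obj, limit = getOption("example.view_row_limit", Inf)) {
  # Only truncate when a finite limit is set and the object exceeds it.
  if (is.finite(limit) && nrow(obj) > limit) {
    utils::head(obj, limit)
  } else {
    obj
  }
}

# With the option set, only the first 5 rows are kept.
old <- options(example.view_row_limit = 5)
small <- limit_rows(mtcars)

# Restoring the previous (unset) option leaves the object untouched.
options(old)
full <- limit_rows(mtcars)
```

Defaulting to no limit keeps the current behavior for users who never set the option.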