fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

Can't read 0 columns of a fst file #255

Closed Kodiologist closed 1 year ago

Kodiologist commented 3 years ago
> read.fst(..., columns = character())
Error in res$resTable[[1]] : subscript out of bounds

If you're wondering why I'd want to do this, it's that I just wanted to compute how many rows are in a large file as fast as possible.

riccardoporreca commented 3 years ago

@Kodiologist, if that is your only reason, I guess you can simply use fst::metadata_fst() (you can try it on the example code):

fst::metadata_fst(fst_file)$nrOfRows

The documentation linked above is not too explicit about this, but you can check the structure of the classed list returned by metadata_fst() via

str(fst::metadata_fst(fst_file))

Alternatively, you can also use fst::fst()

ft <- fst::fst(fst_file)
nrow(ft)

which is however an indirect method based on fst::metadata_fst() and may add some (minor) overhead.

On the other hand, it might be good if fst::read_fst() handled a zero-length columns argument and consistently returned a data.frame with the expected number of rows and 0 columns.
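
A minimal sketch of how that behaviour could be approximated on the user side, assuming the fst package is installed. The name read_fst_cols is hypothetical and not part of fst; it takes the row count from metadata_fst()$nrOfRows as shown above and builds the empty data.frame with base R only.

```r
# Hypothetical wrapper (NOT part of fst): behaves like fst::read_fst(), except
# that an empty `columns` vector yields an n-row, zero-column data.frame.
read_fst_cols <- function(path, columns = NULL) {
  if (!is.null(columns) && length(columns) == 0L) {
    # Row count comes from the file metadata, so no column data is read
    n <- as.integer(fst::metadata_fst(path)$nrOfRows)
    return(structure(
      list(),
      names = character(),
      row.names = .set_row_names(n),  # compact row names, base R
      class = "data.frame"
    ))
  }
  fst::read_fst(path, columns = columns)
}

tmp_file <- tempfile(fileext = ".fst")
fst::write_fst(data.frame(X = 1:10), tmp_file)

empty <- read_fst_cols(tmp_file, character())
nrow(empty)
#> [1] 10
ncol(empty)
#> [1] 0
```

Passing columns = NULL (the default) or a non-empty vector simply falls through to fst::read_fst().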

Kodiologist commented 3 years ago

Oh, I'd neglected the metadata_fst and fst functions. Thanks. In that case, I guess there's no real need to make this work, although the error message could probably be better.

MarcusKlik commented 3 years ago

Hi @Kodiologist, thanks for your question! As @riccardoporreca shows in his comment, most metadata can be retrieved from a fst file by applying the usual functions on the corresponding fst table object:

library(magrittr)  # provides the %>% pipe used below

tmp_file <- tempfile(fileext = ".fst")

# write some data to a fst file
data.frame(X = 1:10) %>%
  fst::write_fst(tmp_file)

# get a reference to the fst file store
ft <- fst::fst(tmp_file)

# the number of rows
nrow(ft)
#> [1] 10

# or number of columns
ncol(ft)
#> [1] 1

# column names
colnames(ft)
#> [1] "X"

# row names
rownames(ft)
#>  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

The dev version now returns a zero-column table:

fst::read_fst(tmp_file, character())
#> data frame with 0 columns and 0 rows

Would you prefer this function to return a zero-column, 10-row table, as subsetting a data.frame does?

data.frame(X = 1:10)[, character()]
#> data frame with 0 columns and 10 rows

(the behavior of data.frame, data.table and tibble is not very consistent in this case)

data.frame(X = 1:10)[, character()]
#> data frame with 0 columns and 10 rows

tibble::tibble(X = 1:10)[, character()]
#> # A tibble: 10 x 0

data.table::data.table(X = 1:10)[, .()]
#> Null data.table (0 rows and 0 cols)

Kodiologist commented 3 years ago

Thanks for the tips. My impression is that returning a 0-column, n-row data frame (or data table) is most logical, because it's consistent with the usual case of selecting nonzero columns.
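
To illustrate the consistency argument with base R only (a small sketch, the data frame and column names are made up for the example): the row count of a data.frame subset does not depend on how many columns are selected, including zero.

```r
df <- data.frame(X = 1:10, Y = letters[1:10])

# Select two, one and zero columns; drop = FALSE keeps the result a data.frame
selections <- list(c("X", "Y"), "X", character())
rows_kept <- vapply(
  selections,
  function(cols) nrow(df[, cols, drop = FALSE]),
  integer(1)
)
rows_kept
#> [1] 10 10 10
```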

MarcusKlik commented 1 year ago

Hi @Kodiologist, like with data.table, fst now returns a 0 by 0 table when an empty column vector is selected:

tmp_file <- tempfile(fileext = ".fst")

# write sample fst file
data.frame(
  X = sample(1:100, replace = TRUE)
) |>
  fst::write_fst(tmp_file)

fst::read_fst(tmp_file, character(0))
#> data frame with 0 columns and 0 rows

Hope that works for you. If there is a use case where reporting the number of rows is important, please reopen this issue. Thanks!