apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] `Invalid metadata$r` warning when feeding parquet file into dplyr #40423

Open rkrug opened 7 months ago

rkrug commented 7 months ago

Describe the bug, including details regarding any error messages, version, and platform.

Hi, I have a Parquet file (https://www.dropbox.com/scl/fi/lsg2xxe565dfa88e9plo4/part-0.parquet?rlkey=3w2sjc6xewaz9lxd4cwcvf65b&dl=0) that triggers an `Invalid metadata$r` warning. Everything seems to work fine, but the warning is annoying.

The file was written from R as one partition of a partitioned dataset, and the warning occurs with other files as well. The code and the link to the file are at the end.

> devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.3 (2024-02-29)
 os       macOS Sonoma 14.4
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Zurich
 date     2024-03-08
 pandoc   3.1.12.2 @ /opt/homebrew/bin/pandoc

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package     * version  date (UTC) lib source
 arrow       * 14.0.0.2 2023-12-02 [1] CRAN (R 4.3.1)
 assertthat    0.2.1    2019-03-21 [1] CRAN (R 4.3.0)
 bit           4.0.5    2022-11-15 [1] CRAN (R 4.3.0)
 bit64         4.0.5    2020-08-30 [1] CRAN (R 4.3.0)
 cachem        1.0.8    2023-05-01 [1] CRAN (R 4.3.0)
 cli           3.6.2    2023-12-11 [1] CRAN (R 4.3.1)
 devtools      2.4.5    2022-10-11 [1] CRAN (R 4.3.0)
 digest        0.6.34   2024-01-11 [1] CRAN (R 4.3.1)
 ellipsis      0.3.2    2021-04-29 [1] CRAN (R 4.3.0)
 fastmap       1.1.1    2023-02-24 [1] CRAN (R 4.3.0)
 fs            1.6.3    2023-07-20 [1] CRAN (R 4.3.0)
 glue          1.7.0    2024-01-09 [1] CRAN (R 4.3.1)
 htmltools     0.5.7    2023-11-03 [1] CRAN (R 4.3.1)
 htmlwidgets   1.6.4    2023-12-06 [1] CRAN (R 4.3.1)
 httpuv        1.6.14   2024-01-26 [1] CRAN (R 4.3.1)
 jsonlite      1.8.8    2023-12-04 [1] CRAN (R 4.3.1)
 later         1.3.2    2023-12-06 [1] CRAN (R 4.3.1)
 lifecycle     1.0.4    2023-11-07 [1] CRAN (R 4.3.1)
 magrittr      2.0.3    2022-03-30 [1] CRAN (R 4.3.0)
 memoise       2.0.1    2021-11-26 [1] CRAN (R 4.3.0)
 mime          0.12     2021-09-28 [1] CRAN (R 4.3.0)
 miniUI        0.1.1.1  2018-05-18 [1] CRAN (R 4.3.0)
 pkgbuild      1.4.3    2023-12-10 [1] CRAN (R 4.3.1)
 pkgload       1.3.4    2024-01-16 [1] CRAN (R 4.3.1)
 profvis       0.3.8    2023-05-02 [1] CRAN (R 4.3.0)
 promises      1.2.1    2023-08-10 [1] CRAN (R 4.3.0)
 purrr         1.0.2    2023-08-10 [1] CRAN (R 4.3.0)
 R6            2.5.1    2021-08-19 [1] CRAN (R 4.3.0)
 Rcpp          1.0.12   2024-01-09 [1] CRAN (R 4.3.1)
 remotes       2.4.2.1  2023-07-18 [1] CRAN (R 4.3.0)
 rlang         1.1.3    2024-01-10 [1] CRAN (R 4.3.1)
 sessioninfo   1.2.2    2021-12-06 [1] CRAN (R 4.3.0)
 shiny         1.8.0    2023-11-17 [1] CRAN (R 4.3.1)
 stringi       1.8.3    2023-12-11 [1] CRAN (R 4.3.1)
 stringr       1.5.1    2023-11-14 [1] CRAN (R 4.3.1)
 tidyselect    1.2.0    2022-10-10 [1] CRAN (R 4.3.0)
 urlchecker    1.0.1    2021-11-30 [1] CRAN (R 4.3.0)
 usethis       2.2.3    2024-02-19 [1] CRAN (R 4.3.1)
 vctrs         0.6.5    2023-12-01 [1] CRAN (R 4.3.1)
 xtable        1.8-4    2019-04-21 [1] CRAN (R 4.3.0)

 [1] /Users/rainerkrug/R/library/aarch64-apple-darwin20/4.3
 [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

─────────────
arrow::write_dataset(
  data,
  path = arrow_dir,
  partitioning = "publication_year",
  format = "parquet",
  existing_data_behavior = "overwrite"
)
> arrow::open_dataset("./data/corpus/publication_year=1500/part-0.parquet") |> dplyr::group_by(author_abbr)
FileSystemDataset (query)
id: string
author: string
ab: string
doi: string
topics: string
author_abbr: string

* Grouped by author_abbr
See $.data for the source Arrow object
Warning message:
Invalid metadata$r 
> 

The Parquet file can be downloaded from: https://www.dropbox.com/scl/fi/lsg2xxe565dfa88e9plo4/part-0.parquet?rlkey=3w2sjc6xewaz9lxd4cwcvf65b&dl=0
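
Since the results themselves look fine, a possible stopgap is to muffle only this one warning while leaving all others visible. The following is a minimal sketch in base R, not an arrow feature: `muffle_invalid_r_metadata()` is a hypothetical helper name, and `fake_query()` is a stand-in so the example runs without the Parquet file.

```r
# Hypothetical helper (not part of arrow): evaluates `expr` and muffles only
# warnings whose message contains the literal text "Invalid metadata$r".
# Any other warning is left alone and propagates as usual.
muffle_invalid_r_metadata <- function(expr) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      if (grepl("Invalid metadata$r", conditionMessage(w), fixed = TRUE)) {
        invokeRestart("muffleWarning")
      }
    }
  )
}

# Stand-in for the real arrow call, so the sketch is self-contained.
fake_query <- function() {
  warning("Invalid metadata$r")
  "query result"
}

result <- muffle_invalid_r_metadata(fake_query())
```

In a real session the wrapped expression would be whichever call actually emits the warning. That may be the `open_dataset()` pipeline itself or the step that prints or collects the query; I have not checked which. Note that this only hides the message and does not repair the metadata.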

Component(s)

R

amoeba commented 7 months ago

Thanks @rkrug, I was able to reproduce the error with the data you provided. I'll have a look soon and report back.

lucasmation commented 2 weeks ago

Any updates on this?

I am having a possibly related problem when saving a data.table to a Parquet dataset:

library(tidyverse)
library(data.table)
library(arrow)  # version 17.0.0.1
options(arrow.use_dt = TRUE)

# `d` is the full data.table (270m rows)

# 1m sample
d[1:(10^6)] %>% write_dataset('C:/test1')
t1 <- open_dataset('C:/test1') %>% collect()
class(t1)
# "data.table" "data.frame"

# 10m sample
d[1:(10*10^6)] %>% write_dataset('C:/test2')
t2 <- open_dataset('C:/test2') %>% collect()
class(t2)
# "data.table" "data.frame"

# 20m sample
d[1:(20*10^6)] %>% write_dataset('C:/test3')
t3 <- open_dataset('C:/test3') %>% collect()
class(t3)
# "data.table" "data.frame"

# Full data (270m obs)
d %>% write_dataset('C:/test4')
# Warnings
# 1: Invalid metadata$r
# 2: Invalid metadata$r
# 3: Invalid metadata$r
t4 <- open_dataset('C:/test4') %>% collect()
class(t4)
# "tbl_df"     "tbl"        "data.frame"

The weird thing is that when I feed in smaller samples of the data, the Parquet files are saved without warnings and `open_dataset() %>% collect()` returns a data.table as expected.

However, when I feed in the full dataset (270m rows), there are three `Invalid metadata$r` warnings and `open_dataset() %>% collect()` returns a `"tbl_df" "tbl" "data.frame"` object.
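
In the meantime, a workaround on the reading side might be to convert the collected result back to a data.table explicitly whenever the class was not restored. This is only a sketch: `as_dt_if_needed()` is a hypothetical helper name, and `t4_like` is a small stand-in for the tibble-classed result, so no arrow call is involved.

```r
library(data.table)

# Hypothetical helper: return `x` unchanged if it is already a data.table,
# otherwise convert it with data.table::as.data.table() (which copies).
as_dt_if_needed <- function(x) {
  if (data.table::is.data.table(x)) x else data.table::as.data.table(x)
}

# Small stand-in for the tibble-classed object returned for the full data.
t4_like <- structure(
  data.frame(id = 1:3, value = c("a", "b", "c")),
  class = c("tbl_df", "tbl", "data.frame")
)

t4_fixed <- as_dt_if_needed(t4_like)
class(t4_fixed)
```

With 270m rows the copy made by `as.data.table()` could be expensive. `data.table::setDT()` converts by reference instead and may be the better choice there, though I have not tried it on the real data.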