giocomai opened 7 months ago
Here's a more revealing reprex. By creating a data frame with strings of growing length, the issue is much clearer.
Conditions for reproducing:

- the dataset is written to disk with `write_dataset()`
- it is read back with `open_dataset()`
- it is filtered with `stringr::str_detect()`

If all these conditions are met, then arrow returns an empty data frame.
If the dataset is stored partitioned, partitions where both conditions are met return zero rows, while other partitions return data as expected.
With non-ASCII characters, e.g. Cyrillic letters such as б, г, д, etc., the issue emerges with a text size of just over 2000 characters (both here and above, I'd suspect the actual limits are the classic 4096 and 2048).
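If the limit is counted in UTF-8 bytes rather than characters, the two thresholds line up: Cyrillic letters encode to two bytes each, so just over 2048 such characters cross 4096 bytes. A minimal check of that hypothesis in base R (assuming a UTF-8 locale):

```r
# Each Cyrillic letter ("б" = U+0431) encodes to 2 bytes in UTF-8, so a
# string of just over 2048 such characters crosses the 4096-byte mark,
# the same point where the ASCII reprex starts to fail.
ascii_text    <- strrep("a", 4097)
cyrillic_text <- strrep("\u0431", 2049)

nchar(ascii_text)                     # 4097 characters
nchar(ascii_text, type = "bytes")     # 4097 bytes
nchar(cyrillic_text)                  # 2049 characters
nchar(cyrillic_text, type = "bytes")  # 4098 bytes
```

This would be consistent with a byte-based buffer boundary rather than a character-count limit, though that remains a guess until the underlying cause is found.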
A separate issue with non-ASCII characters may lead to inconsistencies in testing: arrow parses regular expressions with RE2 (if I understand correctly), which behaves differently from standard `stringr::str_detect()`, so the same code with and without arrow may give different results. Mentioning it here just in case it is somehow related.
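One concrete example of the engine difference (a sketch, not necessarily tied to this bug): RE2 deliberately does not support lookaround, while the ICU engine behind stringr does, so a pattern using lookbehind behaves differently depending on whether it is evaluated in R or pushed down to Arrow:

```r
library(stringr)

x <- c("ab", "b")

# ICU (stringr's default engine) supports lookbehind: "b" matches only
# when preceded by "a".
str_detect(x, "(?<=a)b")
#> [1]  TRUE FALSE

# RE2, which Arrow uses for its regex kernels, has no lookbehind or
# lookahead support, so the same pattern cannot be evaluated identically
# inside an arrow query.
```

Patterns restricted to RE2-compatible syntax should behave the same in both backends; lookaround is one of the known divergences.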
library("tibble")
library("dplyr")
library("stringr")
library("arrow")
set.seed(1)
data_df <- tibble::tibble(size = 2:10000) |>
  dplyr::mutate(text = paste(c("a",
                               sample(x = c(letters, LETTERS),
                                      size = 10000,
                                      replace = TRUE)),
                             collapse = "")) |>
  dplyr::group_by(size) |>
  dplyr::mutate(text = stringr::str_trunc(text, width = size, ellipsis = "")) |>
  dplyr::mutate(category = round(size / 10)) |>
  dplyr::ungroup() |>
  dplyr::group_by(category)
data_df[["text"]][sample(c(TRUE, FALSE), size = nrow(data_df), prob = c(0.1, 0.9), replace = TRUE)] <- ""
### Store in a temp folder
test_arrow_path <- file.path(tempdir(), "test_arrow")
write_dataset(dataset = data_df,
              path = test_arrow_path)
### Read from temp folder
arrow_from_disk <- open_dataset(test_arrow_path)
### Read from memory
arrow_from_memory <- arrow_table(data_df)
filtered_from_disk_df <- arrow_from_disk |>
  dplyr::filter(stringr::str_detect(text, "a")) |>
  dplyr::collect()
filtered_from_disk_df
#> # A tibble: 5,652 × 3
#> size text category
#> <int> <chr> <int>
#> 1 995 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 100
#> 2 996 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 100
#> 3 998 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 100
#> 4 1000 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 100
#> 5 1001 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 100
#> 6 1002 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 100
#> 7 1003 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 100
#> 8 1004 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 100
#> 9 1005 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 100
#> 10 95 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 10
#> # ℹ 5,642 more rows
filtered_from_memory_df <- arrow_from_memory |>
  dplyr::filter(stringr::str_detect(text, "a")) |>
  dplyr::collect()
filtered_from_memory_df
#> # A tibble: 9,000 × 3
#> # Groups: category [1,001]
#> size text category
#> <int> <chr> <dbl>
#> 1 2 ad 0
#> 2 3 adM 0
#> 3 4 adMa 0
#> 4 5 adMaH 0
#> 5 6 adMaHw 1
#> 6 7 adMaHwQ 1
#> 7 8 adMaHwQn 1
#> 8 9 adMaHwQnr 1
#> 9 10 adMaHwQnrY 1
#> 10 11 adMaHwQnrYG 1
#> # ℹ 8,990 more rows
dplyr::anti_join(filtered_from_memory_df,
                 filtered_from_disk_df,
                 by = "size") |>
  dplyr::arrange(size)
#> # A tibble: 3,348 × 3
#> # Groups: category [392]
#> size text category
#> <int> <chr> <dbl>
#> 1 4095 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 410
#> 2 4097 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 410
#> 3 4098 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 410
#> 4 4099 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 410
#> 5 4100 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 410
#> 6 4101 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 410
#> 7 4102 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 410
#> 8 4103 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 410
#> 9 4104 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 410
#> 10 4105 adMaHwQnrYGuuPTjgiouKOyTKKHPyRoGtIfjPLUtBtRwfNRyfMYPfxFnbSrvn… 410
#> # ℹ 3,338 more rows
nrow(filtered_from_disk_df) == nrow(filtered_from_memory_df)
Created on 2024-04-15 with reprex v2.1.0
Hi @giocomai, thank you for the report and all the investigation. I don't have a great guess as to what's happening yet but I'll take a look this week.
I did a quick pass with the reprex (thank you!) and confirmed that even with an identical query plan (except the source node), a different number of rows is selected. The next step would be to reproduce it in Python, since the people who know how to fix it are better at debugging there (I may get to it in the next few minutes, but I'm leaving this here in case I don't!).
Describe the bug, including details regarding any error messages, version, and platform.
I've spent a few hours trying to pinpoint exactly when this issue appears. The reprex below should make this clear.
The type of dataset creating this issue is a data frame that includes rows with empty strings (`""`; `NA` does not seem to be an issue); if such rows are removed, then the issue does not emerge.

The issue appears only if the dataset is stored with `write_dataset()` and is then read with `open_dataset()`. If it is created with `arrow_table()` in memory, the issue does not appear. If the dataset is stored partitioned (grouped before writing), the issue is apparently limited to groups where an empty string is present.

Under these conditions, the filter returns an incomplete set of rows. If the same `arrow` connection is collected before filtering, then it returns the expected result. Even though it returns an incomplete set of rows, it throws no errors or warnings: the user will not notice unless they conduct additional tests.
In my real-world case, this happens with textual corpora; it seems to happen more frequently (i.e. even when strings are shorter) with corpora in non-Latin scripts, but I haven't found the exact threshold.
Tested with both the current version on CRAN and the current development version; details in the reprex below.
Created on 2024-04-12 with reprex v2.1.0
Component(s)
R