Closed cgostic closed 11 months ago
:wave: thanks for reporting!
I have unsuccessfully tried to reproduce the issue reported in the stack overflow question on RStudio Connect.
Is it possible to trim your example down into a more minimal working example? I think there might be a few typos in there, as well as some additional libraries that may not be needed. I'm not sure yet that this is an arrow issue, so let's try to narrow that down. I'm also wondering if you've tried this with public-facing data: can you reproduce with the taxi data?
library(arrow)
bucket <- s3_bucket("voltrondata-labs-datasets/nyc-taxi/year=2019/month=6")
open_dataset(bucket)
@boshek Thanks so much for the reply! I appreciate your time!
I've worked out a reprex I can share that still yields the error.
Attached Materials:
Steps to recreate:
1. Write ozone data to a private AWS bucket
library(arrow)
library(aws.s3)
# Using attached dataset
ozone <- read.csv(unz('ozone_data_2022.zip', 'ozone_data_2022.csv'))
bname <- '<your-bucket>'
db_uri <- paste0('s3://',bname)
write_dataset(ozone,
              path = db_uri,
              format = 'arrow',
              partitioning = c('aqs_sitecode', 'sample_duration',
                               'parameter', 'poc'))
2. Deploy app
This app should run fine locally.
After deployment ( to shinyapps.io or RStudio Connect) I've observed the following:
# See attached lockfile for package versions
library(shiny)
library(dplyr)
library(ggplot2)
library(htmltools)
library(arrow)
library(aws.s3)
bname <- 'bucket'
db_uri <- paste0('s3://',bname)
ds <- arrow::open_dataset(db_uri, format = 'arrow', unify_schemas = FALSE)
ui <- fluidPage(
  fluidRow(column(3,
                  selectInput('sitecode',
                              label = 'Select Site',
                              choices = unique(metadata$aqs_sitecode),
                              selected = NULL)),
           column(2,
                  div(style = 'padding-top:26px',
                      actionButton('go', 'Create Plot', width = '100%')))),
  fluidRow(plotOutput('TS')),
  fluidRow(textOutput('connstr'))
)
server <- function(input, output, session) {
  selected_site <- reactiveValues(sitecode = NULL)
  observeEvent(input$go, {
    selected_site$sitecode <- input$sitecode
  })
  plot_data <- reactive({
    req(selected_site$sitecode)
    s <- as.integer(selected_site$sitecode)
    ds %>%
      filter(aqs_sitecode == s,
             parameter == 'Ozone',
             sample_duration == '1 HOUR',
             poc == 1) %>%
      select(date_time2, sample_measurement) %>%
      collect()
  })
  output$TS <- renderPlot({
    req(selected_site$sitecode,
        is.data.frame(plot_data()))
    ggplot(plot_data()) +
      geom_line(aes(date_time2, sample_measurement)) +
      scale_x_datetime() +
      # labs() takes `title`, not `main` (a base-graphics argument)
      labs(x = 'DateTime', y = 'Ozone in ppb', title = selected_site$sitecode)
  })
  log_db <- reactivePoll(30000, session,  # reactivePoll inside the server
                         # Check for maximum month
                         checkFunc = function() {
                           r_num <- ds %>%
                             filter(aqs_sitecode == 51190007,
                                    sample_duration == '1 HOUR',
                                    parameter == 'Ozone',
                                    poc == 1) %>%
                             collect() %>%
                             nrow()
                           print(r_num)
                           return(as.character(r_num))
                         },
                         valueFunc = function() {
                           print('pinged')
                           return(" ")
                         })
  output$connstr <- renderText(log_db())
}
shinyApp(ui = ui, server = server)
3. In-app workflow to create error
Hello and happy new year! Wondering if this issue has/will be revisited with the updated reprex above. Identifying whether the problem is with Arrow or a different software would really help us decide if we need to shift our data structure in the short term to meet deadlines.
Thank you so much for your time and attention! It is really appreciated! @boshek @rok
I can reproduce this issue in RStudio Connect.
This is the error that I see (which is different from what you are reporting):
I tried a few troubleshooting steps, including moving the collect() call around to see where exactly the issue was. The only thing that "worked" was to pull all the data in right away, which obviously isn't helpful.
Two additional questions:
Thanks so much for checking in!
I do see that error intermittently as well, though I more frequently get the one listed in the logs. From my observations, it seems that the sigpipe issue occurs in the first use of the app after publishing, but it usually shifts to the "stack imbalance" after continued use.
(1) I can definitely check writing the partitioned data as .csvs and get back to you! However, we're really hoping to maximize querying efficiency offered by feather files.
(2) I have not tried this on a Linux machine. Our RStudio Connect is hosted on an EC2, and shinyapps is also on AWS. These are the only two platforms I have access to, unfortunately.
The feather files are rather large, so for something like this, where so much is going across the wire, CSV or parquet can actually be quicker. That said, it should still work regardless of file format.
microbenchmark::microbenchmark(
  arrow = open_dataset(arrow_bucket, format = 'arrow', unify_schemas = FALSE) %>%
    filter(parameter == 'Ozone',
           sample_duration == '1 HOUR',
           poc == 1) %>%
    collect(),
  csv = open_dataset(csv_bucket, format = 'csv', unify_schemas = FALSE) %>%
    filter(parameter == 'Ozone',
           sample_duration == '1 HOUR',
           poc == 1) %>%
    collect(),
  parquet = open_dataset(parquet_bucket, format = 'parquet', unify_schemas = FALSE) %>%
    filter(parameter == 'Ozone',
           sample_duration == '1 HOUR',
           poc == 1) %>%
    collect(),
  times = 3L
)
Unit: seconds
expr min lq mean median uq max neval
arrow 16.984894 17.353667 17.822625 17.722439 18.241490 18.760542 3
csv 10.173464 10.474577 10.876690 10.775691 11.228303 11.680915 3
parquet 8.599491 8.630459 8.705268 8.661427 8.758156 8.854885 3
The reason I suggested Linux was that RStudio Connect and shinyapps.io both run on Linux machines, so maybe this isn't related to RStudio Connect at all. That would be good to test. I will see what I can do.
Thanks for that insight, I misunderstood the memory requirements of .csv vs. feather!
I've tried your suggestion to write the partitioned data as .csvs and still get the same segfault error. My logs now reference auto_deleter_background. This is output from httpuv, though it seems like it's a result of the preceding "memory not mapped" issue.
2023/01/10 9:23:03 AM: Running on host: <ip>
2023/01/10 9:23:03 AM: Linux distribution: Ubuntu 22.04.1 LTS (jammy)
2023/01/10 9:23:03 AM: Server version: 2022.09.0
2023/01/10 9:23:03 AM: LANG: C.UTF-8
2023/01/10 9:23:03 AM: Working directory: /opt/rstudio-connect/mnt/app
2023/01/10 9:23:03 AM: Running content using its packrat R library
2023/01/10 9:23:03 AM: Using Packrat dir /opt/rstudio-connect/mnt/app/packrat/lib/x86_64-pc-linux-gnu/4.2.1
2023/01/10 9:23:03 AM: R version: 4.2.1
2023/01/10 9:23:03 AM: shiny version: 1.7.4
2023/01/10 9:23:03 AM: httpuv version: 1.6.6
2023/01/10 9:23:03 AM: rmarkdown version: (none)
2023/01/10 9:23:03 AM: knitr version: (none)
2023/01/10 9:23:03 AM: jsonlite version: 1.8.4
2023/01/10 9:23:03 AM: RJSONIO version: (none)
2023/01/10 9:23:03 AM: htmltools version: 0.5.4
2023/01/10 9:23:03 AM: reticulate version: (none)
2023/01/10 9:23:03 AM: Using pandoc: /opt/rstudio-connect/ext/pandoc/2.16
2023/01/10 9:23:04 AM: Using Shiny bookmarking base directory /opt/rstudio-connect/mnt/bookmarks
2023/01/10 9:23:04 AM: Starting R with process ID: '375796'
...
2023/01/10 9:27:39 AM: *** caught segfault ***
2023/01/10 9:27:39 AM: address 0x60, cause 'memory not mapped'
2023/01/10 9:27:39 AM: An irrecoverable exception occurred. R is aborting now ...
2023/01/10 9:27:39 AM: Can't detect correct thread for auto_deleter_background.
2023/01/10 9:27:39 AM: Can't detect correct thread for auto_deleter_background.
2023/01/10 9:27:39 AM: *** caught segfault ***
2023/01/10 9:27:39 AM: address (nil), cause 'unknown'
EDIT: I had commented out the reactivePoll({}) to ping the datasource every 30 seconds. With this included, the issue is resolved in the reprex! Great suggestion.
I will try this solution on a larger scale and let you know if it's viable.
@boshek I re-wrote my full dataset to .csv and am still getting the same stack imbalance then segfault error as before.
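For reference, a minimal sketch of the CSV rewrite, assuming the same `ozone` data frame, bucket URI, and partitioning columns as the original reprex (only the `format` argument changes from the earlier write_dataset() call):

```r
library(arrow)

# Same write_dataset() call as in step 1 of the reprex, but with
# format = 'csv' so each partition leaf is written as a CSV file
# instead of an Arrow/feather file.
write_dataset(ozone,
              path = db_uri,
              format = 'csv',
              partitioning = c('aqs_sitecode', 'sample_duration',
                               'parameter', 'poc'))
```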
It seems that there's a size limitation when combined with idle time. The full version of this app displays data from 2016-2022 (300-400k rows per site), compared to only 2022 in the reprex (5-10k rows).
Above, you indicated that you tested .parquet files to no avail, is that correct? This would indicate that the size of the flat-file may not be so important, as I'd expect .parquet to perform even better than .csv.
I am able to recreate the error by altering the reprex app.R code to simulate a larger dataset (see below). As before, when updating the app constantly, there is no issue. When left idle for 1+ minutes, the app crashes as before.
Note: This is after adjusting the dataset to .csv format from .arrow format
Reprex updates:
1. Removed the site filter from the plot_data reactive so that each time a new site is chosen, the whole dataset is loaded (300k rows)
2. Added a call to subset() in ggplot()
plot_data <- reactive({
  req(selected_site$sitecode)
  s <- as.integer(selected_site$sitecode)
  ds %>%
    filter(#aqs_sitecode == s, #### Remove this filter ... load all 300k rows on each update (simulates larger data load)
           parameter == 'Ozone',
           sample_duration == '1 HOUR',
           poc == 1) %>%
    select(aqs_sitecode, date_time2, sample_measurement) %>%
    collect()
})
output$TS <- renderPlot({
  req(selected_site$sitecode,
      is.data.frame(plot_data()))
  #### Add call to subset() in ggplot()
  ggplot(subset(plot_data(), aqs_sitecode == selected_site$sitecode)) +
    geom_line(aes(date_time2, sample_measurement)) +
    scale_x_datetime() +
    labs(x = 'DateTime', y = 'Ozone in ppb', title = selected_site$sitecode)
})
Interesting. I sure would like to test this on a linux machine to confirm whether this is at all related to RStudio Connect.
@cgostic Thank you for posting some code and data to try and replicate. I was able to run into similar segfault issues without including the reactivePoll() piece. Most of the time, when we see segfault memory-related issues, it is not on the Connect side but on the application side, which makes me think it's related to the arrow::open_dataset() function call being outside the server.
Could you please try moving it into the server, inside the reactive statement? When I made the switch and redeployed, I have been unable to run into any of the memory errors. It seems having the dataset open and idle causes the issue. For reference, the code I am using is below:
# See attached lockfile for package versions
library(shiny)
library(dplyr)
library(ggplot2)
library(htmltools)
library(arrow)
library(aws.s3)
aqs_site_code_unique <- c(51190007L, 60658001L, 60731022L, 100032004L, 110010043L, 120110034L,
120573002L, 130890002L, 170314201L, 180970078L, 295100085L, 371190041L,
371830014L, 420030008L, 440071010L, 510870014L, 20900034L, 40191028L,
60270002L, 60850005L, 121290001L, 150030010L, 170191001L, 191630015L,
230090103L, 300490004L, 310550019L, 340130003L, 360551007L, 380150003L,
380171004L, 391351001L, 470090101L, 10730023L, 40139997L, 60371103L,
60670006L, 202090021L, 260810020L, 270031002L, 320030540L, 330150018L,
390350060L, 390610040L, 410510080L, 421010048L, 471570075L, 482011039L,
490353006L, 530330080L, 60190011L, 80310026L, 90050005L, 90090027L,
160010010L, 220330009L, 240230002L, 240330030L, 250250042L, 280490020L,
330115001L, 350010023L, 360810124L, 361010003L, 401431127L, 450790007L,
481410044L, 500070007L, 530090013L, 540390020L, 560210100L, 720210010L,
320310031L, 400019009L)
ui <- fluidPage(
  fluidRow(column(3,
                  selectInput('sitecode',
                              label = 'Select Site',
                              choices = aqs_site_code_unique,
                              selected = NULL)),
           column(2,
                  div(style = 'padding-top:26px',
                      actionButton('go', 'Create Plot', width = '100%')))),
  fluidRow(plotOutput('TS'))
)
server <- function(input, output, session) {
  selected_site <- reactiveValues(sitecode = NULL)
  observeEvent(input$go, {
    selected_site$sitecode <- input$sitecode
  })
  plot_data <- reactive({
    req(selected_site$sitecode)
    s <- as.integer(selected_site$sitecode)
    bname <- 'BUCKETNAMEHERE'
    db_uri <- paste0('s3://', bname)
    ds <- arrow::open_dataset(db_uri, format = 'arrow', unify_schemas = FALSE)
    ds %>%
      filter(parameter == 'Ozone',
             sample_duration == '1 HOUR',
             poc == 1) %>%
      select(aqs_sitecode, date_time2, sample_measurement) %>%
      collect()
  })
  output$TS <- renderPlot({
    req(selected_site$sitecode,
        is.data.frame(plot_data()))
    ggplot(subset(plot_data(), aqs_sitecode == selected_site$sitecode)) +
      geom_line(aes(date_time2, sample_measurement)) +
      scale_x_datetime() +
      labs(x = 'DateTime', y = 'Ozone in ppb', title = selected_site$sitecode)
  })
}
shinyApp(ui = ui, server = server)
@tnederlof that's a great thought! I had previously considered this option, but our full dataset takes a while (~8-10 s) to load via arrow::open_dataset() (possibly because of its size/number of partitions?). It would decrease usability if a user had to wait each time an input was changed. Am I understanding correctly that this would be the case?
However, this idea could be viable if we abandon loading the full dataset each time and instead use inputs to build a more specific URI, i.e. a subset of the dataset (e.g. calling open_dataset() on 's3://bucket/sitecode=xxxx/parameter=xxxx/...', or even just read_csv_arrow() on a URI pointing to the desired csv).
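A minimal sketch of that "targeted URI" idea, assuming hive-style partition directories named after the reprex's partition columns in the order they were written; the helper name and exact path layout are illustrative, not from the thread:

```r
# Hypothetical helper: build a URI pointing at a single partition subtree,
# so open_dataset() only has to discover a handful of files instead of
# scanning the whole bucket.
build_partition_uri <- function(bucket, sitecode, duration, parameter, poc) {
  sprintf('s3://%s/aqs_sitecode=%s/sample_duration=%s/parameter=%s/poc=%s',
          bucket, sitecode, duration, parameter, poc)
}

uri <- build_partition_uri('<your-bucket>', 51190007, '1 HOUR', 'Ozone', 1)
# ds <- arrow::open_dataset(uri, format = 'csv')  # then select()/collect() as before
```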
I'll do some testing on this today, and please let me know if you foresee any issues in the meantime.
I appreciate your time and input!
That's kind of surprising to me that it takes that long to open the dataset; I suspect all the partitioning is causing issues. I was able to replicate the issue you faced using the same partitioning structure (I just faked 24x the data with different intervals). Then I tried saving all of the data in a single parquet file (it's about 1 GB) and now it runs in <0.5 seconds instead of 8-9 seconds.
There is a good writeup about partitioning performance here: https://arrow.apache.org/docs/r/articles/dataset.html#partitioning-performance-considerations
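The single-file experiment described above can be sketched as follows, assuming the same `ozone` data frame and bucket URI as earlier in the thread; with no `partitioning` argument and no grouping, `write_dataset()` writes a single parquet file:

```r
library(arrow)

# No partitioning and no grouping: the whole dataset lands in a single
# part-0.parquet file, so open_dataset() has only one file to discover
# instead of thousands of partition directories.
write_dataset(ozone, path = db_uri, format = 'parquet')
```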
Could you please try saving the data with less partitioning?
For example, I wrote the data like the code below so it partitions just on aqs_sitecode:
write_dataset(dplyr::group_by(ozone, aqs_sitecode),
              path = db_uri,
              format = 'parquet')
Then in the app:
arrow::open_dataset(db_uri, format = 'parquet') %>%
  filter(aqs_sitecode == s,
         parameter == 'Ozone',
         sample_duration == '1 HOUR',
         poc == 1) %>%
  select(date_time2, sample_measurement) %>%
  collect()
This is looking great. As you said, query time is within bounds, and this eliminates the timeout issue.
Thank you so much for your time and attention, @boshek and @tnederlof!
Closing this as it appears this was resolved, feel free to reopen if that isn't correct though!
Describe the bug, including details regarding any error messages, version, and platform.
Issue description:
I have an RShiny app that pulls data from a hive-partitioned dataset hosted in a private AWS bucket using the R arrow package. The dataset contains air quality data. It is ~30 million rows in total, and partitioned by site and pollutant. One of the tabs in the app that utilizes the dataset allows a user to choose a site and pollutant to display in a timeseries, prompting data collection from the dataset using dplyr + arrow to execute a query. Each site/parameter combination requires only ~100k rows for visualization.
Currently, this app is hosted on RStudio Connect, though the issue also occurs on shinyapps.io. The error occurs only when the app is deployed/published to a server: there are no issues when the app is running locally, even if the app is left idle for a long time.
Eventually, the app will crash with the errors "Warning: stack imbalance ...", then "caught segfault" and "memory not mapped" when a user selects an option that kicks off a query from the AWS-hosted dataset. The period of time that the app functions after being reset varies, sometimes crashing on the first click after opening the app and other times crashing only after a few minutes. If left open long enough, the deployed app will always return this error on an action that prompts a data pull from the AWS-hosted dataset.
Example error messages:
Troubleshooting steps:
I suspected that the process maintaining the connection between AWS and the server was idling/disconnecting, and used a reactivePoll to collect from the dataset every 30 seconds (see below) to prevent the process from idling. This minimal collection in the reactivePoll is always successful, even after up to 15 minutes of running the app. However, this does not prevent the error from occurring when attempting to access the time series tab.
REPREX:
This work is part of a project that can't yet be shared publicly, but a reproducible example of a similar, if not the same, issue is available in this stackoverflow post: https://stackoverflow.com/questions/73654587/how-can-i-use-r-arrow-and-aws-s3-in-a-shiny-app-deployed-on-ec2-with-shinyproxy
My basic app setup is below:
SessionInfo:
Component(s)
R