apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[R][R Shiny] Deployed/published Rshiny app using arrow to query an AWS-hosted dataset causes intermittent "stack imbalance", "segfault", "memory not mapped" errors. #15056

Closed. cgostic closed this issue 11 months ago.

cgostic commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

Issue description:

I have an RShiny app that pulls data from a hive-partitioned dataset hosted in a private AWS bucket using the R arrow package. The dataset contains air quality data; it is ~30 million rows in total and is partitioned by site and pollutant. One of the tabs in the app allows a user to choose a site and pollutant to display in a time series, which triggers a query against the dataset via dplyr + arrow. Each site/parameter combination requires only ~100k rows for visualization.
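
The query in that tab boils down to something like the following simplified sketch (the bucket URI and column names here are placeholders rather than the real schema):

library(arrow)
library(dplyr)

# Hive-partitioned dataset in a private S3 bucket (placeholder URI)
ds <- open_dataset('s3://<bucket>/air-quality/', format = 'arrow')

# Pull only the rows needed for one time series (~100k rows); the filters on
# the partition columns are pushed down to the dataset scan before collect()
site_data <- ds %>%
  filter(sitecode == selected_site, pollutant == selected_pollutant) %>%
  select(date_time, value) %>%
  collect()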

Currently, this app is hosted on RStudio Connect, though the issue also occurs on shinyapps.io. The error occurs only when the app is deployed/published to a server; there are no issues when the app runs locally, even if it is left idle for a long time.

Eventually, the app crashes with the errors "Warning: stack imbalance ...", then "caught segfault" and "memory not mapped", when a user selects an option that kicks off a query against the AWS-hosted dataset. How long the app works after a restart varies: sometimes it crashes on the first click after opening, other times only after a few minutes of use. If left open long enough, the deployed app will always return this error on an action that prompts a data pull from the AWS-hosted dataset.

Example error messages:

2022/12/20 20:48:33.816768370 Warning: stack imbalance in '$', 325 then 328
2022/12/20 20:48:33.816831632 Warning: stack imbalance in '<<-', 328 then 330
2022/12/20 20:48:33.816846663 
2022/12/20 20:48:33.816848493  *** caught segfault ***
2022/12/20 20:48:33.816938465 address 0x2c, cause 'memory not mapped'
2022/12/20 20:48:33.816944956 
2022/12/20 20:48:33.816963716  *** caught segfault ***
2022/12/20 20:48:33.816965356 address 0xe, cause 'memory not mapped'
2022/12/20 20:48:33.816975686 
2022/12/20 20:48:33.816976807  *** caught segfault ***
2022/12/20 20:48:33.816987197 address 0x30, cause 'memory not mapped'
2022/12/20 20:48:33.817012408 Warning: stack imbalance in '$', 244 then 259

Troubleshooting steps:

I suspected that the process maintaining the connection between AWS and the server was idling/disconnecting, so I used a reactivePoll to query the dataset every 30 seconds (see below) to prevent the process from idling. This minimal collection in the reactivePoll always succeeds, even after the app has been running for up to 15 minutes. However, it does not prevent the error from occurring when attempting to access the time series tab.

  log_db <- reactivePoll(30000, session, # reactivePoll inside the server
                         # checkFunc: query the dataset and return the row count
                         checkFunc = function() {
                           print(ds %>%
                                   filter(sitecode == 'abc',
                                          pollutant == 'def') %>%
                                   collect() %>%
                                   nrow())

                           ds %>%
                             filter(sitecode == 'abc',
                                    pollutant == 'def') %>%
                             collect() %>%
                             nrow()
                         },
                         valueFunc = function() {
                           paste0("  ")
                         }
  )

  # With a corresponding textOutput('connstr') in the UI
  output$connstr <- renderText(log_db())

REPREX:

This work is part of a project that can't yet be shared publicly, but a reproducible example of a similar, if not the same, issue is available in this stackoverflow post: https://stackoverflow.com/questions/73654587/how-can-i-use-r-arrow-and-aws-s3-in-a-shiny-app-deployed-on-ec2-with-shinyproxy

My basic app setup is below:

library(arrow)
library(tidyverse)
library(shiny)
library(shinycssloaders)
library(sonomaDashboardUI)
library(shinyWidgets)
library(aws.s3)
library(shinyjs)
library(shinybusy)

db_uri <- paste0('s3://','<bucketname>')
ds <- arrow::open_dataset(db_uri, format = 'arrow', unify_schemas = F)

ui <- fluidPage(
    tabsetPanel(id = 'tabs',
        tabPanel('tab that uses dataset',
            fluidRow(selectInput('sitecode', 'Site', choices = c(...))),
            fluidRow(selectInput('pollutant', 'Pollutant', choices = c(...))),
            fluidRow(plotOutput('plotThatUsesDS'))),
        tabPanel('tab that does not use dataset',
            fluidRow(plotOutput('plotThatDoesNotUseDS')))
    )
)

server <- function(session, input, output) {
    plot_ds_data <- reactive({
        ds %>%
            filter(sitecode == input$sitecode, pollutant == input$pollutant) %>%
            collect()
    })

    output$plotThatUsesDS <- renderPlot({ ggplot(plot_ds_data()) + geom_line(...) })
}

SessionInfo:

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] shinybusy_0.3.1              shinyjs_2.1.0                aws.s3_0.3.21               
 [4] shinyWidgets_0.7.2           sonomaDashboardUI_0.0.0.9000 forcats_0.5.1               
 [7] stringr_1.4.0                dplyr_1.0.9                  purrr_0.3.4                 
[10] readr_2.1.2                  tidyr_1.2.0                  tibble_3.1.8                
[13] ggplot2_3.4.0                tidyverse_1.3.2              shinycssloaders_1.0.0       
[16] shiny_1.7.3                  arrow_10.0.1                

loaded via a namespace (and not attached):
  [1] googledrive_2.0.0          colorspace_2.0-3           deldir_1.0-6               ellipsis_0.3.2            
  [5] gghighlight_0.4.0          leaflet_2.1.1              snakecase_0.11.0           base64enc_0.1-3           
  [9] fs_1.5.2                   rstudioapi_0.13            hexbin_1.28.2              DT_0.24                   
 [13] bit64_4.0.5                fansi_1.0.3                lubridate_1.8.0            xml2_1.3.3                
 [17] splines_4.2.2              R.methodsS3_1.8.2          cachem_1.0.6               jsonlite_1.8.0            
 [21] openair_2.10-0             broom_1.0.1                cluster_2.1.4              dbplyr_2.2.1              
 [25] png_0.1-7                  PerformanceAnalytics_2.0.4 R.oo_1.25.0                mapproj_1.2.8             
 [29] compiler_4.2.2             httr_1.4.3                 backports_1.4.1            assertthat_0.2.1          
 [33] Matrix_1.5-1               fastmap_1.1.0              gargle_1.2.1               cli_3.4.1                 
 [37] later_1.3.0                htmltools_0.5.3            tools_4.2.2                gtable_0.3.0              
 [41] glue_1.6.2                 maps_3.4.0                 Rcpp_1.0.9                 cellranger_1.1.0          
 [45] jquerylib_0.1.4            styler_1.7.0               vctrs_0.5.1                nlme_3.1-160              
 [49] crosstalk_1.2.0            rvest_1.0.3                mime_0.12                  lifecycle_1.0.3           
 [53] renv_0.16.0                googlesheets4_1.0.1        MASS_7.3-58.1              zoo_1.8-10                
 [57] scales_1.2.0               hms_1.1.1                  promises_1.2.0.1           RColorBrewer_1.1-3        
 [61] quantmod_0.4.20            curl_4.3.2                 aws.signature_0.6.0        gridExtra_2.3             
 [65] sass_0.4.2                 latticeExtra_0.6-30        stringi_1.7.8              TTR_0.24.3                
 [69] rlang_1.0.6                pkgconfig_2.0.3            lattice_0.20-45            htmlwidgets_1.5.4         
 [73] bit_4.0.4                  tidyselect_1.1.2           magrittr_2.0.3             R6_2.5.1                  
 [77] generics_0.1.3             DBI_1.1.3                  pillar_1.8.0               haven_2.5.1               
 [81] withr_2.5.0                mgcv_1.8-41                xts_0.12.1                 tidyquant_1.0.4           
 [85] janitor_2.1.0              modelr_0.1.9               crayon_1.5.1               Quandl_2.11.0             
 [89] interp_1.1-3               utf8_1.2.2                 tzdb_0.3.0                 viridis_0.6.2             
 [93] jpeg_0.1-9                 grid_4.2.2                 readxl_1.4.1               reprex_2.0.2              
 [97] digest_0.6.29              xtable_1.8-4               R.cache_0.16.0             httpuv_1.6.5              
[101] R.utils_2.12.0             munsell_0.5.0              viridisLite_0.4.0          bslib_0.4.0               
[105] quadprog_1.5-8            

Component(s)

R

boshek commented 1 year ago

:wave: thanks for reporting!

I have unsuccessfully tried to reproduce the issue reported in the Stack Overflow question on RStudio Connect.

Is it possible to trim your example down to a more minimal working example? I think there might be a few typos in there, as well as some libraries that may not be needed. I am not sure this is an arrow issue yet, but let's try to narrow that down. I am also wondering if you've tried this with public-facing data - can you reproduce with the taxi data?

library(arrow)
bucket <- s3_bucket("voltrondata-labs-datasets/nyc-taxi/year=2019/month=6")
open_dataset(bucket)
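
For what it's worth, a minimal connectivity check against that public dataset (just pulling a few rows) could look like this sketch:

library(arrow)
library(dplyr)

bucket <- s3_bucket("voltrondata-labs-datasets/nyc-taxi/year=2019/month=6")
taxi <- open_dataset(bucket)

# Pull a handful of rows to confirm the S3 connection and the scan both work
taxi %>%
  head(10) %>%
  collect()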
cgostic commented 1 year ago

@boshek Thanks so much for the reply! I appreciate your time!

I've worked out a reprex I can share that still yields the error.

Attached Materials:

Steps to recreate:

1. Write ozone data to a private AWS bucket

library(arrow)
library(aws.s3)

# Using attached dataset
ozone <- read.csv(unz('ozone_data_2022.zip', 'ozone_data_2022.csv'))

bname <- '<your-bucket>'
db_uri <- paste0('s3://',bname)

write_dataset(ozone, 
              path = db_uri, 
              format = 'arrow', 
              partitioning = c('aqs_sitecode', 'sample_duration', 
                               'parameter', 'poc'))

2. Deploy app

This app should run fine locally.

After deployment (to shinyapps.io or RStudio Connect) I've observed the following:

# See attached lockfile for package versions
library(shiny)
library(dplyr)
library(ggplot2)
library(htmltools)
library(arrow)
library(aws.s3)

bname <- 'bucket'
db_uri <- paste0('s3://',bname)
ds <- arrow::open_dataset(db_uri, format = 'arrow', unify_schemas = F)

ui <- fluidPage(
  fluidRow(column(3,
                  selectInput('sitecode',
                              label = 'Select Site',
                              choices = unique(metadata$aqs_sitecode),
                              selected = NULL)),
           column(2,
                  div(style = 'padding-top:26px',
                      actionButton('go', 'Create Plot', width = '100%')))),
  fluidRow(plotOutput('TS')),
  fluidRow(textOutput('connstr'))

)

server <- function(input, output, session) {
  selected_site <- reactiveValues(sitecode = NULL)
  observeEvent(input$go, {
    selected_site$sitecode <- input$sitecode
  })

  plot_data <- reactive({
    req(selected_site$sitecode)

    s <- as.integer(selected_site$sitecode)

    ds %>%
      filter(aqs_sitecode == s,
             parameter == 'Ozone',
             sample_duration == '1 HOUR',
             poc == 1) %>%
      select(date_time2, sample_measurement) %>%
      collect()
  })

  output$TS <- renderPlot({
    req(selected_site$sitecode, 
        is.data.frame(plot_data()))

    ggplot(plot_data()) +
      geom_line(aes(date_time2, sample_measurement)) +
      scale_x_datetime() +
      labs(x = 'DateTime', y = 'Ozone in ppb', main = selected_site$sitecode)
  })

  log_db <- reactivePoll(30000, session, #reactivePoll inside the server
                         # checkFunc: query the dataset and return the row count
                         checkFunc = function() {
                           r_num <- ds %>%
                                   filter(aqs_sitecode == 51190007,
                                          sample_duration == '1 HOUR',
                                          parameter == 'Ozone',
                                          poc == 1) %>%
                                   collect() %>%
                                   nrow()
                           print(r_num)
                           return(as.character(r_num))

                         },
                         valueFunc = function() {
                           print('pinged')
                           return(" ")
                         }
  )

  output$connstr <- renderText(log_db())
}

shinyApp(ui = ui, server = server)

3. In-app workflow to create error

shinyappsio_logs.txt

rstudio_connect.log

ozone_data_2022.zip

renv_lock_txt.txt

sonomatechDS commented 1 year ago

Hello and happy new year! I'm wondering if this issue has been or will be revisited with the updated reprex above. Identifying whether the problem is with Arrow or with other software would really help us decide whether we need to shift our data structure in the short term to meet deadlines.

Thank you so much for your time and attention! It is really appreciated! @boshek @rok

boshek commented 1 year ago

I can reproduce this issue in RStudio Connect.

This is the error that I see (which is different from what you are reporting):

2023/01/09 23:47:44.920714517 [1] 3785
2023/01/09 23:47:45.194163992 [1] 3785
2023/01/09 23:47:45.197808737 [1] "pinged"
2023/01/09 23:48:15.466341119 [1] 3785
2023/01/09 23:48:45.750669396 [1] 3785
2023/01/09 23:49:16.046789291 [1] 3785
2023/01/09 23:49:46.366873116 [1] 3785
2023/01/09 23:50:16.639291193 [1] 3785
2023/01/09 23:50:46.902676880 [1] 3785
2023/01/09 23:51:17.211774091 [1] 3785
2023/01/09 23:51:20.593104281 Warning: Error in collect: ignoring SIGPIPE signal
2023/01/09 23:51:20.606683408 212:
2023/01/09 23:51:20.606694399 211: signalCondition
2023/01/09 23:51:20.606724639 210: signal_abort
2023/01/09 23:51:20.606743769 209: abort
2023/01/09 23:51:20.606782270 208: augment_io_error_msg
2023/01/09 23:51:20.606783790 207: value[[3L]]
2023/01/09 23:51:20.606797460 206: tryCatchOne
2023/01/09 23:51:20.606798670 205: tryCatchList
2023/01/09 23:51:20.606891601 204: tryCatch
2023/01/09 23:51:20.606893811 203: collect.arrow_dplyr_query
2023/01/09 23:51:20.606907001 202: collect
2023/01/09 23:51:20.606907791 201: %>%
2023/01/09 23:51:20.606914991 200: [/opt/rstudio-connect/mnt/app/app.R#50]
2023/01/09 23:51:20.606927181 198: .func
2023/01/09 23:51:20.606945612 195: contextFunc
2023/01/09 23:51:20.606947502 194: env$runWith
2023/01/09 23:51:20.606972372 187: ctx$run
2023/01/09 23:51:20.606973622 186: self$.updateValue
2023/01/09 23:51:20.606983322 184: plot_data
2023/01/09 23:51:20.606984412 182: renderPlot [/opt/rstudio-connect/mnt/app/app.R#62]
2023/01/09 23:51:20.606993812 180: func
2023/01/09 23:51:20.606995122 140: drawPlot
2023/01/09 23:51:20.607005132 126:
2023/01/09 23:51:20.607006152 110: drawReactive
2023/01/09 23:51:20.607019292 97: renderFunc
2023/01/09 23:51:20.607022683 96: output$TS
2023/01/09 23:51:20.607032143 15:
2023/01/09 23:51:20.607033293 13: fn
2023/01/09 23:51:20.607046283 8: retry
2023/01/09 23:51:20.607047413 7: connect$retryingStartServer
2023/01/09 23:51:20.607056303 6: eval
2023/01/09 23:51:20.607058203 5: eval
2023/01/09 23:51:20.607072323 4: eval
2023/01/09 23:51:20.607073053 3: eval
2023/01/09 23:51:20.607080343 2: eval.parent
2023/01/09 23:51:20.607081113 1: local

I tried a few troubleshooting steps including:

Two additional questions:

1. Have you tried writing the partitioned data out as csv (or parquet) instead of feather/arrow files?
2. Have you been able to try this on a Linux machine?

cgostic commented 1 year ago

Thanks so much for checking in!

I do see that error intermittently as well, though I more frequently get the one listed in my logs. From my observations, it seems that the SIGPIPE issue occurs on the first use of the app after publishing, but it usually shifts to the "stack imbalance" error after continued use.

(1) I can definitely check writing the partitioned data as .csvs and get back to you! However, we're really hoping to maximize querying efficiency offered by feather files.

(2) I have not tried this on a Linux machine. Our RStudio Connect is hosted on an EC2, and shinyapps is also on AWS. These are the only two platforms I have access to, unfortunately.

boshek commented 1 year ago

The feather files are rather large, so for something like this where so much data is going across the wire, csv or parquet can actually be quicker. That said, it should still work regardless of file format.
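
(For reference, csv_bucket and parquet_bucket in the benchmark below refer to copies of the same dataset written out in those formats; a rough sketch of how such copies could be created, with placeholder bucket URIs, is:)

library(arrow)

# Placeholder URIs for the csv and parquet copies of the dataset
csv_bucket     <- 's3://<your-bucket>/ozone-csv'
parquet_bucket <- 's3://<your-bucket>/ozone-parquet'

# Same partitioning columns as the original arrow/feather dataset
write_dataset(ozone, path = csv_bucket, format = 'csv',
              partitioning = c('aqs_sitecode', 'sample_duration', 'parameter', 'poc'))
write_dataset(ozone, path = parquet_bucket, format = 'parquet',
              partitioning = c('aqs_sitecode', 'sample_duration', 'parameter', 'poc'))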

microbenchmark::microbenchmark(
   arrow = open_dataset(arrow_bucket, format = 'arrow', unify_schemas = FALSE) %>%
     filter(parameter == 'Ozone',
            sample_duration == '1 HOUR',
            poc == 1) %>%
     collect(),
   csv = open_dataset(csv_bucket, format = 'csv', unify_schemas = FALSE) %>%
     filter(parameter == 'Ozone',
            sample_duration == '1 HOUR',
            poc == 1) %>%
     collect(),
   parquet = open_dataset(parquet_bucket, format = 'parquet', unify_schemas = FALSE) %>%
     filter(parameter == 'Ozone',
            sample_duration == '1 HOUR',
            poc == 1) %>%
     collect(),
   times = 3L
 )
Unit: seconds
    expr       min        lq      mean    median        uq       max neval
   arrow 16.984894 17.353667 17.822625 17.722439 18.241490 18.760542     3
     csv 10.173464 10.474577 10.876690 10.775691 11.228303 11.680915     3
 parquet  8.599491  8.630459  8.705268  8.661427  8.758156  8.854885     3

The reason I suggested Linux is that RStudio Connect and shinyapps.io both run on Linux machines, so perhaps this isn't related to RStudio Connect specifically. That would be good to test. I will see what I can do.

cgostic commented 1 year ago

Thanks for that insight; I had misunderstood the memory requirements of .csv vs. feather!

I've tried your suggestion to write the partitioned data as .csvs and still get the same segfault error. My logs now reference auto_deleter_background; this is output from httpuv, though it seems to be a consequence of the preceding "memory not mapped" issue.

2023/01/10 9:23:03 AM: Running on host: <ip>
2023/01/10 9:23:03 AM: Linux distribution: Ubuntu 22.04.1 LTS (jammy) 
2023/01/10 9:23:03 AM: Server version: 2022.09.0 
2023/01/10 9:23:03 AM: LANG: C.UTF-8 
2023/01/10 9:23:03 AM: Working directory: /opt/rstudio-connect/mnt/app 
2023/01/10 9:23:03 AM: Running content using its packrat R library 
2023/01/10 9:23:03 AM: Using Packrat dir /opt/rstudio-connect/mnt/app/packrat/lib/x86_64-pc-linux-gnu/4.2.1 
2023/01/10 9:23:03 AM: R version: 4.2.1 
2023/01/10 9:23:03 AM: shiny version: 1.7.4 
2023/01/10 9:23:03 AM: httpuv version: 1.6.6 
2023/01/10 9:23:03 AM: rmarkdown version: (none) 
2023/01/10 9:23:03 AM: knitr version: (none) 
2023/01/10 9:23:03 AM: jsonlite version: 1.8.4 
2023/01/10 9:23:03 AM: RJSONIO version: (none) 
2023/01/10 9:23:03 AM: htmltools version: 0.5.4 
2023/01/10 9:23:03 AM: reticulate version: (none) 
2023/01/10 9:23:03 AM: Using pandoc: /opt/rstudio-connect/ext/pandoc/2.16 
2023/01/10 9:23:04 AM: Using Shiny bookmarking base directory /opt/rstudio-connect/mnt/bookmarks 
2023/01/10 9:23:04 AM: 2023/01/10 9:23:04 AM: Starting R with process ID: '375796' 

...

2023/01/10 9:27:39 AM: *** caught segfault *** 
2023/01/10 9:27:39 AM: address 0x60, cause 'memory not mapped' 
2023/01/10 9:27:39 AM: An irrecoverable exception occurred. R is aborting now ... 
2023/01/10 9:27:39 AM: Can't detect correct thread for auto_deleter_background. 
2023/01/10 9:27:39 AM: Can't detect correct thread for auto_deleter_background. 
2023/01/10 9:27:39 AM: 2023/01/10 9:27:39 AM: *** caught segfault *** 
2023/01/10 9:27:39 AM: address (nil), cause 'unknown' 
cgostic commented 1 year ago

EDIT: I had commented out the reactivePoll({}) that pings the data source every 30 seconds. With it included, the issue is resolved in the reprex! Great suggestion.

I will try this solution on a larger scale and let you know if it's viable.

sonomatechDS commented 1 year ago

@boshek I re-wrote my full dataset to .csv and am still getting the same stack imbalance then segfault error as before.

It seems that there is a size limitation combined with idle time. The full version of this app displays data from 2016-2022 (300-400k rows per site), compared to only 2022 in the reprex (5-10k rows).

Above, you indicated that you tested .parquet files to no avail; is that correct? That would suggest the size of the flat file may not be so important, as I'd expect .parquet to perform even better than .csv.

I am able to recreate the error by altering the reprex app.R code to simulate a larger data load (see below). As before, when the app is being used constantly there is no issue, but when it is left idle for a minute or more it crashes as before.

Note: This is after adjusting the dataset to .csv format from .arrow format

Reprex updates:

  plot_data <- reactive({
    req(selected_site$sitecode)

    s <- as.integer(selected_site$sitecode)

    ds %>%
       filter(#aqs_sitecode == s, #### Remove this filter ... load all 300k rows on each update (simulates larger data load)
             parameter == 'Ozone',
             sample_duration == '1 HOUR',
             poc == 1) %>%
      select(aqs_sitecode, date_time2, sample_measurement) %>%
      collect()
  })

  output$TS <- renderPlot({
    req(selected_site$sitecode, 
        is.data.frame(plot_data()))

   #### Add call to subset() in ggplot()
    ggplot(subset(plot_data(), aqs_sitecode == selected_site$sitecode)) +
      geom_line(aes(date_time2, sample_measurement)) +
      scale_x_datetime() +
      labs(x = 'DateTime', y = 'Ozone in ppb', main = selected_site$sitecode)
  })
boshek commented 1 year ago

Interesting. I would really like to test this on a Linux machine to confirm whether this is related to RStudio Connect at all.

tnederlof commented 1 year ago

@cgostic Thank you for posting code and data so we could try to replicate this. I was able to run into similar segfault issues even without including the reactivePoll() piece. Most of the time when we see segfault/memory-related issues, the problem is not on the Connect side but on the application side, which makes me think it's related to the arrow::open_dataset() call sitting outside the server function.

Could you please try moving it into the server, inside the reactive statement? When I made that switch and redeployed, I was unable to reproduce any of the memory errors, so it seems that having the dataset open and idle causes the issue. For reference, the code I am using is below:

# See attached lockfile for package versions
library(shiny)
library(dplyr)
library(ggplot2)
library(htmltools)
library(arrow)
library(aws.s3)

aqs_site_code_unique <- c(51190007L, 60658001L, 60731022L, 100032004L, 110010043L, 120110034L, 
                          120573002L, 130890002L, 170314201L, 180970078L, 295100085L, 371190041L, 
                          371830014L, 420030008L, 440071010L, 510870014L, 20900034L, 40191028L, 
                          60270002L, 60850005L, 121290001L, 150030010L, 170191001L, 191630015L, 
                          230090103L, 300490004L, 310550019L, 340130003L, 360551007L, 380150003L, 
                          380171004L, 391351001L, 470090101L, 10730023L, 40139997L, 60371103L, 
                          60670006L, 202090021L, 260810020L, 270031002L, 320030540L, 330150018L, 
                          390350060L, 390610040L, 410510080L, 421010048L, 471570075L, 482011039L, 
                          490353006L, 530330080L, 60190011L, 80310026L, 90050005L, 90090027L, 
                          160010010L, 220330009L, 240230002L, 240330030L, 250250042L, 280490020L, 
                          330115001L, 350010023L, 360810124L, 361010003L, 401431127L, 450790007L, 
                          481410044L, 500070007L, 530090013L, 540390020L, 560210100L, 720210010L, 
                          320310031L, 400019009L)

ui <- fluidPage(
  fluidRow(column(3,
                  selectInput('sitecode',
                              label = 'Select Site',
                              choices = aqs_site_code_unique,
                              selected = NULL)),
           column(2,
                  div(style = 'padding-top:26px',
                      actionButton('go', 'Create Plot', width = '100%')))),
  fluidRow(plotOutput('TS')),

)

server <- function(input, output, session) {

  selected_site <- reactiveValues(sitecode = NULL)

  observeEvent(input$go, {
    selected_site$sitecode <- input$sitecode
  })

  plot_data <- reactive({
    req(selected_site$sitecode)

    s <- as.integer(selected_site$sitecode)

    bname <- 'BUCKETNAMEHERE'
    db_uri <- paste0('s3://', bname)
    ds <- arrow::open_dataset(db_uri, format = 'arrow', unify_schemas = F)
    ds %>%
      filter(parameter == 'Ozone',
             sample_duration == '1 HOUR',
             poc == 1) %>%
      select(aqs_sitecode, date_time2, sample_measurement) %>%
      collect()
  })

  output$TS <- renderPlot({
    req(selected_site$sitecode, 
        is.data.frame(plot_data()))

    ggplot(subset(plot_data(), aqs_sitecode == selected_site$sitecode)) +
      geom_line(aes(date_time2, sample_measurement)) +
      scale_x_datetime() +
      labs(x = 'DateTime', y = 'Ozone in ppb', main = selected_site$sitecode)
  })
}

shinyApp(ui = ui, server = server)
sonomatechDS commented 1 year ago

@tnederlof that's a great thought! I had previously considered this option, but our full dataset takes a while (~8-10 s) to load via arrow::open_dataset(), possibly because of its size/number of partitions. It would hurt usability if a user had to wait that long each time an input changed. Am I understanding correctly that this would be the case?

However, this idea could be viable if we abandon loading the full dataset each time and instead use the inputs to build a more specific URI, i.e. a subset of the dataset (e.g. calling open_dataset() on 's3://bucket/sitecode=xxxx/parameter=xxxx/...', or even just read_csv_arrow() on a URI pointing to the desired csv).
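
A rough sketch of that idea is below (the bucket name and partition layout are placeholders, and it assumes the remaining levels are still hive-partitioned so the other filters can be pushed down):

plot_data <- reactive({
  req(input$sitecode)

  # Placeholder layout: build a partition-specific URI from the selected input
  subset_uri <- paste0('s3://<your-bucket>/',
                       'aqs_sitecode=', input$sitecode, '/')

  # Open only that partition subtree instead of the whole dataset;
  # the remaining hive-style directories are still detected as columns
  arrow::open_dataset(subset_uri, format = 'csv') %>%
    dplyr::filter(parameter == 'Ozone',
                  sample_duration == '1 HOUR',
                  poc == 1) %>%
    dplyr::select(date_time2, sample_measurement) %>%
    dplyr::collect()
})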

I'll do some testing on this today, and please let me know if you foresee any issues in the meantime.

I appreciate your time and input!

tnederlof commented 1 year ago

That's kind of surprising to me, that it takes that long to open the dataset; I suspect all the partitioning is causing issues. I was able to replicate the issue you faced using the same partitioning structure (I just faked 24x the data with different intervals). Then I tried saving all of the data in a single parquet file (it's about 1 GB), and now it runs in <0.5 s instead of 8-9 s.
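
(A minimal sketch of that single-file test, with a placeholder path; recent arrow versions can also write directly to an 's3://...' URI, but the local path here is just illustrative:)

library(arrow)

# Write the full table to a single Parquet file (no partitioning)
write_parquet(ozone, 'ozone_data_2022.parquet')

# Reading it back is then a single-file open rather than a many-partition scan
tbl <- read_parquet('ozone_data_2022.parquet')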

There is a good writeup about partitioning performance here: https://arrow.apache.org/docs/r/articles/dataset.html#partitioning-performance-considerations

Could you please try saving the data with less partitioning?

For example, I wrote the data like the code below so it partitions just on aqs_sitecode:

write_dataset(dplyr::group_by(ozone, aqs_sitecode), 
              path = db_uri, 
              format = 'parquet')

Then in the app:

arrow::open_dataset(db_uri, format = 'parquet') %>%
      filter(aqs_sitecode == s,
             parameter == 'Ozone',
             sample_duration == '1 HOUR',
             poc == 1) %>%
      select(date_time2, sample_measurement) %>%
      collect()
sonomatechDS commented 1 year ago

This is looking great. As you said, query time is within bounds, and this eliminates the timeout issue.

Thank you so much for your time and attention, @boshek and @tnederlof!

thisisnic commented 11 months ago

Closing this as it appears to be resolved; feel free to reopen if that isn't correct!