R cohort codegen sample

calbach commented 2 years ago

For review only. I do not intend to merge this.

The problem

Currently, the AoU R codegen uses bq_table_download(). This is slow, but mostly works. However, with some recent bigrquery package changes (https://github.com/r-dbi/bigrquery/issues/461), it has become apparent that the package behaves poorly in cases where BigQuery returns pages smaller than the requested page size. Previously, it seems this resulted in silent data issues; now it results in an error (reported by AoU users).

This does not look like a quick fix in the package, and the package documentation also recommends using a CSV-based approach for larger datasets anyways https://www.rdocumentation.org/packages/bigrquery/versions/1.4.0/topics/bq_table_download

Proposed solution

Change the AoU R codegen to instead:

Export the query results to GCS as CSV
Localize the CSV
Load CSV shards as dataframes and concat

Known issues

Previously, we coerced all BigQuery numbers to the int64 type; this avoids potential issues with large table IDs (concept IDs, etc). I don't see an easy way to do this; likely my approach would coerce all number columns in the dataframe to int64 explicitly.

calbach commented 2 years ago

Hi @deflaux, looking for your R expertise on this both on the approach and the coding style. Open to completely changing the approach as well, but wanted to put a strawman here for discussion at least.

deflaux commented 2 years ago

Hi @deflaux, looking for your R expertise on this both on the approach and the coding style. Open to completely changing the approach as well, but wanted to put a strawman here for discussion at least.

This is a good strawman @calbach ! I'd like to test it and offer a more brief version of this code. Should have time to take a look tomorrow. I'll send a commit here so that you can see the diff.

deflaux commented 2 years ago

Full example is in https://workbench.researchallofus.org/workspaces/aou-rw-d03ec71b/duplicateoftestanddevelopcodesnippetsonaouv5/notebooks/test_r_dataset_builder_snippets_export_csv.ipynb

deflaux commented 2 years ago

I will start updating our codegen with this approach, thanks!

One stylistic question: do you prefer this as a single cell, or do you recommend multiple? e.g. maybe splitting cell1: SQL + BQ save, cell2: bind_rows

The split you suggest sounds good to me - especially because there could be a failure in the map operation. I did not test this on something truly enormous. This is just a guess - but if the data pulled down is larger than what can be loaded into RAM, I think it would fail part way through that map operation with a dead kernel. But the user could see the storage path to their exported data output from the prior cell and just load some of it if they do not want to increase RAM to be able to load all of it.

deflaux commented 2 years ago

@deflaux one change I'm noticing from the older codegen: this now generates concept ID columns as type <dbl> in the preview (which appears to just be a "numeric" type in the dataframe spec). In the old codegen, these would show as bigint (IIRC). Does the behavior here seem reasonable to you?

I also tested by injecting our largest concept ID into one of the CSV inputs, and this did not change anything. I do have a slight concern that the autopicking of column types might be inconsistent across CSV shards, and then there might be a bug during the merge.

Overall, I think its more common to use double for enormous ints since R does not natively support them. The change in type seems reasonable to me.

We can use https://readr.tidyverse.org/reference/cols.html for the person_id column type ensure its consistent across shards. I added a commit with that update.

calbach commented 2 years ago

To clarify - I'm not specifically concerned with person_id, more with the general heterogeneity of data across columns. It seems like it could be possible for shard0.csv to autodetect a column like measurement_value as string, and then shard1.csv it gets detected as numeric. I assume this would result in an error during bind_rows

deflaux commented 2 years ago

What you describe is definitely possible. My current assumption is that its not likely due to the OMOP data format - its generally not too sparse.

Might be worth having several people kick the tires on this in preprod and get feedback?
How about spinning up a machine with a ton of RAM and test a SELECT * on several tables?

One way to mitigate is to switch from map to a for loop, and then use the cols() determined from loading the first CSV as the col_type parameter when loading the rest of the CSVs.

calbach commented 2 years ago

Thanks @deflaux , I will pick this back up after the break. Yesterday I tried for a bit to get the types from the first tibble and apply them to the next, but I wasn't able to get it working, I may try a bit more (I missed your reference to cols(), I'll see if this helps).

Also:

After the break, will test this out on some large dataframes in preprod/prod
I did confirm that if the tibble schemas disagree, bind_rows will fail. I have yet to reproduce a legitimate situation where the tibble schemas disagree though.

deflaux commented 2 years ago

PTAL, this approach seems to be working. Tested on all surveys and a large measurements dataset in prod, without issue.

In particular, I'm curious whether you think this new col_types approach is worth the added complexity. Types are at least consistent within a single query - but you might get different answers depending on the order of results or the particular cohort query you use.

Nice work - the code you added for col_types looks great!! It's pretty brief, so I think it's okay to add it even if in practice it's often not needed. But let's consider updating it to be a function since people will need to use it over and over in other notebooks. The function could also be added to the collection of R workbench snippets for easy pasting.

Regarding the value of col_types being data dependent: the most important thing is to successfully load the dataframe into R and this code accomplishes that. Once they have the dataframe, people can add code to cast particular columns to other types, as desired.

# Read the data directly from Cloud Storage into memory.
# NOTE: Alternatively you can `gsutil -m cp {survey_export_15947426_path} .` to copy these files
#       to the Jupyter disk.
read_bq_export_from_workspace_bucket <- function(export_path) {
    col_types <- NULL
    bind_rows(
        map(system2('gsutil', args = c('ls', export_path), stdout = TRUE, stderr = TRUE),
            function(csv) {
                message(str_glue('Loading {csv}.'))
                chunk <- read_csv(pipe(str_glue('gsutil cat {csv}')), col_types = col_types, show_col_types = FALSE)
                if (is.null(col_types)) {
                    col_types <- spec(chunk)    
                }
                chunk
            })
    )
}

In other notebooks, people could paste in this one function, and then the following calls will work fine:

# folder path
foo <- read_bq_export_from_workspace_bucket(
    'gs://fc-secure-11de59ab-3e1a-447e-8201-f6d5e52dfbf3/bq_exports/deflaux@researchallofus.org/20211216/person_87319558/')

# glob path
bar <- read_bq_export_from_workspace_bucket(
    'gs://fc-secure-11de59ab-3e1a-447e-8201-f6d5e52dfbf3/bq_exports/deflaux@researchallofus.org/20211216/person_87319558/person_87319558_*.csv')

calbach commented 2 years ago

Thanks for the reviews, will go with this version for now (with the function split as suggested).

all-of-us / workbench-snippets