Make `read_resource()` more modular

read_resource() is very long. It could benefit from a more modular approach.

✅ Get resource, includes check_package()

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L202-L204

[x] Already a helper function

✅ Get paths, schema and fields

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L205-L210

[x] Not much gain here

✅ Check all selected columns appear in schema

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L211-L222

[x] Typical check, not much gain here

❌ Create locale with encoding, decimal_mark and grouping_mark

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L223-L262

[x] Needs helper function get_encoding() (or encoding())
[x] Needs resource$encoding and resource$fields, so best to pass resource (and package)
[x] Returns readr::locale

❌ Create col_types: list(, , ...)

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L263-L327

What we want is:

cols(
  field1 = col_integer(),
  field2 = col_character(),
  ...
)

[x] Helper function get_cols() (or cols()). It could work on a single field, but more useful is that it always returns a readr::cols()
[x] It then also aligns with a future get_na_cols() (or na_cols()) cf. https://kylehusmann.com/interlacer/reference/na_cols.html
[x] Maybe internally it also has a function get_col_type() (or col()) for a single field (so get_cols() is mostly a wrapper).
[x] Do we need a helper function per column type? Would likely improve readability if just have a switch on field type that then correctly passes the field to the correct function?

So:

get_cols <- function(fields) {
  for field:
    get_col(field)
}

get_col <- function(field) {
  type <-
  switch(type,
    "string" = col_string(),
    "number" = col_number(),
    ...
  )
}

col_string <- function() ...
col_number <- function(field) {
  get_enum()
  get_bare_number()
  get_format()
  ...
}

❌ Assign names: list("name1" = , "name2" = ...)

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L328-L330

[x] This should be integrated in get_cols()

✅ Select CSV dialect, see https://specs.frictionlessdata.io/csv-dialect/

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L331-L334

[x] Not much gain here

✅ Read data directly

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L335-L338

[x] Not much gain here

✅ Read data from data

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L339-L342

[x] Not much gain here

❌ Read data from path(s)

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L343-L379

Is a for loop over paths, calling readr::read_delim() for all and then dplyr::bind_rows()

[x] Maybe we can have a read_delim() that calls all the get_cols() internally rather than building it encoding and cols and na_cols beforehand? But maybe that is harder to debug?

This is how read_resource() would look like with the above changes:

read_resource <- function(package, resource_name, col_select = NULL) {
  # Get resource, includes check_package()
  resource <- get_resource(package, resource_name)

  # Get paths, schema and fields
  paths <- resource$path
  schema <- get_schema(package, resource_name)
  fields <- schema$fields
  field_names <- purrr::map_chr(fields, ~ purrr::pluck(.x, "name"))

  # Check all selected columns appear in schema
  if (!all(col_select %in% field_names)) {
    col_select_missing <- col_select[!col_select %in% field_names]
    cli::cli_abort(
      c(
        "Can't find column{?s} {.val {col_select_missing}} in field names.",
        "i" = "Field name{?s}: {.val {field_names}}."
      ),
      class = "frictionless_error_colselect_mismatch"
    )
  }

  # Create locale with encoding, decimal_mark and grouping_mark
  locale <- get_locale()

  # Create col_types: list(<collector_character>, <collector_logical>, ...)
  col_types <- get_cols()

  # Select CSV dialect, see https://specs.frictionlessdata.io/csv-dialect/
  # Note that dialect can be NULL
  dialect <- read_descriptor(resource$dialect, package$directory, safe = TRUE)

  # Read data directly
  if (resource$read_from == "df") {
    df <- dplyr::as_tibble(resource$data)

  # Read data from data
  } else if (resource$read_from == "data") {
    df <- dplyr::as_tibble(do.call(rbind.data.frame, resource$data))

  # Read data from path(s)
  } else if (resource$read_from == "path" || resource$read_from == "url") {
    dataframes <- list()
    for (i in seq_along(paths)) {
      data <- readr::read_delim(
        file = paths[i],
        delim = dialect$delimiter %||% ",",
        quote = dialect$quoteChar %||% "\"",
        escape_backslash = if (dialect$escapeChar %||% "not set" == "\\") {
          TRUE
        } else {
          FALSE
        },
        escape_double = if (dialect$escapeChar %||% "not set" == "\\") {
          # If escapeChar is set, set doubleQuote to FALSE (mutually exclusive)
          FALSE
        } else {
          dialect$doubleQuote %||% TRUE
        },
        col_names = field_names,
        col_types = col_types,
        # Use rlang {{}} to avoid `col_select` to be interpreted as the name of
        # a column, see https://rlang.r-lib.org/reference/topic-data-mask.html
        col_select = {{col_select}},
        locale = locale,
        na = schema$missingValues %||% "",
        comment = dialect$commentChar %||% "",
        trim_ws = dialect$skipInitialSpace %||% FALSE,
        # Skip header row when present
        skip = if (dialect$header %||% TRUE) 1 else 0,
        skip_empty_rows = TRUE
      )
      dataframes[[i]] <- data
    }
    # Merge data frames for all paths
    df <- dplyr::bind_rows(dataframes)
  }

  return(df)
}

frictionlessdata / frictionless-r

Make `read_resource()` more modular #210