frictionlessdata / frictionless-r

R package to read and write Frictionless Data Packages
https://docs.ropensci.org/frictionless/
Other
27 stars 11 forks source link

Make `read_resource()` more modular #210

Closed peterdesmet closed 2 months ago

peterdesmet commented 3 months ago

read_resource() is very long. It could benefit from a more modular approach.

✅ Get resource, includes check_package()

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L202-L204

✅ Get paths, schema and fields

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L205-L210

✅ Check all selected columns appear in schema

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L211-L222

❌ Create locale with encoding, decimal_mark and grouping_mark

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L223-L262

❌ Create col_types: list(, , ...)

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L263-L327

What we want is:

cols(
  field1 = col_integer(),
  field2 = col_character(),
  ...
)

So:

get_cols <- function(fields) {
  for field:
    get_col(field)
}

get_col <- function(field) {
  type <-
  switch(type,
    "string" = col_string(),
    "number" = col_number(),
    ...
  )
}

col_string <- function() ...
col_number <- function(field) {
  get_enum()
  get_bare_number()
  get_format()
  ...
} 

❌ Assign names: list("name1" = , "name2" = ...)

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L328-L330

✅ Select CSV dialect, see https://specs.frictionlessdata.io/csv-dialect/

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L331-L334

✅ Read data directly

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L335-L338

✅ Read data from data

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L339-L342

❌ Read data from path(s)

https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L343-L379

Is a for loop over paths, calling readr::read_delim() for all and then dplyr::bind_rows()

peterdesmet commented 3 months ago

This is how read_resource() would look like with the above changes:

read_resource <- function(package, resource_name, col_select = NULL) {
  # Get resource, includes check_package()
  resource <- get_resource(package, resource_name)

  # Get paths, schema and fields
  paths <- resource$path
  schema <- get_schema(package, resource_name)
  fields <- schema$fields
  field_names <- purrr::map_chr(fields, ~ purrr::pluck(.x, "name"))

  # Check all selected columns appear in schema
  if (!all(col_select %in% field_names)) {
    col_select_missing <- col_select[!col_select %in% field_names]
    cli::cli_abort(
      c(
        "Can't find column{?s} {.val {col_select_missing}} in field names.",
        "i" = "Field name{?s}: {.val {field_names}}."
      ),
      class = "frictionless_error_colselect_mismatch"
    )
  }

  # Create locale with encoding, decimal_mark and grouping_mark
  locale <- get_locale()

  # Create col_types: list(<collector_character>, <collector_logical>, ...)
  col_types <- get_cols()

  # Select CSV dialect, see https://specs.frictionlessdata.io/csv-dialect/
  # Note that dialect can be NULL
  dialect <- read_descriptor(resource$dialect, package$directory, safe = TRUE)

  # Read data directly
  if (resource$read_from == "df") {
    df <- dplyr::as_tibble(resource$data)

  # Read data from data
  } else if (resource$read_from == "data") {
    df <- dplyr::as_tibble(do.call(rbind.data.frame, resource$data))

  # Read data from path(s)
  } else if (resource$read_from == "path" || resource$read_from == "url") {
    dataframes <- list()
    for (i in seq_along(paths)) {
      data <- readr::read_delim(
        file = paths[i],
        delim = dialect$delimiter %||% ",",
        quote = dialect$quoteChar %||% "\"",
        escape_backslash = if (dialect$escapeChar %||% "not set" == "\\") {
          TRUE
        } else {
          FALSE
        },
        escape_double = if (dialect$escapeChar %||% "not set" == "\\") {
          # If escapeChar is set, set doubleQuote to FALSE (mutually exclusive)
          FALSE
        } else {
          dialect$doubleQuote %||% TRUE
        },
        col_names = field_names,
        col_types = col_types,
        # Use rlang {{}} to avoid `col_select` to be interpreted as the name of
        # a column, see https://rlang.r-lib.org/reference/topic-data-mask.html
        col_select = {{col_select}},
        locale = locale,
        na = schema$missingValues %||% "",
        comment = dialect$commentChar %||% "",
        trim_ws = dialect$skipInitialSpace %||% FALSE,
        # Skip header row when present
        skip = if (dialect$header %||% TRUE) 1 else 0,
        skip_empty_rows = TRUE
      )
      dataframes[[i]] <- data
    }
    # Merge data frames for all paths
    df <- dplyr::bind_rows(dataframes)
  }

  return(df)
}