Closed peterdesmet closed 2 months ago
This is how read_resource()
would look like with the above changes:
read_resource <- function(package, resource_name, col_select = NULL) {
# Get resource, includes check_package()
resource <- get_resource(package, resource_name)
# Get paths, schema and fields
paths <- resource$path
schema <- get_schema(package, resource_name)
fields <- schema$fields
field_names <- purrr::map_chr(fields, ~ purrr::pluck(.x, "name"))
# Check all selected columns appear in schema
if (!all(col_select %in% field_names)) {
col_select_missing <- col_select[!col_select %in% field_names]
cli::cli_abort(
c(
"Can't find column{?s} {.val {col_select_missing}} in field names.",
"i" = "Field name{?s}: {.val {field_names}}."
),
class = "frictionless_error_colselect_mismatch"
)
}
# Create locale with encoding, decimal_mark and grouping_mark
locale <- get_locale()
# Create col_types: list(<collector_character>, <collector_logical>, ...)
col_types <- get_cols()
# Select CSV dialect, see https://specs.frictionlessdata.io/csv-dialect/
# Note that dialect can be NULL
dialect <- read_descriptor(resource$dialect, package$directory, safe = TRUE)
# Read data directly
if (resource$read_from == "df") {
df <- dplyr::as_tibble(resource$data)
# Read data from data
} else if (resource$read_from == "data") {
df <- dplyr::as_tibble(do.call(rbind.data.frame, resource$data))
# Read data from path(s)
} else if (resource$read_from == "path" || resource$read_from == "url") {
dataframes <- list()
for (i in seq_along(paths)) {
data <- readr::read_delim(
file = paths[i],
delim = dialect$delimiter %||% ",",
quote = dialect$quoteChar %||% "\"",
escape_backslash = if (dialect$escapeChar %||% "not set" == "\\") {
TRUE
} else {
FALSE
},
escape_double = if (dialect$escapeChar %||% "not set" == "\\") {
# If escapeChar is set, set doubleQuote to FALSE (mutually exclusive)
FALSE
} else {
dialect$doubleQuote %||% TRUE
},
col_names = field_names,
col_types = col_types,
# Use rlang {{}} to avoid `col_select` to be interpreted as the name of
# a column, see https://rlang.r-lib.org/reference/topic-data-mask.html
col_select = {{col_select}},
locale = locale,
na = schema$missingValues %||% "",
comment = dialect$commentChar %||% "",
trim_ws = dialect$skipInitialSpace %||% FALSE,
# Skip header row when present
skip = if (dialect$header %||% TRUE) 1 else 0,
skip_empty_rows = TRUE
)
dataframes[[i]] <- data
}
# Merge data frames for all paths
df <- dplyr::bind_rows(dataframes)
}
return(df)
}
read_resource()
is very long. It could benefit from a more modular approach.✅ Get resource, includes check_package()
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L202-L204
✅ Get paths, schema and fields
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L205-L210
✅ Check all selected columns appear in schema
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L211-L222
❌ Create locale with encoding, decimal_mark and grouping_mark
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L223-L262
get_encoding()
(orencoding()
)resource$encoding
andresource$fields
, so best to passresource
(andpackage
)readr::locale
❌ Create col_types: list(, , ...)
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L263-L327
What we want is:
get_cols()
(orcols()
). It could work on a single field, but more useful is that it always returns areadr::cols()
get_na_cols()
(orna_cols()
) cf. https://kylehusmann.com/interlacer/reference/na_cols.htmlget_col_type()
(orcol()
) for a single field (soget_cols()
is mostly a wrapper).So:
❌ Assign names: list("name1" =, "name2" = ...)
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L328-L330
get_cols()
✅ Select CSV dialect, see https://specs.frictionlessdata.io/csv-dialect/
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L331-L334
✅ Read data directly
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L335-L338
✅ Read data from data
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L339-L342
❌ Read data from path(s)
https://github.com/frictionlessdata/frictionless-r/blob/72b86941a15f84b1ca1416e2c4d48ee73d3fd3ff/R/read_resource.R#L343-L379
Is a for loop over paths, calling
readr::read_delim()
for all and thendplyr::bind_rows()
read_delim()
that calls all theget_cols()
internally rather than building itencoding
andcols
andna_cols
beforehand? But maybe that is harder to debug?