frictionlessdata / frictionless-r

R package to read and write Frictionless Data Packages
https://docs.ropensci.org/frictionless/

Create function `update_schema()` to edit field properties #70

Open peterdesmet opened 2 years ago

peterdesmet commented 2 years ago

A created schema will only have the field properties name, type and (sometimes) constraints. I see it as fairly common to add more properties, such as description, required etc. It is possible to do that with purrr, but it isn't very straightforward. Maybe a specific function would be useful.

Create schema:

library(frictionless)
iris_schema <- create_schema(iris)
str(iris_schema)
#> List of 1
#>  $ fields:List of 5
#>   ..$ :List of 2
#>   .. ..$ name: chr "Sepal.Length"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Sepal.Width"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Petal.Length"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Petal.Width"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 3
#>   .. ..$ name       : chr "Species"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ enum: chr [1:3] "setosa" "versicolor" "virginica"

Atomic function

iris_schema <- edit_field_property(iris_schema, "Sepal.Width", "description", "Sepal width in cm.")
# Same as: iris_schema$fields[[2]]$description <- "Sepal width in cm."

Not sure this is super useful, but it is very clear what field you are setting.
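
A minimal sketch of how such an atomic function could look in base R. `edit_field_property()` is the hypothetical name from the example above and does not exist in frictionless; the schema is built by hand as a stand-in for `create_schema(iris)`.

```r
# Hypothetical sketch of edit_field_property(); not part of frictionless.
edit_field_property <- function(schema, field_name, property, value) {
  # Look up the position of the field by its name
  field_names <- vapply(schema$fields, function(f) f$name, character(1))
  i <- match(field_name, field_names)
  if (is.na(i)) {
    stop("Can't find field \"", field_name, "\" in schema.", call. = FALSE)
  }
  # Assigning NULL removes the property, any other value sets it
  schema$fields[[i]][[property]] <- value
  schema
}

# Hand-built stand-in for the first two fields of create_schema(iris)
iris_schema <- list(fields = list(
  list(name = "Sepal.Length", type = "number"),
  list(name = "Sepal.Width", type = "number")
))
iris_schema <- edit_field_property(
  iris_schema, "Sepal.Width", "description", "Sepal width in cm."
)
```

Looking up the field by name rather than by index is what makes this variant explicit, at the cost of one call per field.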

Loop function

iris_schema <- edit_fields(
  iris_schema,
  "description",
  c("Sepal length in cm.", "Sepal width in cm.", "Petal length in cm.", "Petal width in cm.", NA_character_)
)
# If value is NA or NULL, don't set property

Faster, but disconnect between field name and value you want to set.

Recode like function

iris_schema <- edit_fields(
  iris_schema,
  "description",
  "Sepal.Length" = "Sepal length in cm.",
  "Sepal.Width" = "Sepal width in cm.",
  "Species" = NA_character_
)
# If field is not listed, don't set property
# If field is listed but NA or NULL, remove it

Note, it should also work for nested properties:

iris_schema <- edit_fields(
  iris_schema,
  "constraints$required",
  "Sepal.Length" = TRUE
)
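
One possible way to support a nested property such as "constraints$required" is to split the path on "$" and set the value recursively. The helper `set_nested()` below is a hypothetical name used for illustration only, and the field is a hand-built stand-in for the Species field of `create_schema(iris)`.

```r
# Hypothetical helper: set a (possibly nested) property on a single field.
set_nested <- function(x, path, value) {
  if (length(path) == 1) {
    # Last path element: set the value (NULL removes it)
    x[[path]] <- value
    return(x)
  }
  # Descend one level, creating an empty list if the parent is missing
  child <- x[[path[1]]]
  if (is.null(child)) child <- list()
  x[[path[1]]] <- set_nested(child, path[-1], value)
  x
}

# "constraints$required" becomes c("constraints", "required")
path <- strsplit("constraints$required", "$", fixed = TRUE)[[1]]

# Field with existing constraints, and one without
species <- list(
  name = "Species",
  type = "string",
  constraints = list(enum = c("setosa", "versicolor", "virginica"))
)
sepal_length <- list(name = "Sepal.Length", type = "number")

species <- set_nested(species, path, TRUE)
sepal_length <- set_nested(sepal_length, path, TRUE)
```

Existing constraints (here `enum`) are kept, and a missing `constraints` list is created on the fly.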

damianooldoni commented 2 years ago

After our short chat, I completely agree on the benefit of having such a function in this package to cover basic and quite typical steps of handling data packages. Some thoughts:

  1. I think it will still be important to show in the documentation how the purrr function imap() is used within edit_fields(). That way users can be inspired to write their own custom functions for cases too specific to be included in the package. Sooner or later something like that will happen.
  2. I like the loop approach, but it's true: there is no link between field name and value to set, so bad mistakes can occur! Unless you use named vectors! 👍 See the example below.
  3. The recode-like function is nice and easy to use as it is very tidyverse-like. The only drawback is its verbosity when it has to be applied to many fields: typos can arise as users have to write a lot within the same function. This is the reason why I seldom use recode in my daily life 😄

I think we should go for the loop option. Below I show a simple way to solve its drawback by using a named vector:

# get field names
field_names <- purrr::map_chr(iris_schema$fields, ~ .x$name)

field_names
#> [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

# define values as a named vector
values <- c("Sepal length in cm.", "Sepal width in cm.", "Petal length in cm.", "Petal width in cm.", NA_character_)
names(values) <- field_names
values

iris_schema <- edit_fields(
  iris_schema,
  "description",
  values
)

So, if the user provides an unnamed vector, then the order of the fields is used: maybe a message can report the order the function will use. Otherwise, the values are matched to the fields by the names of the vector.
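
The behaviour described above could be sketched as follows. This is only an illustration: `edit_fields()` is the hypothetical function discussed in this thread, the error and message texts are assumptions, and the schema is a hand-built stand-in for the output of `create_schema()`.

```r
# Hypothetical sketch of edit_fields(); names and behaviour are assumptions.
edit_fields <- function(schema, property, values) {
  field_names <- vapply(schema$fields, function(f) f$name, character(1))

  if (is.null(names(values))) {
    # Unnamed vector: fall back to field order and tell the user
    if (length(values) > length(field_names)) {
      stop("More values than fields in schema.", call. = FALSE)
    }
    names(values) <- field_names[seq_along(values)]
    message("Values assigned in field order: ", toString(names(values)))
  } else {
    # Named vector: every name must match a field
    unknown <- setdiff(names(values), field_names)
    if (length(unknown) > 0) {
      stop("Unknown field(s): ", toString(unknown), call. = FALSE)
    }
  }

  for (i in seq_along(schema$fields)) {
    name <- field_names[i]
    # Skip fields that are not listed or whose value is NA
    if (name %in% names(values) && !is.na(values[[name]])) {
      schema$fields[[i]][[property]] <- values[[name]]
    }
  }
  schema
}

schema <- list(fields = list(
  list(name = "Sepal.Length", type = "number"),
  list(name = "Species", type = "string")
))

# Names make the order of the values irrelevant
values <- c(Species = NA_character_, Sepal.Length = "Sepal length in cm.")
schema <- edit_fields(schema, "description", values)
```

Failing on unknown names catches typos early, which addresses the main risk of the recode-like variant as well.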

@peterdesmet: in this way I think the loop function will match all our expectations. What do you think?

peterdesmet commented 2 years ago

Also suggested by @beatrizmilz in https://github.com/ropensci/software-review/issues/495#issuecomment-1025860861:

Adding the descriptions to the schema does not seem trivial. There is an example with the purrr package. But the example might not be simple to understand if someone is not used to the purrr package.

I'm talking about this piece of code:

iris_schema <- create_schema(iris)

# Remove description for first field
iris_schema$fields[[1]]$description <- NULL

# Set descriptions for all fields
descriptions <- c(
  "Sepal length in cm.",
  "Sepal width in cm.",
  "Petal length in cm.",
  "Petal width in cm.",
  "Iris species."
)
iris_schema$fields <- purrr::imap(
  iris_schema$fields,
  ~ c(.x, description = descriptions[.y])
)

Do the authors think that it is possible to create a function to add descriptions to the schema, in a way that is used similarly to the other functions of the package? Example of the idea:

iris_schema <- create_schema(iris) |>
  add_description(
    c(
      "Sepal length in cm.",
      "Sepal width in cm.",
      "Petal length in cm.",
      "Petal width in cm.",
      "Iris species."
    )
  )

peterdesmet commented 7 months ago

Finally got some time to think about this.

Workflow

  1. Create a schema first:
schema <-
  PlantGrowth %>%
  create_schema()
str(schema)
#> List of 1
#>  $ fields:List of 2
#>   ..$ :List of 2
#>   .. ..$ name: chr "weight"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 3
#>   .. ..$ name       : chr "group"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ enum: chr [1:3] "ctrl" "trt1" "trt2"
  2. Properties can be added to each field in the schema by providing an unnamed vector to update_schema() (cf. what @damianooldoni suggested above). Properties are added based on field order. Here we only provide a vector of length 1, so only the first field gets the property.
schema <-
  schema %>%
  update_schema(
    property = "unit",
    values = c("g")
  )
str(schema)
#> List of 1
#>  $ fields:List of 2
#>   ..$ :List of 2
#>   .. ..$ name: chr "weight"
#>   .. ..$ type: chr "number"
#>   .. ..$ unit: chr "g" <--------
#>   ..$ :List of 3
#>   .. ..$ name       : chr "group"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ enum: chr [1:3] "ctrl" "trt1" "trt2"
  3. Properties can also be added by providing a named vector to update_schema(). The convenience function field_names() is used to name the vector (#196):
descriptions <- c("Weight of the plant", "Group the plant is in")
names(descriptions) <- field_names(schema)
descriptions
#>                  weight                   group 
#>   "Weight of the plant" "Group the plant is in"
schema <-
  schema %>%
  update_schema(
    property = "description",
    values = descriptions
  )
str(schema)
#> List of 1
#>  $ fields:List of 2
#>   ..$ :List of 2
#>   .. ..$ name: chr "weight"
#>   .. ..$ type: chr "number"
#>   .. ..$ unit: chr "g"
#>   .. ..$ description: chr "Weight of the plant" <--------
#>   ..$ :List of 3
#>   .. ..$ name       : chr "group"
#>   .. ..$ type       : chr "string"
#>   .. ..$ description: chr "Group the plant is in" <--------
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ enum: chr [1:3] "ctrl" "trt1" "trt2"
  4. You can't update reserved properties:
schema <-
  schema %>%
  update_schema(
    property = "name",
    values = c("foo")
  )
#> Error: "name" is a reserved field property.
  5. The resource with the custom-made schema can be added to a package:
package <-
  create_package() %>%
  add_resource("plant-growth", PlantGrowth, schema = schema)
  6. If you want to update the schema of an already attached resource (not advised), you can assign it directly:
package$resources[[1]]$schema <- schema
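
To make the reserved-property rule from the workflow concrete, here is a rough base R sketch. It is not the package's implementation: it handles unnamed values only, and the list of reserved properties is an assumption (this thread only confirms "name"). The schema is a hand-built stand-in for `create_schema(PlantGrowth)` without the constraints.

```r
# Hypothetical sketch of update_schema(); not the package's implementation.
update_schema <- function(schema, property, values) {
  # Assumption: only "name" is confirmed as reserved in this thread
  reserved <- c("name")
  if (property %in% reserved) {
    stop("\"", property, "\" is a reserved field property.", call. = FALSE)
  }
  # Unnamed values are applied in field order; extra fields are left untouched
  n <- min(length(values), length(schema$fields))
  for (i in seq_len(n)) {
    schema$fields[[i]][[property]] <- values[[i]]
  }
  schema
}

schema <- list(fields = list(
  list(name = "weight", type = "number"),
  list(name = "group", type = "string")
))

# A vector of length 1 only sets the property on the first field
schema <- update_schema(schema, property = "unit", values = c("g"))

# Updating a reserved property fails with an error
result <- tryCatch(
  update_schema(schema, property = "name", values = c("foo")),
  error = function(e) conditionMessage(e)
)
```

Checking the property before touching any field means a failed call leaves the schema unchanged.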

Function name

I'm tempted to go for update_schema() rather than edit_fields(). update_fields() would be a valid alternative, but it is less clear that it returns a schema (not fields).

get_schema(package, resource_name) => schema
create_schema(df) => schema
update_schema(schema) => schema <-----
field_names(schema) => vector

@damianooldoni @PietrH @nepito what do you think?