frictionlessdata / frictionless-r

R package to read and write Frictionless Data Packages
https://docs.ropensci.org/frictionless/

unexpected behaviour for metadata with NULL or NA content #203

Open ElsLommelen opened 2 months ago

ElsLommelen commented 2 months ago

When adding a metadata item without content (e.g. because all other tables or columns do have content for this item), writing and reading the package alters that content: NA becomes NULL, NULL becomes Named list(), and the written datapackage.json differs as well. I don't need a distinction between NA and NULL, so I don't mind if they are saved and reloaded as the same value (or not at all). But I find it annoying that the written file changes on a second write when the metadata value was first set to NA (which is more or less the default that R functions return when no data are available).

The reprex demonstrates the issue for metadata at the table level, but metadata at the column level behaves similarly.

library(frictionless)
#> Warning: package 'frictionless' was built under R version 4.3.3

# creating a package with metadata title = NULL and description = NA
my_package <-
  create_package() |>
  add_resource(
    resource_name = "iris",
    data = iris,
    title = NULL,
    description = NA
  )
str(my_package)
#> List of 2
#>  $ resources:List of 1
#>   ..$ :List of 10
#>   .. ..$ name       : chr "iris"
#>   .. ..$ data       :'data.frame':   150 obs. of  5 variables:
#>   .. .. ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>   .. .. ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>   .. .. ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>   .. .. ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>   .. .. ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#>   .. ..$ profile    : chr "tabular-data-resource"
#>   .. ..$ format     : NULL
#>   .. ..$ mediatype  : NULL
#>   .. ..$ encoding   : NULL
#>   .. ..$ dialect    : NULL
#>   .. ..$ title      : NULL
#>   .. ..$ description: logi NA
#>   .. ..$ schema     :List of 1
#>   .. .. ..$ fields:List of 5
#>   .. .. .. ..$ :List of 2
#>   .. .. .. .. ..$ name: chr "Sepal.Length"
#>   .. .. .. .. ..$ type: chr "number"
#>   .. .. .. ..$ :List of 2
#>   .. .. .. .. ..$ name: chr "Sepal.Width"
#>   .. .. .. .. ..$ type: chr "number"
#>   .. .. .. ..$ :List of 2
#>   .. .. .. .. ..$ name: chr "Petal.Length"
#>   .. .. .. .. ..$ type: chr "number"
#>   .. .. .. ..$ :List of 2
#>   .. .. .. .. ..$ name: chr "Petal.Width"
#>   .. .. .. .. ..$ type: chr "number"
#>   .. .. .. ..$ :List of 3
#>   .. .. .. .. ..$ name       : chr "Species"
#>   .. .. .. .. ..$ type       : chr "string"
#>   .. .. .. .. ..$ constraints:List of 1
#>   .. .. .. .. .. ..$ enum: chr [1:3] "setosa" "versicolor" "virginica"
#>  $ directory: chr "."
#>  - attr(*, "class")= chr [1:2] "datapackage" "list"

# writing the package
write_package(my_package, "irisdir")

# in datapackage.json, title = {} and description = null

# when reading the package again, title = Named list() and description = NULL
my_loaded_package <- read_package("irisdir/datapackage.json")
str(my_loaded_package)
#> List of 2
#>  $ resources:List of 1
#>   ..$ :List of 9
#>   .. ..$ name       : chr "iris"
#>   .. ..$ path       : chr "iris.csv"
#>   .. ..$ profile    : chr "tabular-data-resource"
#>   .. ..$ format     : chr "csv"
#>   .. ..$ mediatype  : chr "text/csv"
#>   .. ..$ encoding   : chr "utf-8"
#>   .. ..$ title      : Named list()
#>   .. ..$ description: NULL
#>   .. ..$ schema     :List of 1
#>   .. .. ..$ fields:List of 5
#>   .. .. .. ..$ :List of 2
#>   .. .. .. .. ..$ name: chr "Sepal.Length"
#>   .. .. .. .. ..$ type: chr "number"
#>   .. .. .. ..$ :List of 2
#>   .. .. .. .. ..$ name: chr "Sepal.Width"
#>   .. .. .. .. ..$ type: chr "number"
#>   .. .. .. ..$ :List of 2
#>   .. .. .. .. ..$ name: chr "Petal.Length"
#>   .. .. .. .. ..$ type: chr "number"
#>   .. .. .. ..$ :List of 2
#>   .. .. .. .. ..$ name: chr "Petal.Width"
#>   .. .. .. .. ..$ type: chr "number"
#>   .. .. .. ..$ :List of 3
#>   .. .. .. .. ..$ name       : chr "Species"
#>   .. .. .. .. ..$ type       : chr "string"
#>   .. .. .. .. ..$ constraints:List of 1
#>   .. .. .. .. .. ..$ enum: chr [1:3] "setosa" "versicolor" "virginica"
#>  $ directory: chr "irisdir"
#>  - attr(*, "class")= chr [1:2] "datapackage" "list"

write_package(my_loaded_package, "irisdir2")

# and in this datapackage.json, title = {} and description = {}

Created on 2024-04-16 with reprex v2.0.2
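The same round trip can be reproduced without frictionless. This is a minimal sketch that assumes the package serialises its metadata with jsonlite (`toJSON()` and `fromJSON()` with default `null` and `na` settings); the `meta` list is a stand-in for a resource, not the real object.

```r
library(jsonlite)

# Stand-in for a resource's metadata (assumption: not the real frictionless object).
meta <- list(name = "iris", title = NULL, description = NA)

# First write: NULL is serialised as {} and the logical NA as null.
json1 <- toJSON(meta, auto_unbox = TRUE)
print(json1)

# Read back: {} becomes an empty named list, null becomes NULL.
meta2 <- fromJSON(json1)
str(meta2)

# Second write: title and description are now both serialised as {}.
json2 <- toJSON(meta2, auto_unbox = TRUE)
print(json2)

# The two files differ even though nothing was edited in between.
identical(as.character(json1), as.character(json2))
```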

peterdesmet commented 2 months ago

Thanks for reporting. We have a helper function clean_list() that can sanitize NULL, list(), etc. We could run it on the resource or data package before writing, but I'm afraid it might have unintended side effects.
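To illustrate what such a clean-up could look like, here is a rough sketch. `drop_empty()` is a hypothetical helper written for this example; it is not the package's clean_list() and makes no claim about how that function works.

```r
# Hypothetical helper (assumption: not frictionless's clean_list()).
# Recursively removes NULL, empty-list and single-NA entries from a list.
# Note: attributes other than names (e.g. a class) are not preserved.
drop_empty <- function(x) {
  # Leave atomic values and data frames (e.g. a resource's data) untouched.
  if (!is.list(x) || is.data.frame(x)) {
    return(x)
  }
  x <- lapply(x, drop_empty)
  is_empty <- vapply(
    x,
    function(el) {
      is.null(el) ||
        (is.list(el) && length(el) == 0) ||
        (is.atomic(el) && length(el) == 1 && is.na(el))
    },
    logical(1)
  )
  x[!is_empty]
}

meta <- list(
  name = "iris",
  title = NULL,
  description = NA,
  schema = list(fields = list(list(name = "Species", title = NA)))
)
str(drop_empty(meta))
```

A blanket clean-up like this also removes every property a user deliberately left empty, at any nesting level, which may be the kind of side effect to watch out for.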

It's probably better to extend this line:

https://github.com/frictionlessdata/frictionless-r/blob/5024c909e67591a6eae9347e31ac99d6fa795749/R/write_package.R#L77C29-L77C35

With the properties null = "null" and na = "null", so values are always exported the same way (as NULL, which is the default for lists).
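A small sketch of the effect of those two arguments, assuming the linked line is a `jsonlite::toJSON()` call; `auto_unbox = TRUE` and the `meta` list are assumptions made for this example.

```r
library(jsonlite)

# Stand-in for a resource's metadata (assumption: not the real frictionless object).
meta <- list(name = "iris", title = NULL, description = NA)

# With null = "null" and na = "null", both properties are written as null.
json <- toJSON(meta, auto_unbox = TRUE, null = "null", na = "null")
print(json)

# Reading and writing again now yields the same JSON.
reloaded <- fromJSON(json)
json_again <- toJSON(reloaded, auto_unbox = TRUE, null = "null", na = "null")
identical(as.character(json), as.character(json_again))
```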

peterdesmet commented 2 months ago

Using na = "string" would cause all NA values to be exported as "NA" (and thus differently from NULL values). This is probably not desirable, since reading the package would not automatically interpret those as NA. At least NULL has an inherent meaning in lists.

ElsLommelen commented 2 months ago

> With the properties null = "null" and na = "null", so values are always exported the same way (as NULL, which is the default for lists).

This indeed seems like a good solution. I suppose the current settings are null = "list" and na = "null", so replacing the first with null = "null" would give the same result after writing for both NULL and NA.