frictionlessdata / frictionless-r

R package to read and write Frictionless Data Packages
https://docs.ropensci.org/frictionless/
Other
28 stars 11 forks source link

Should we use `resource` as an argument? #251

Open peterdesmet opened 3 months ago

peterdesmet commented 3 months ago

The following helper functions have package and resource_name as arguments:

But they are called from functions that already have the resource. That means that currently, they recall get_resource(), in the case for locale() for a third time: read_resource() -> read_from_path() -> locale()

This could be simplified by passing resource as an argument

get_schema() needs package$directory

The problem with get_schema() is that is needs package$directory, so merely passing the resource is not enough. Solutions:

  1. Adding a directory argument to get_schema(): no, this is clunky
  2. Copying the package directory to resource as part of get_resource(). That way, get_schema() can get it from there. It can be removed as part of write_resource()
  3. Other?

Using resource for get_schema()

If the helper functions use resource as an argument, one could argue to also use it for the public function get_schema(resource). If we do that, get_resource() should become a public function too, which can be useful (especially for developers). The downside is that (public) functions that use resource should then check if it is valid. Most of the get_resource() checks should then move to a check_resource() function.

@damianooldoni @khusmann others... Thoughts?

damianooldoni commented 3 months ago

Thanks @peterdesmet for the nice summary. Here below my two cents

get_schema() needs package$directory

I like option 2, adding directory to resource. Maybe you would like to add it as attribute instead? Benefit: you don't need to remove it in write_resource. It is in some way "hidden", but still accessible, e.g. attr(x$resources[[1]], "package_directory").

Here a reprex:

library(frictionless)

x <- example_package()

# Add directory as attribute of the resources
x$resources <- purrr::map(x$resources, ~ {
  attr(.x, "package_directory") <- x$directory
  .x
})

# Show the added attribute by printing all attributes of the resources
purrr::walk(x$resources, ~ print(attributes(.x)))
#> $names
#> [1] "name"      "path"      "profile"   "title"     "format"    "mediatype"
#> [7] "encoding"  "schema"   
#> 
#> $package_directory
#> [1] "C:/R/library/frictionless/extdata"
#> 
#> $names
#> [1] "name"      "path"      "profile"   "title"     "format"    "mediatype"
#> [7] "encoding"  "schema"   
#> 
#> $package_directory
#> [1] "C:/R/library/frictionless/extdata"
#> 
#> $names
#> [1] "name"    "data"    "profile" "title"   "schema" 
#> 
#> $package_directory
#> [1] "C:/R/library/frictionless/extdata"

Created on 2024-07-18 with reprex v2.1.0

Using resource for get_schema()

Indeed, I agree with all the statements here. In particular, it sounds to me not completely logical that the a user can read the data of a resource, but not getting it "as it is".

Moving the resource package checks to a check_resource() is also very nice as it makes the checks more modular and so better software speaking (better to debug/refactor/read...).

khusmann commented 3 months ago

Maybe you would like to add it as attribute instead?

For a while I've been thinking about an approach close to this, but where resources are loaded by read_resource as subclassed tibbles and contain all the resource metadata in a "frictionless" attribute. (This would scope all the frictionless attrs inside the tibble, separating them from other tibble attrs related to its display, e.g. class, row.names, etc).

But I guess here we're talking about just resource descriptors, without their data loaded. Still, I think a resource class would be an interesting approach -- this class could be upgraded to the subclassed tibble when the user decided to load data (but then after loading data the object would still work with all the original descriptor fns)

The downside is that (public) functions that use resource should then check if it is valid.

The nice thing about making the resource descriptor its own class, is that you wouldn't need to always be scanning the descriptor to see if it was valid if you only allowed valid resources to be stamped with the custom class. (So the class operates in a sort of newtype pattern)

This would mean that you would not want to allow resource descriptors to be mutated directly; instead you'd provide an API that would allow the user to add / remove descriptor properties but always verify they resulted in a valid descriptor. (Similar to what I suggest in #198)

I'll see about pulling together an example to demonstrate more of what I mean...