frictionlessdata / frictionless-r

R package to read and write Frictionless Data Packages
https://docs.ropensci.org/frictionless/
Other
27 stars 11 forks source link

Detect Data Package version with `version()` function #262

Open peterdesmet opened 2 weeks ago

peterdesmet commented 2 weeks ago

To support Data Package v2 we need to be able to detect the version used by a package.

Even if we would read profile, the end result would still be version = 1.0


In my opinion, the best way to implement this is with a version(package) function. This allows us in the future to create a version()<- function. Alternative names:

I think we can generalize to a version(list) function:

peterdesmet commented 2 weeks ago

@PietrH @sannegovaert @damianooldoni @khusmann thoughts?

@khusmann we would be able to detect what list we have (package, resource, schema) with the API change you suggested (#252), but currently we'll have to make due with dumb lists. 😄

khusmann commented 2 weeks ago

Hmm, I like version() but that it would alias base::version(). I wonder if it'd make sense to namespace our methods with a prefix, similar to how forcats does? forcats uses fct_*(); we could use fls_* or something? (So fls_version(), fls_schema() etc.)

Now that you bring this up, I'm realizing that the issue of supporting 2.0 is a little more complex than I was thinking at first. Do we want frictionless-r to upgrade v1 packages to v2 when they are loaded? If packages are upgraded, then all frictionless-r methods can all expect v2 schemas & features, which keeps implementation simple.

Simultaneous support of v1 and v2 seems potentially complex. We could:

  1. Use the same API, but then functions would sometimes need to change behavior when they were operating on a v1 vs v2 descriptor, and that might be hard to manage / maintain. It also creates potential unexpected behavior when mixing versions (reading from multiple sources working at different versions)...
  2. Namespace our API based on frictionless version, e.g. fls1_*() vs fls2_*()
  3. Create a new package for frictionless v2, e.g. library(frictionless2)

Option 3 would give us the most separation between versions and be easiest to support, I think. library(frictionless) would give you an API that would load & write frictionless v1, whereas library(frictionless2) would give you an API that would load and write frictionless v2. And for backwards compatibility, if you loaded a v1 package with library(frictionless2), it'd just get upgraded to v2.

Anyway, these are just some initial thoughts!

peterdesmet commented 2 weeks ago

As far as I'm aware, there is no base::version(), so there's no conflict. On prefixes: some tidyverse use it (like stringr or forcats) and others don't (like purrr and dplyr). I think it makes the most sense if there is an obvious two-or-three-letter abbreviation, dp or fls doesn't feel obvious to me.

v2 support

I think I want the following design principle for supporting Data Package v2:

  1. An unchanged v1 package remains v1 and is written to disk as v1
  2. An unchanged v1 resource remains v1, even if it is part of a v2 package. I don't think this was discussed in the Data Package WG, but I believe it is supported/designed to be like this. I quite like the idea of not touching untouched resources.
  3. An added/replaced resource is a v2 resource and as a result, its containing package becomes v2.
  4. An unchanged v2 package/resource stays a v2 package/resource
flowchart TD
    p1["`**package (v1)**
    resource 1 (v1)`"]
    r1["resource 1 (v1)"]
    p2["`**package (v2)**
    resource 1 (v1)
    resource 2 (v2)`"]
    read_resource("read_resource()")
    add_resource("add_resource()")

    p1 --> read_resource --> r1
    p1 --> add_resource --> p2

However, doing so doesn't give users the choice of manipulating a Data Package, but having it stay a v1 (i.e. the current frictionless behaviour). Solutions:

  1. Have them use an older version of frictionless, which is not ideal, since updates in dependencies etc. might cause that old version to break.
  2. Have them provide a version to add_resource() (do you want this to be an v1 or v2 resource?). If they pick v2, the package becomes a v2 as well.
  3. Have them use a frictionless2 package. It's an interesting suggestion I hadn't considered yet.
    • Pros: Very clear for the user (and developers) + less complexity within functions + easier for contributors
    • Cons: more maintenance (e.g. feature support like #59) needs to be implemented twice or copied + doesn't really convey the message that Data Package v2 is backwards compatible with v1
    • Unsure: should a frictionless2 drop design principle 1 and potentially 3 and upconvert an incoming v1 (and its resources) to v2?

It is a choice we have to make soon though, because changes like https://github.com/frictionlessdata/frictionless-r/pull/258 don't make sense for frictionless if we opt for frictionless2

khusmann commented 2 weeks ago

As far as I'm aware, there is no base::version(), so there's no conflict.

Ah yes, it's not base::version(), it looks like it's just a global constant, not a function: base::version. But frictionless::version() would still mask this.

I think it makes the most sense if there is an obvious two-or-three-letter abbreviation, dp or fls doesn't feel obvious to me.

Yeah, I don't like fls either -- but if we found a good abbrv it would help avoid aliasing & also be nice for autocomplete (e.g. type dp_<tab> would list all available frictionless verbs). That said, I don't feel strongly on this; just brainstorming.

I think I want the following design principle for supporting Data Package v2:

What you lay out here is similar to what I was imagining re: one api for both versions. It's do-able, but definitely introduces significant complexity, and gives me pause about long-term maintainability / edge cases.

However, doing so doesn't give users the choice of manipulating a Data Package, but having it stay a v1 (i.e. the current frictionless behaviour). Have them use an older version of frictionless, which is not ideal, since updates in dependencies etc. might cause that old version to break.

Exactly -- I think the ability to continue to support workflows in v1-land (without using old versions of frictionless-r) is very valuable.

Cons: more maintenance

I think the unfortunate reality is that if we want to continue the support of manipulating & writing v1 packages, we're stuck with the maintenance of v1 and v2 code-paths. The question is how we manage this complexity. Making v1 and v2 separate packages would keep these concerns separate and make it so that we don't have to think "how will this impact the v1 flow?" every time we do something in v2, and vise versa. Yes, it would duplicate some code, but in this case that's actually reducing complexity by decoupling the implementations.

doesn't really convey the message that Data Package v2 is backwards compatible with v1

I think a separate package actually gives us simpler design principles that mirror the v2 backwards compatibility rules:

So we're backwards compatible in that a v1 package is always 100% usable with both library(frictionless) and library(frictionless2), but if you want to use v2 features, you should use library(frictionless2)

I think supporting packages with mixed v1 and v2 resources would open a can of worms...

It is a choice we have to make soon though, because changes like #258 don't make sense for frictionless if we opt for frictionless2

Exactly, we're at a perfect crossroads if we like this direction: we'd freeze frictionless with the API it currently has and only update with bugfixes & v1 features. That'd then give us a clean slate & separate sandbox for frictionless2, where we are free to add features (and make API changes) without making v1 unstable.

PietrH commented 2 weeks ago

I'm not a big fan of creating a frictionless2 package, I think this will be confusing for new users just hearing about frictionless.

About version(), what do we expect users to do with this information? Ideally the package should just handle all that stuff and users don't need to care about what version they are reading. If it's (mostly) an internal function, I see little harm in calling it get_version() to avoid a namespace collision, even if the other functions in frictionless-r stop using verbs as a default.