Open peterdesmet opened 2 weeks ago
@PietrH @sannegovaert @damianooldoni @khusmann thoughts?
@khusmann we would be able to detect what list we have (package, resource, schema) with the API change you suggested (#252), but currently we'll have to make due with dumb lists. 😄
Hmm, I like version()
but that it would alias base::version()
. I wonder if it'd make sense to namespace our methods with a prefix, similar to how forcats does? forcats uses fct_*()
; we could use fls_*
or something? (So fls_version()
, fls_schema()
etc.)
Now that you bring this up, I'm realizing that the issue of supporting 2.0 is a little more complex than I was thinking at first. Do we want frictionless-r to upgrade v1 packages to v2 when they are loaded? If packages are upgraded, then all frictionless-r methods can all expect v2 schemas & features, which keeps implementation simple.
Simultaneous support of v1 and v2 seems potentially complex. We could:
fls1_*()
vs fls2_*()
library(frictionless2)
Option 3 would give us the most separation between versions and be easiest to support, I think. library(frictionless)
would give you an API that would load & write frictionless v1, whereas library(frictionless2)
would give you an API that would load and write frictionless v2. And for backwards compatibility, if you loaded a v1 package with library(frictionless2)
, it'd just get upgraded to v2.
Anyway, these are just some initial thoughts!
As far as I'm aware, there is no base::version()
, so there's no conflict. On prefixes: some tidyverse use it (like stringr or forcats) and others don't (like purrr and dplyr). I think it makes the most sense if there is an obvious two-or-three-letter abbreviation, dp
or fls
doesn't feel obvious to me.
I think I want the following design principle for supporting Data Package v2:
flowchart TD
p1["`**package (v1)**
resource 1 (v1)`"]
r1["resource 1 (v1)"]
p2["`**package (v2)**
resource 1 (v1)
resource 2 (v2)`"]
read_resource("read_resource()")
add_resource("add_resource()")
p1 --> read_resource --> r1
p1 --> add_resource --> p2
However, doing so doesn't give users the choice of manipulating a Data Package, but having it stay a v1 (i.e. the current frictionless behaviour). Solutions:
add_resource()
(do you want this to be an v1 or v2 resource?). If they pick v2, the package becomes a v2 as well.It is a choice we have to make soon though, because changes like https://github.com/frictionlessdata/frictionless-r/pull/258 don't make sense for frictionless if we opt for frictionless2
As far as I'm aware, there is no
base::version()
, so there's no conflict.
Ah yes, it's not base::version()
, it looks like it's just a global constant, not a function: base::version
. But frictionless::version()
would still mask this.
I think it makes the most sense if there is an obvious two-or-three-letter abbreviation, dp or fls doesn't feel obvious to me.
Yeah, I don't like fls either -- but if we found a good abbrv it would help avoid aliasing & also be nice for autocomplete (e.g. type dp_<tab>
would list all available frictionless verbs). That said, I don't feel strongly on this; just brainstorming.
I think I want the following design principle for supporting Data Package v2:
What you lay out here is similar to what I was imagining re: one api for both versions. It's do-able, but definitely introduces significant complexity, and gives me pause about long-term maintainability / edge cases.
However, doing so doesn't give users the choice of manipulating a Data Package, but having it stay a v1 (i.e. the current frictionless behaviour). Have them use an older version of frictionless, which is not ideal, since updates in dependencies etc. might cause that old version to break.
Exactly -- I think the ability to continue to support workflows in v1-land (without using old versions of frictionless-r) is very valuable.
Cons: more maintenance
I think the unfortunate reality is that if we want to continue the support of manipulating & writing v1 packages, we're stuck with the maintenance of v1 and v2 code-paths. The question is how we manage this complexity. Making v1 and v2 separate packages would keep these concerns separate and make it so that we don't have to think "how will this impact the v1 flow?" every time we do something in v2, and vise versa. Yes, it would duplicate some code, but in this case that's actually reducing complexity by decoupling the implementations.
doesn't really convey the message that Data Package v2 is backwards compatible with v1
I think a separate package actually gives us simpler design principles that mirror the v2 backwards compatibility rules:
library(frictionless)
reads and writes v1 packages.library(frictionless2)
reads v1 and v2 packages, but only writes v2 packages. (So v1 packages are instantly converted to v2 with a warning or something when they are loaded)So we're backwards compatible in that a v1 package is always 100% usable with both library(frictionless)
and library(frictionless2)
, but if you want to use v2 features, you should use library(frictionless2)
I think supporting packages with mixed v1 and v2 resources would open a can of worms...
It is a choice we have to make soon though, because changes like #258 don't make sense for frictionless if we opt for frictionless2
Exactly, we're at a perfect crossroads if we like this direction: we'd freeze frictionless with the API it currently has and only update with bugfixes & v1 features. That'd then give us a clean slate & separate sandbox for frictionless2, where we are free to add features (and make API changes) without making v1 unstable.
I'm not a big fan of creating a frictionless2
package, I think this will be confusing for new users just hearing about frictionless.
About version()
, what do we expect users to do with this information? Ideally the package should just handle all that stuff and users don't need to care about what version they are reading. If it's (mostly) an internal function, I see little harm in calling it get_version()
to avoid a namespace collision, even if the other functions in frictionless-r
stop using verbs as a default.
To support Data Package v2 we need to be able to detect the version used by a package.
$schema
is undefinedversion
=1.0
profile
property in this case (see backward compatibility note) but frictionless-r ignores this property since it doesn't use it (it is useful for validation etc.).$schema
=https://datapackage.org/profiles/1.0/datapackage.json
version
=1.0
profile
is ignored (since new property$schema
is used)$schema
=https://datapackage.org/profiles/2.0/datapackage.json
version
=2.0
profile
is ignored (deprecated in 2.0)$schema
=https://datapackage.org/profiles/2.1-rc.1/datapackage.json
version
=2.1-rc.1
(theoretical example)profile
is ignored$schema
=https://fiscal.datapackage.org/profiles/fiscal-data-package.json
,https://raw.githubusercontent.com/tdwg/camtrap-dp/1.0.1/camtrap-dp-profile.json
or any other valueversion
=>=2.0
: we can't detect the version, but it is higher or equal to2.0
since$schema
is used.profile
is ignoredEven if we would read
profile
, the end result would still beversion = 1.0
profile
is undefinedprofile
is assumed to bedata-package
(see https://specs.frictionlessdata.io/profiles/#introduction)version
= 1.0 (implied byprofile
use)profile
=data-package
version
= 1.0 (implied byprofile
use)profile
=tabular-data-package
,fiscal-data-package
,https://raw.githubusercontent.com/tdwg/camtrap-dp/1.0.1/camtrap-dp-profile.json
or any other valueversion
= 1.0 (implied byprofile
use)In my opinion, the best way to implement this is with a
version(package)
function. This allows us in the future to create aversion()<-
function. Alternative names:get_version()
: this limits us in the future from having aversion()
. I'm tempted to rename all functions that start withget_
package_version()
: the version logic is the same for package, resource, dialect, schema: it's just the name of the file in the URL that is different (datapackage.json
,dataresource.json
). I therefore think we can make one function for all of these, rather than four functions.I think we can generalize to a
version(list)
function:$schema
and get the version from the URL if it starts withhttps://datapackage.org
, otherwise use1.0
(if undefined$schema
) or>2.0
.https://raw.githubusercontent.com/tdwg/camtrap-dp/1.0.1/camtrap-dp-profile.json
if it's a package, resource, etc. I think this can be solved with an extra argument in the set function:version(level = "schema") <- 2.0
would assignhttps://datapackage.org/profiles/2.0/tableschema.json