frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

Write down versioning / changing rules #858

Open peterdesmet opened 9 months ago

peterdesmet commented 9 months ago

Hi all, the communication on the Frictionless specs update names it v2 (version 2, see also #853 #857). The announcement blog post also states (emphasis mine):

The modular approach will of course still be the cornerstone of the Frictionless specs v2, and we won’t introduce any breaking changes.

I'm very happy no breaking changes will be introduced, I think that should be a guiding principle. But following semantic versioning, the specs update should then be a minor version. Given that all major specs† are currently v1, I would argue that the upcoming release is v1.1.

I understand that v2 indicates that there is serious momentum behind the current development (dedicated project, new website). But to anyone who's not closely following Frictionless, v2 seems like a major overhaul without backward compatibility. A v1.1 would (correctly) communicate that while Data Package is now its own standard, most things will work as expected. It also sets us on a path to incorporate more changes in future (minor) releases.

Sidenote: will we version Data Package (the collection of standards) as a whole or will the 4 standards be versioned separately (current approach)? I see benefits and downsides with both approaches.

†All major specs are v1: Data Package, Tabular Data Package, Data Resource, Tabular Data Resource and Table Schema. The exception is CSV Dialect which is v1.2, but it seems this one is renamed to Table dialect so one could argue to start over. Some of the other experimental specs (like Fiscal Package or Views) have other version numbers like 1.0-rc.1 and 1.0-beta.

khusmann commented 9 months ago

+1 -- When I heard the v2 announcement, I immediately assumed it would include breaking changes and was surprised to find it was going to be backwards compatible.

Was v2 chosen because v1.1 felt like it wasn't communicating enough "distance" from v1.0 given the new website, dplib, etc.? If so, a jump to v1.5 might be another option to create separation before/after this initiative, which I would interpret as "major overhaul but no breaking changes".

... that said my opinion isn't very strong on this, so I'm happy to defer to whatever strategy has the most consensus/momentum.

Sidenote: will we version Data Package (the collection of standards) as a whole or will the 4 standards be versioned separately (current approach)? I see benefits and downsides with both approaches.

I think this is an excellent question and definitely warrants further discussion. How it is handled seems intertwined with the standard's governance structure / processes moving forward... Is this the sort of thing we want to/are planning to cover in the working group?

nichtich commented 9 months ago

I would not be surprised if there is an edge case of some artificial piece of data that is compliant with 1.0 but not with the new version, because the existing wording allows things that were never planned to be allowed. Moreover, I think a version 2.0 will attract rather than discourage use.

fjuniorr commented 9 months ago

I don't even think we will need artificial data to hit this problem. https://github.com/frictionlessdata/specs/issues/379 and https://github.com/frictionlessdata/specs/issues/697 are breaking changes that are likely to be discussed and that at some point were added[^20231222T082231] to frictionless-py v5.

[^20231222T082231]: I think https://github.com/frictionlessdata/specs/issues/379 was removed after https://github.com/frictionlessdata/frictionless-py/issues/868 but frictionless-py 5.16.0 converts "dialect": {"delimiter": ";"} to "dialect": {"csv": {"delimiter": ";"}} unless system.standards = "v1" is specified. I noticed this after having some difficulties in creating data packages that would play nice with both frictionless-py and frictionless-r.
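To make the footnote's conversion concrete, here is a minimal sketch of the v1-to-v2 dialect rewrite it describes. This is illustrative only, not frictionless-py's actual code, and it only handles the `delimiter` property as an example:

```python
def normalize_dialect(descriptor: dict) -> dict:
    """Nest a flat v1-style CSV dialect under a "csv" key, mirroring the
    conversion frictionless-py 5.16.0 performs. Illustrative sketch only."""
    dialect = descriptor.get("dialect", {})
    # Treat a flat "delimiter" property as a v1-style CSV dialect.
    if "delimiter" in dialect and "csv" not in dialect:
        descriptor = {**descriptor, "dialect": {"csv": dict(dialect)}}
    return descriptor

resource = {"path": "data.csv", "dialect": {"delimiter": ";"}}
print(normalize_dialect(resource)["dialect"])
# {'csv': {'delimiter': ';'}}
```

A v2-shaped dialect passes through unchanged, which is exactly why mixed tooling (frictionless-py vs frictionless-r) can disagree on which shape a descriptor should have.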

Sidenote: will we version Data Package (the collection of standards) as a whole or will the 4 standards be versioned separately (current approach)? I see benefits and downsides with both approaches.

Thinking about "communication simplicity" I think they should be versioned as a whole. This quote from @roll captures the problem quite well:

For example, we would like to make our Python libs 100% compatible with / implementing the specs. TBH at the moment, I don't really understand what that means: whether there is a frozen v1 of the specs to be compatible with, and where all the current spec changes go (a v1.1/v2 branch of the specs), etc.

To give another example, I can see how frictionless-r could support Tabular Data Resource v2 with https://github.com/frictionlessdata/specs/issues/379 but not support CSV/Table Dialect v2 with https://github.com/frictionlessdata/specs/issues/697. However, this creates an explosion in the number of ways a client could be "standard compliant", creating confusion for users.

roll commented 9 months ago

I think it's a valid point, and as a Working Group, we can vote on the version when we have finished the changelog.

Peter outlined the pros of staying on v1.1 so I'll add some arguments in favor of v2:

TBH, I'm not sure the specs need 100% compliance with semver, as they're not software. For example, JSON Schema versioning was "Draft X" for years and is now yyyy-mm based. Honestly speaking, those Draft X names looked really weird, but they kind of worked: implementors just thought about being compliant with draft version X.

roll commented 9 months ago

@peterdesmet I think we need to treat the core standard and domain-specific extensions as separate projects, so it will be core vX, camtrap vY, fiscal vZ, etc. So I would just version the datapackage repository as a whole (I guess you do the same for camtrap).

PS. Fiscal Data Package as a domain-specific extension moved to its own project - https://github.com/frictionlessdata/datapackage-fiscal

khusmann commented 8 months ago

I just realized "backwards compatibility" / "no breaking changes" has different levels/types of strictness, and I'm not clear where we stand:

1) An implementation designed for v2 spec should be equally capable of reading v1 data packages

2) An implementation designed for v1 spec should be capable of reading v2 data packages (albeit with reduced features)

Different types of modifications to the spec break in different ways:

etc.

In general, it's easier to upgrade software than existing data artifacts... so I'd argue we should hold to (1) and relax (2) to give us more freedom for v2 improvements. It also puts me squarely in the v2 semver camp, because although a given v2 spec implementation will be "backwards compatible with v1 data", it is still "breaking" in that v2 data will not necessarily work with a v1 implementation.
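The two compatibility levels can be sketched with a pair of toy readers. Everything here is invented for illustration ("someNewProperty" stands in for any property a v2 spec might add), and real implementations typically ignore unknown properties rather than reject them; the strict v1 reader just makes the failure of level (2) visible:

```python
# Hypothetical property sets for a v1 and v2 spec (names invented).
V1_KEYS = {"name", "resources"}
V2_KEYS = V1_KEYS | {"someNewProperty"}  # v2 adds a property

def v1_read(descriptor: dict) -> bool:
    # A strict v1 implementation: rejects properties it doesn't know.
    return set(descriptor) <= V1_KEYS

def v2_read(descriptor: dict) -> bool:
    # A v2 implementation: must still accept every valid v1 descriptor,
    # i.e. level (1) above.
    return set(descriptor) <= V2_KEYS

v1_data = {"name": "my-package", "resources": []}
v2_data = {"name": "my-package", "resources": [], "someNewProperty": True}

print(v2_read(v1_data))  # True  -> level (1) holds
print(v1_read(v2_data))  # False -> level (2) is relaxed
```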

peterdesmet commented 8 months ago

Thanks @khusmann for the summary, I completely agree that we should hold to (1) and relax (2), i.e. future software implementations should still be able to read v1 data packages (since those will be around for a long time), but can be slow in adopting new features of v2.

I draw a different conclusion regarding the versioning though, since a v2 spec suggests (to me) that software implementations can at some point give up on v1. A v1.1 indicates that this is still within the same major version of the spec.

roll commented 8 months ago

@peterdesmet Answering https://github.com/frictionlessdata/datapackage/pull/12#issuecomment-1881247519 as I think it will be good to have everything related to the versioning discussion in one place.

Why is it structurally non-breaking for implementations?

By a structurally breaking change, I mean something that will fail all the implementations on the next nightly build. That would happen if we made a breaking change to one of the JSON Schema profiles, e.g. changing schema.fields to be a mapping instead of an array.
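The schema.fields example shows why such a change fails everywhere at once: any code written against the v1 array shape stops working the moment the shape changes. A minimal sketch (the mapping shape is hypothetical and was never adopted):

```python
def field_names_v1(schema: dict) -> list:
    # v1 shape: "fields" is an array of field descriptors.
    return [field["name"] for field in schema["fields"]]

v1_schema = {"fields": [{"name": "id"}, {"name": "value"}]}
print(field_names_v1(v1_schema))  # ['id', 'value']

# Hypothetical mapping shape: {"id": {...}, "value": {...}}.
mapping_schema = {"fields": {"id": {}, "value": {}}}
# field_names_v1(mapping_schema) would raise TypeError: iterating a dict
# yields string keys, and indexing a string with "name" is invalid.
```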

Unfortunately, as the specs were in some places written very broadly, we also have a grey zone. Maybe finiteNumber was a bad example of it, but consider something like the any format for dates: the specs just say that it's implementation-specific, so e.g. changing this will be breaking in an implementation-specific way.

So in my head for v2 I have these tiers (and my opinion on change possibility):

roll commented 8 months ago

Also, it's a peculiarity of working on standards that many kinds of new features (e.g. an added property) don't have full forward compatibility, as e.g. a new constraint will somewhat break the validation completeness of current implementations. So maybe this kind of change might differentiate major and minor versions in our case. E.g.:
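The "validation completeness" point can be illustrated with a toy validator. Here "exclusiveMinimum" merely stands in for any constraint added after v1; the validator and its constraint set are invented for the sketch:

```python
# A v1 validator only knows the constraints that existed when it was written.
KNOWN_V1_CONSTRAINTS = {"required", "unique", "enum"}

def validate_v1(value, constraints: dict) -> list:
    """Return a list of error messages; unknown constraints are skipped."""
    errors = []
    for name, arg in constraints.items():
        if name not in KNOWN_V1_CONSTRAINTS:
            continue  # silently ignores constraints it doesn't know
        if name == "enum" and value not in arg:
            errors.append(f"{value!r} not in enum {arg}")
    return errors

# A constraint introduced after v1 passes through unnoticed, so data that
# violates it still validates cleanly on a v1 implementation.
print(validate_v1(-5, {"exclusiveMinimum": 0}))  # [] -> incomplete validation
```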

peterdesmet commented 8 months ago

@roll, since you wanted everything related to versioning be part of this discussion, I'm also referring to this comment by @khughitt and me regarding implementations retrieving or detecting the version of the Data Package spec:

Tangential but, this makes me wonder whether it would make sense to modify the validation machinery to support validating against earlier versions of the spec?

That would be useful, but rather than implementations (or users) guessing what version of the spec was used for a datapackage.json, it will likely be good if that was indicated. I don't think this is currently possible?

roll commented 8 months ago

I think on the Standard side, we need to decide whether we provide standard version information for an individual descriptor e.g. as proposed here https://github.com/frictionlessdata/specs/issues/444

I think every implementation is free to decide how to handle it as it's just about resources. E.g. some implementation can have a feature that it validates against versions X, Y, and Z. And some just against Y

Note that currently we consider datapackage.json to be versionless.
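If a descriptor-level version property were adopted (as proposed in issue #444), detection could be as simple as the following sketch. The property name `specVersion` is hypothetical, invented here for illustration; the key point is that, since datapackage.json is versionless today, absence must default to treating the descriptor as v1:

```python
def detect_spec_version(descriptor: dict) -> str:
    # "specVersion" is a hypothetical name for the spec version, kept
    # distinct from the package's own "version" property. Absence means
    # the descriptor predates versioning, so assume v1.
    return descriptor.get("specVersion", "1.0")

print(detect_spec_version({"name": "legacy-package"}))              # '1.0'
print(detect_spec_version({"name": "new", "specVersion": "2.0"}))   # '2.0'
```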

peterdesmet commented 8 months ago

I think the rules for changing the Data Package spec should be declared (on the spec website or elsewhere). I currently find it difficult to assess whether PRs follow the rules. Here's a first attempt:

General rules

(in line with @khusmann's statement that software is easier to update than data artifacts https://github.com/frictionlessdata/specs/issues/858#issuecomment-1885401944)

  1. An existing datapackage.json that is valid MUST NOT become invalid in the future.
  2. A new datapackage.json MAY be invalid because a software implementation does not support the latest version of the specification (yet).

Because of these rules datapackage.json does not have to indicate what version of Data Package it uses (i.e. it is versionless). Implementations have no direct way of assessing the version (even though this would make it easier https://github.com/frictionlessdata/specs/issues/858#issuecomment-1909977780 it is not something that we can require from data publishers, imo).

Versioning

  1. The Data Package specification is versioned. This is new compared to 1.0, where changes were added without increasing the version.
  2. The Data Package specification is versioned as a whole: a number of changes are considered, discussed, added or refused and released as a new minor version.

Property changes

  1. A property MUST NOT change its type
  2. A property MAY allow an additional type (e.g. also allowing an array). @roll you want to avoid this as a rule, but it does offer flexibility, cf. https://github.com/frictionlessdata/specs/issues/804#issuecomment-1913486995
  3. A property MUST NOT become required
  4. A property MAY become optional. Example: https://github.com/frictionlessdata/datapackage/pull/7
  5. A property MUST NOT add an enum
  6. A property MAY remove an enum. Example: https://github.com/frictionlessdata/specs/pull/809
  7. A property MUST NOT remove enum values
  8. A property MAY add enum values

Table schema changes

  1. A field type MUST NOT change default format. Example: does https://github.com/frictionlessdata/datapackage/pull/23 align with this?
  2. A field type MUST NOT remove format pattern options
  3. A field type MAY add format pattern options

New properties

  1. A new property MAY make a datapackage.json invalid (because of general rule 2). Example: https://github.com/frictionlessdata/datapackage/pull/24
  2. A new property CANNOT be required

Removed properties

  1. Removing a property CANNOT make a datapackage.json invalid (because of general rule 1)
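The effect of these rules can be sketched with two toy profile checks. These are simplified hand-rolled functions, not the real JSON Schema profiles, and the choice of `name` as the property made optional is purely for illustration:

```python
def valid_v1(d: dict) -> bool:
    # Toy v1 profile: "name" is required and must be a string.
    return isinstance(d.get("name"), str) and isinstance(d.get("resources"), list)

def valid_v1_1(d: dict) -> bool:
    # Toy v1.1 profile: "name" made optional (a property MAY become
    # optional), everything else unchanged.
    name_ok = "name" not in d or isinstance(d["name"], str)
    return name_ok and isinstance(d.get("resources"), list)

old = {"name": "my-package", "resources": []}  # valid under v1
new = {"resources": []}                        # uses the v1.1 relaxation

print(valid_v1(old), valid_v1_1(old))  # True True  -> general rule 1 holds
print(valid_v1(new), valid_v1_1(new))  # False True -> general rule 2: new data
                                       #   may need updated software
```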
khughitt commented 8 months ago

Thanks for taking the time to put this together, @peterdesmet! This seems like a great idea.

I think it would be useful to use this as a starting point for a description of the revision process in the docs.

I'll create a separate issue so that it can be tracked separately from the issue discussion here.

fomcl commented 7 months ago

My 2 cents here:

roll commented 5 months ago

@peterdesmet Regarding provisional properties, I think we have an even more elegant solution, for example using a special Data Package Draft/Next extension (or a profile per feature) where we can test new features and ideas without actually affecting the core specs themselves. Users would just need to use a draft Data Package profile to join the testing.

And then, if we have an established release cycle, we can merge tested features into the core specs on schedule. Actually, using this approach, feature development can even be decentralized.

peterdesmet commented 5 months ago

@roll sounds promising, would have to see it in action to fully understand. 😄