Versioning, breaking changes and public API

I'd like to discuss our version schema and the way we move forward with breaking changes. In its current state we have a stable release, but we frequently want to introduce minor interface-breaking changes which would, according to semantic versioning, require us to increase the major version. While I'm not opposed to increasing our version, I'd like to signal to users when they should actually expect something to break, not when, technically speaking, a minor part of the API changes in a way that shouldn't have an impact on them. We have a lot of public API at the moment which I wouldn't recommend for external usage (e.g. classes like DatasetBuilder, SchemaWrapper, etc.).

We currently have two versioning schemas:

The specification versioning via the metadata_version
The package version, to signal breaking Python API changes

I believe we need a better way to protect us from too much exposure as well as to protect users from unnecessary breakage.

I can currently come up with a few ways to handle this issue:

Don't change anything. Stick strictly to semver and increase the versions accordingly. Try to keep breaking releases to a minimum (e.g. once a quarter or less often)
Introduce an api submodule for our modules, clearly flagged as the part of the external API where we guarantee no breakage between major releases. I wouldn't introduce this for the io submodule, though.
Change the versioning schema to something like https://calver.org (open to other suggestions; this is just one that is rather common) and clearly document changes if something breaks.

I would opt for solution 2 (unstable api submodule) for the following reasons:

We have exposed many semi-private APIs to consumers which I think are helpful for alternative IO implementations or deeply integrated software (like CLIs), but which are normally not used by the average user. This part I would also move to api and declare it as semi-stable (non-breaking for fix releases, potentially breaking for minor version changes). Having a proper semver breaking release for every API breakage would just kinda hinder the communication with our users, since they would have to read the changelog all the time to know whether a release is harmless or whether we just rewrote all of kartothek from scratch.

I would not use calver, because:

The use-cases described in the linked document do NOT apply to us.
We kinda just opt out of everything and go back to the stone age of wild, non-semantic versioning. No matter what we promise, having proper (but flexible, see api discussion) rules means at least we try really hard to not be anarchic.

Let's also keep in mind that semver is a guideline, not a law. Many bugfixes out there could in theory be considered a breaking change, but treating them as such is obviously impractical. The same goes for feature releases where someone extends function signatures with additional optional parameters. Depending on the concrete user code, this can be breaking for some users. I think we should rather ask: "how often does a non-major release break user code, and was the concrete case an accident or a calculated risk?"
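To make the optional-parameter case concrete, here is a minimal sketch. All names (Reader, read_many, predicates, LoggingReader) are made up for illustration and are not kartothek's API. The library adds an optional keyword to a method and starts forwarding it internally, which breaks a user subclass written against the old signature:

```python
class Reader:
    """Hypothetical library class after a minor release.

    The previous release had read(self, path, columns=None);
    the optional `predicates` parameter is new.
    """

    def read(self, path, columns=None, predicates=None):
        return {"path": path, "columns": columns, "predicates": predicates}

    def read_many(self, paths, predicates=None):
        # Library code now forwards the new keyword to read().
        return [self.read(p, predicates=predicates) for p in paths]


class LoggingReader(Reader):
    """Hypothetical user subclass written against the OLD signature."""

    def read(self, path, columns=None):
        return super().read(path, columns)


def subclass_still_works():
    """Return True if the user subclass survives the 'compatible' change."""
    try:
        LoggingReader().read_many(["a"])
    except TypeError:
        # read() got an unexpected keyword argument 'predicates'
        return False
    return True
```

Plain callers of Reader are unaffected, so for most users this really is a feature release. Only users who override read() are hit, which is the kind of "taken risk" the question above is about.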

I would not use calver, because

Completely agree. This was the very first versioning schema I had in mind and I at least wanted to put the notion of leaving semver behind on the table.

Having a proper semver breaking release for every api-breakage would just kinda hinder the communication with our users [...] Let's also keep in mind that semver is a guideline, not a law.

Indeed, I just want to increase transparency for us and the users to clarify what and how we're doing it and document this somewhere. I believe we haven't even documented what the metadata_version actually means...

This part I would also move to api and declare it as semi-stable (non-breaking for fix-releases, potentially breaking for minor version changes)

My idea was rather to guarantee stability for things we added to the api and only break them in major releases. For interfaces not part of the api modules we wouldn't give any strong guarantees.
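If we go this way, one option for keeping ourselves honest is to snapshot the signatures of everything exported from the api modules and compare snapshots between releases. The following is only a sketch of that idea using the standard library's inspect module; the function names load_dataset and drop_dataset are hypothetical stand-ins, not existing kartothek functions:

```python
import inspect


def signature_snapshot(objects):
    """Map each exported name to its signature rendered as a string."""
    return {name: str(inspect.signature(obj)) for name, obj in objects.items()}


def breaking_differences(old, new):
    """Return (removed, changed) names between two snapshots.

    This is deliberately conservative: any signature change is reported,
    even one that is backwards compatible for most callers.
    """
    removed = sorted(name for name in old if name not in new)
    changed = sorted(
        name for name in old if name in new and old[name] != new[name]
    )
    return removed, changed


# Hypothetical "previous release" of the stable API.
def load_dataset_v1(dataset_uuid, columns=None):
    pass


def drop_dataset_v1(dataset_uuid):
    pass


# Hypothetical "next release": one function extended, one removed.
def load_dataset_v2(dataset_uuid, columns=None, predicates=None):
    pass


old_api = signature_snapshot(
    {"load_dataset": load_dataset_v1, "drop_dataset": drop_dataset_v1}
)
new_api = signature_snapshot({"load_dataset": load_dataset_v2})
removed, changed = breaking_differences(old_api, new_api)
```

A non-empty result would not automatically mean a major release, but it would force a conscious decision before anything in api changes.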

Either way, this strongly depends on what would be part of the api module. In the end I'd like to come up with something like (list not exhaustive!):

major version

Remove deprecated functionality (including metadata versions)
Remove modules
Change interface and behavior of api and io modules
All included in minor changes

minor versions

Add new metadata versions
Add or extend functionality in a backwards compatible way
Change interface and behavior of non-stable (not api, not io) modules
All included in patch

patch

Everything not user facing
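The rules above could be written down roughly like this. It is a sketch only: the change category names are invented for illustration, and the mapping simply mirrors the (non-exhaustive) list in this comment.

```python
from enum import IntEnum


class Bump(IntEnum):
    PATCH = 0
    MINOR = 1
    MAJOR = 2


# Hypothetical change categories mapped to the smallest release allowed
# to contain them. A larger release may include everything below it.
REQUIRED_BUMP = {
    "remove_deprecated_functionality": Bump.MAJOR,
    "remove_module": Bump.MAJOR,
    "change_api_or_io_interface": Bump.MAJOR,
    "add_metadata_version": Bump.MINOR,
    "add_backwards_compatible_feature": Bump.MINOR,
    "change_unstable_interface": Bump.MINOR,
    "not_user_facing": Bump.PATCH,
}


def required_bump(changes):
    """Return the smallest release type that may contain all given changes."""
    return max((REQUIRED_BUMP[change] for change in changes), default=Bump.PATCH)


def next_version(version, changes):
    """Compute the next 'major.minor.patch' version for a set of changes."""
    major, minor, patch = (int(part) for part in version.split("."))
    bump = required_bump(changes)
    if bump is Bump.MAJOR:
        return f"{major + 1}.0.0"
    if bump is Bump.MINOR:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

For example, a release that only touches a non-stable module and adds a new metadata version stays a minor release, while removing a deprecated function forces a major one.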

metadata version

All specification related changes will increase the integer. This includes:

Key encoding
Schema storage spec
Internal metadata spec
Index storage format
...
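On the reading side, an integer metadata version lends itself to a simple compatibility check. The sketch below assumes the dataset metadata is available as a dict with a metadata_version key; the helper name and the set of supported versions are placeholders for illustration, not what the package actually supports.

```python
# Placeholder values; the real set depends on the package release.
SUPPORTED_METADATA_VERSIONS = frozenset({3, 4})


def check_metadata_version(dataset_metadata):
    """Return the metadata version if this (hypothetical) reader supports it."""
    version = dataset_metadata.get("metadata_version")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(version, int) or isinstance(version, bool):
        raise ValueError(f"metadata_version must be an integer, got {version!r}")
    if version not in SUPPORTED_METADATA_VERSIONS:
        raise NotImplementedError(
            f"metadata_version {version} is not supported; "
            f"supported versions: {sorted(SUPPORTED_METADATA_VERSIONS)}"
        )
    return version
```

Dropping an entry from the supported set would then be a major release and adding one a minor release, in line with the lists above.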
