DAS-RCN / RCN_DASformat


Version type #9

Closed anowacki closed 2 years ago

anowacki commented 2 years ago

Currently (as of bf220df66b4b20b0401074f0e137e436ce80b19a), DASFileVersion is a float16. I suggest instead the version is split into two or three parts, so that semantic versioning can be followed with the file format.

The format could instead be two or three int16s, or perhaps just a string (my preference, so that you can have versions like v1.2.3-rc1), but either way a float doesn't seem like it could capture that easily.
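To illustrate the limitation, here is a minimal Python sketch (not taken from the repository) of what happens when a dotted version is forced through a half-precision float. It uses only the standard `struct` module; the helper name `to_float16` is hypothetical.

```python
import struct


def to_float16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# 1.03 is not exactly representable in float16, so the stored value drifts.
stored = to_float16(1.03)
print(stored)

# "1.1" and "1.10" are different versions but the same float.
same_float = float("1.1") == float("1.10")
print(same_float)

# A three-part version cannot be converted to a float at all.
try:
    float("1.2.3")
    three_part_ok = True
except ValueError:
    three_part_ok = False
print(three_part_ok)
```

A string or a set of integers avoids all three problems, at the cost of needing a small parser for comparisons.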

miili commented 2 years ago

I completely agree with semantic versioning for software.

Semantic versioning is not required for file formats or APIs, which should not break between major versions.

MAJOR version when you make incompatible API changes.

This means that any v1 file can be read by any lib1.x.x. The same convention applies to API versioning, which uses a single integer (v1, v2, etc.) and is non-breaking between minor versions.

So I think version should be a small integer.

@andreas-wuestefeld, please see link to semantic versioning above. idk why we start with version 0.9 here (seems arbitrary), and somewhere in the scripts I have read version string 1.03

anowacki commented 2 years ago

@miili I agree with you that file formats should not break between major versions.

By saying that, we are giving semantic meaning to the file version. Some file versioning schemes do not place any semantic meaning on the file version. In such a scheme, it could be that v71 files can be read by v70 readers, or they may be incompatible. Some people use dates, which don't say how the versions actually change. I agree for our case here it makes sense that an increase in the major version number (going before the first .) semantically means that things which could read (a subset of) the old file format cannot now read the new file format. What we have arrived at here is semantic versioning! Let me explain why I think changing minor versions (the number after the first .) also makes sense for file formats.

For example, say we begin with a v1.0 file format. Some time later, we decide to add extra fields (perhaps they are optional, but they can be mandatory), but retain everything else about the file format. In this case, if readers of v1.0 files are still able to read the new-version files and obtain the same information as they did before, this new version can be labelled v1.1. We can use this style of file version numbering to signal a guarantee of what I just said—that v1.0 readers can still read the files, just not take advantage of the new 'features' or information.

People who write v1.1 readers could either choose to support v1.0 files, or drop support for v1.0 files and only read v1.1 files, perhaps because they need the new 'features' of v1.1 files. All of this is possible because we have made the major and minor release numbers semantically meaningful; in other words, we have used SemVer.
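The guarantee described above can be sketched as a small check. This is only an illustration under the assumption that versions are written as `MAJOR.MINOR` with an optional leading `v`; the function names are hypothetical and not part of any existing reader.

```python
from typing import Tuple


def parse_major_minor(text: str) -> Tuple[int, int]:
    """Parse 'v1.1' or '1.1' into (1, 1). Requires both parts to be present."""
    major, minor = text.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def guaranteed_readable(reader: str, file_version: str) -> bool:
    """True if the versioning scheme guarantees that a reader written for
    `reader` can read a file of `file_version`: same major version, and the
    file is at least as new as the reader within that major version."""
    r_major, r_minor = parse_major_minor(reader)
    f_major, f_minor = parse_major_minor(file_version)
    return r_major == f_major and f_minor >= r_minor


print(guaranteed_readable("v1.0", "v1.1"))  # old reader, newer minor version
print(guaranteed_readable("v1.1", "v1.0"))  # not guaranteed: the reader's choice
print(guaranteed_readable("v1.0", "v2.0"))  # major bump: breaking change
```

A v1.1 reader may of course still support v1.0 files; the check only encodes what the version number promises, not what a particular reader chooses to do.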

I think at this point for file formats it is fine to stop with major and minor versions, but in fact you can go further. Say in v1.1.0 we accidentally specify a field name incorrectly (we have a typo and add the field "smaple_rate") and introduce a 'bug'. We can release a file format version v1.1.1 which changes that to "sample_rate". This is bumping the 'patch version'. Again, authors of file readers can choose to drop support for v1.1.0 files and only support files from v1.1.1 onwards if they want, but they might also want to retain support for v1.1.0 files and cope with reading the ghastly "smaple_rate" field into something better named internally when needed.
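A reader that keeps supporting the misspelled field could do something like the following. This is a hedged sketch: the header is modelled as a plain dictionary, and `FIELD_ALIASES`, `normalise_fields` and the `n_channels` field are hypothetical names used only for illustration.

```python
# Hypothetical mapping from misspelled field names to their corrected names.
FIELD_ALIASES = {"smaple_rate": "sample_rate"}


def normalise_fields(header: dict) -> dict:
    """Return a copy of `header` with known misspelled keys renamed."""
    return {FIELD_ALIASES.get(key, key): value for key, value in header.items()}


old_header = {"smaple_rate": 1000.0, "n_channels": 512}  # as in a v1.1.0 file
new_header = {"sample_rate": 1000.0, "n_channels": 512}  # as in a v1.1.1 file

# Both headers look the same to the rest of the reader after normalisation.
print(normalise_fields(old_header) == normalise_fields(new_header))
```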

I have used SemVer for custom file formats in a few projects over the past few years, and it has worked well. One can debate whether existing file formats adhere to SemVer in all cases, but point releases for file formats with which we are familiar are used by SEED, StationXML, QuakeML and ASDF, amongst others.

As for starting at a version less than 1, there are reasons to do that, noting that a change from v0.1 to v0.2 is a breaking change in SemVer. If we begin at v0.1.0 (I agree starting at v0.9 would be weird, implying minor versions before it), that implies a preliminary file format which is subject to frequent change, which may free people to iterate to a better long-term format more quickly. A stable format would be signified when v1.0.0 is released. On the other hand, adoption may be slower if we begin with v0.1 as people may wait until v1 arrives before using it or writing readers/writers.

My own view would be that we should start with v0.1 when some basic level of consensus has been reached, then try that out, and let experience drive quick changes to v0.n, then again quickly release v1.0 if that feels ready without worrying too much if the format will need to change again.

andreas-wuestefeld commented 2 years ago

Thanks Andy

You covered most of my reasoning that I failed to explain, and added good arguments. Floats were chosen for easy boolean comparison (if version < 0.9 then...). But the 0.9.1 style for patches sounds like a good idea. Adding "rc1" style seems overkill for a file format :-) especially one that is potentially short-lived.
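Comparisons stay just as easy with a three-part version if the string is split into integers first. A minimal sketch, assuming a plain `MAJOR.MINOR.PATCH` string with no suffix (`version_tuple` is a hypothetical helper):

```python
def version_tuple(text: str) -> tuple:
    """Turn '0.9.1' into (0, 9, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


file_version = version_tuple("0.9.1")
if file_version < (1, 0, 0):
    print("pre-release format")

# Comparing the raw strings is lexicographic and orders 0.10.0 before 0.9.0;
# comparing the integer tuples gives the intended numeric ordering.
string_order = "0.10.0" < "0.9.0"
tuple_order = version_tuple("0.10.0") < version_tuple("0.9.0")
print(string_order, tuple_order)
```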

Vers 0.9 was intended to show "nearly ready for release". Version 1.0.0 should be the one used in the global DAS month. The release date will be shortly after Dec 15th 2022, the day the commenting period is over. People need time to implement writers...

I propose to implement the three-digit style, starting at 0.1.0 as the format is now. Then I will implement some changes as suggested in other issue threads. I may bundle several changes into one version number change :-)

andreas-wuestefeld commented 2 years ago

implemented. open for comments

miili commented 2 years ago

Well, the point is that a major version is always compatible! Even when you add additional fields. If it is incompatible you bump it up. If the versions are compatible you just don't bump the major version.

For all major software this is the consensus. Schemas and models only have a major version, which is compatible and non-breaking between major versions (e.g. 1.x.x). When you add functionality or fields it can still be compatible, and it will still be version 1.

@anowacki, your proposal is not compatible with semantic versioning. By definition it is MAJOR.MINOR.PATCH-SUFFIX

Look at this:

import semver

semver.VersionInfo.parse('1.1')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-a6eca5e37153> in <module>
----> 1 semver.VersionInfo.parse('1.1')

~/.local/lib/python3.9/site-packages/semver.py in parse(cls, version)
    724         match = cls._REGEX.match(ensure_str(version))
    725         if match is None:
--> 726             raise ValueError("%s is not valid SemVer string" % version)
    727 
    728         version_parts = match.groupdict()

ValueError: 1.1 is not valid SemVer string

My proposal is that we start with version 0 until a consensus is reached.

anowacki commented 2 years ago

Well, I'm happy to go with patch numbers as well if preferred, though I was writing 'v1', 'v1.0', etc., for brevity. I don't use Python much and wasn't aware of the semver library, so thank you for pointing to that; though perhaps any one library need not be taken to be the canonical source for version number parsing. For example, Julia's type VersionNumber in Base will handle missing minor and patch numbers, taking them as 0:

julia> VersionNumber("1")
v"1.0.0"

(Missing -suffixes are assumed not to exist—not sure if that's the case for the Python semver library or not.)
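For comparison, a lenient parser along the same lines can be sketched in Python with only the standard library. This is an illustration, not the behaviour of the `semver` package; `parse_lenient` and its regular expression are hypothetical.

```python
import re
from typing import Optional, Tuple

# Optional leading "v", then MAJOR with optional .MINOR, .PATCH and -SUFFIX.
_PATTERN = re.compile(r"^v?(\d+)(?:\.(\d+))?(?:\.(\d+))?(?:-([0-9A-Za-z.-]+))?$")


def parse_lenient(text: str) -> Tuple[int, int, int, Optional[str]]:
    """Parse a version string, treating missing minor/patch numbers as 0."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"{text!r} is not a recognisable version string")
    major, minor, patch, suffix = match.groups()
    return int(major), int(minor or 0), int(patch or 0), suffix


print(parse_lenient("1"))           # (1, 0, 0, None)
print(parse_lenient("1.1"))         # (1, 1, 0, None)
print(parse_lenient("v1.2.3-rc1"))  # (1, 2, 3, 'rc1')
```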

I think we are really saying the same thing—major version changes mean backwards-incompatible, breaking changes, and backwards-compatible, non-breaking changes can be made without changing the major version number. Not everyone really uses SemVer (e.g., Linux, NumPy), but I agree it makes more sense than most other schemes, and I'm happy that we all think it's a good scheme to use for software at least. For this particular file format, it doesn't seem critical in any case. Thank you both for the productive discussion.

miili commented 2 years ago

If minor versions are always compatible, I don't see the point for semver in the schema. A simple integer for major is enough.

Here is an interesting discussion https://stackoverflow.com/a/27901741/2387835

All other changes (including new features, bugfixes, patches etc.) should be 'safe' for your consumers. Those new features don't have to be used by your consumers, and you probably don't want to continue to run that unpatched version that contains bug X or Y any longer than necessary

anowacki commented 2 years ago

(I think Andreas has indicated that this repo is meant to design a temporary file format only for the community DAS experiment in February, so I don't hold very strong views on any of this now.)

That makes sense for a web service, where there is one existing implementation of that web service in operation, such as GitHub's web API. If you update the service in a backwards-compatible way, then the users of the service don't need to know that something has changed. New users of the service can take advantage of the backwards-compatible additions from today onwards and not care about the fact that they didn't exist in the past, because there is no way to obtain a response from the service which does not include the new 'features'.

File formats are different I feel. Multiple files of multiple versions may be present at any one time (e.g., on your hard disk) and they do not stop having their particular (backwards-compatible) features after some date automatically, in contrast to a single remote service. Files may be provided from multiple different places, each of which may provide a different file version.

Given that, I can see one can argue that any change is incompatible and so you only ever need a major version number. I think that's fine as well. Why in practice things like QuakeML, StationXML and SEED have opted for minor version numbers to accommodate additions instead I couldn't say for sure.

andreas-wuestefeld commented 2 years ago

The important thing is that there is a version number at all. I think the three-digit system is a good compromise. I don't anticipate too many changes anyway after version 1.0.0. Hopefully the official IRIS TilsDB format will be available soon after.

I am closing this issue so we can focus on more pressing issues in the format

miili commented 2 years ago

The important thing is that there is a version number at all. I think the three-digit system is a good compromise. I don't anticipate too many changes anyway after version 1.0.0. Hopefully the official IRIS TilsDB format will be available soon after.

This is exactly the argument for only major version integer :smiley: