RFE: Backend based version comparison

bkabrda commented 8 years ago

So I've been investigating why my local instance of Anitya claims that the latest version of Django is 1.10b1 when 1.10 has already been released. I found out that the underlying issue is that versions of all backends are compared using RPM version comparison. For example:

$ rpmdev-vercmp 1.10b1, 1.10
1.10b1, > 1.10

So this explains why Django 1.10b1 is rated as a newer release than 1.10. It would be cool if Anitya implemented a version comparison algorithm per backend, since every backend has corner cases that compare differently than RPM version comparison does (prereleases manifest this most often).
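To illustrate why a segment-based comparison ranks the beta higher, here is a simplified sketch in Python. It is not the real rpmvercmp (no epoch, tilde, or caret handling), and the function name `rpm_like_cmp` is made up for this example.

```python
import re

# Simplified sketch of RPM-style segment comparison; NOT the real rpmvercmp.
# Versions are split into runs of digits and runs of letters; separators are ignored.
_SEGMENT = re.compile(r"[0-9]+|[a-zA-Z]+")


def rpm_like_cmp(a: str, b: str) -> int:
    """Return 1 if a sorts as newer, -1 if b sorts as newer, 0 if they tie."""
    segs_a = _SEGMENT.findall(a)
    segs_b = _SEGMENT.findall(b)
    for x, y in zip(segs_a, segs_b):
        if x.isdigit() and y.isdigit():
            x_num, y_num = int(x), int(y)
            if x_num != y_num:
                return 1 if x_num > y_num else -1
        elif x.isdigit() != y.isdigit():
            # A numeric segment is considered newer than an alphabetic one.
            return 1 if x.isdigit() else -1
        elif x != y:
            return 1 if x > y else -1
    # All shared segments are equal: the version with segments left over wins.
    if len(segs_a) == len(segs_b):
        return 0
    return 1 if len(segs_a) > len(segs_b) else -1


print(rpm_like_cmp("1.10b1", "1.10"))  # 1: the leftover "b1" makes the beta sort as newer
```

The last rule (leftover segments win) is what makes 1.10b1 beat 1.10: nothing in this scheme knows that "b" marks a pre-release.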

This is a tough nut to crack, since version comparisons would need to be implemented for all backends properly. Perhaps we could start by implementing Python-specific version comparison and let all other backends fall back to the RPM based one? If this sounds good, I'll try to give it a shot and send a PR.

Also I've been using Anitya for several days right now and everything is pretty smooth, so let me thank you for writing a nice piece of software :)

pypingou commented 8 years ago

I would much rather use one version comparison for all. The human brain is the most creative part of the universe when it comes to versions and I don't think a per-backend approach would bring much except complexity in the code base.

What you read here in 1.10b1 as the beta 1 of 1.10, others will read as 1.10 post-release b1. For example: https://pypi.python.org/pypi/straight.plugin/ claims the latest version is 1.4.1 while it actually is 1.4.1-post-1 (cf. https://github.com/ironfroggy/straight.plugin/issues/17).

My take is that versions are a mess and I doubt we will ever be able to be flexible enough to cover all of them. One thought which I have is to try porting from rpm to the python module implementing https://www.python.org/dev/peps/pep-0440/

bkabrda commented 8 years ago

I totally understand and agree with what you're saying in regards to using one version comparison for all (and the additional complexity).

I think the issue with the straight.plugin version is just a bug in PyPI code, which isn't all that important since pip will do the right thing and people are busy working on Warehouse rather than fixing old PyPI code.

I also agree that versions are a mess... Nevertheless, I still think this would be worth the effort, assuming we could get a library that would do this for us and we'd just call sth. like compare_versions('pypi', '1.10', '1.10b'). So my proposal is to keep it open, because I may actually do that (or someone from my current team, we're actually pretty interested in this :))

pypingou commented 8 years ago

I'm fine with leaving this open for further discussion but I am not convinced currently (not saying that I couldn't be) :)

ncoghlan commented 8 years ago

I don't think per-backend version comparison makes sense, but making version comparisons configurable was one of the things I had in mind when adding the ecosystem plugins to define per-ecosystem name uniqueness - PyPI, npm, CPAN, etc do define precisely what "the latest" version means in terms of their version specifiers.

Between SemVer, CalVer, and PEP 440, future language ecosystems should hopefully avoid adding too many new variants for version numbering and ordering, so I don't think this is an unbounded complexity problem - I think there's only a handful of schemes in sufficiently widespread use for it to be helpful for Anitya to support them.

kentfredric commented 8 years ago

future language ecosystems should hopefully avoid adding too many new variants for version numbering and ordering

Unfortunately, only Perl is really adequate to compare perl versions.

Presently you've got version comparisons completely backwards for Perl, assuming that 1.41 is larger than 1.5, because your logic assumes dotted decimal is something everyone uses.

But Perl has 2 version schemes, and which one you get depends on what characters you use and how version.pm parses them.

1.1 # single '.', no leading "v": always treated as if it were a float, so 1.1 == 1.100
1.2.3 # multiple '.', no leading "v": always treated as a decimal-separated list of integers ( technically chars, but let's not think too hard about that )
v1.2 # single '.', leading "v": always treated as decimal-separated ints
v1.2.3 # multiple '.', leading "v": decimal-separated ints

the form "1.2.3" is however in a "grey" area and is discouraged in favour of the stricter "v1.2.3"

And how you compare 2 such versions is an entirely separate discussion :/

For instance, any version with an "_" gets extra special treatment, because "_" means "Alpha"

And there has also been a recent change in how "_" is treated.

Until recently, 1.1_1 could have meant the same as 1.1.1 (shudder), because somebody thought it was smart to treat _ like a dot, despite Perl itself not doing so.

Or it could have meant "1.11 with hidden alpha bits" where 1.1_1 cmp 1.11 did not return 0

Now, at least in the "floating point" forms, "_" is "ignored as if it wasn't there" for the sake of comparisons, and so now sanity prevails and 1.1_1 cmp 1.11 returns 0.
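The rules above can be sketched in Python as a normalisation to integer tuples. This is only an approximation of what version.pm does, under the assumptions described in this comment: decimal versions are split into three-digit groups (so 1.1 becomes 1.100), underscores are ignored in decimal forms, and underscores in dotted forms are simply rejected. The name `perl_version_key` is hypothetical.

```python
def perl_version_key(raw: str) -> tuple:
    """Map a Perl version string to a comparable tuple of integers (rough sketch)."""
    text = raw.strip()
    if text.startswith("v") or text.count(".") >= 2:
        # Dotted forms: v1.2, v1.2.3 and the discouraged bare 1.2.3.
        if "_" in text:
            raise ValueError("underscores in dotted versions are not handled in this sketch")
        parts = [int(p) for p in text.lstrip("v").split(".")]
    else:
        # Decimal form: "_" is ignored, the fraction is read in groups of three digits.
        text = text.replace("_", "")
        whole, _, fraction = text.partition(".")
        fraction += "0" * (-len(fraction) % 3)  # right-pad: "41" -> "410"
        groups = [int(fraction[i:i + 3]) for i in range(0, len(fraction), 3)]
        parts = [int(whole)] + groups
    # Drop trailing zeros so that v1.2 and v1.2.0 compare equal.
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


print(perl_version_key("1.41") < perl_version_key("1.5"))     # True: 1.410 < 1.500
print(perl_version_key("1.1_1") == perl_version_key("1.11"))  # True
```

Delegating to version.pm itself remains the safer route; a reimplementation like this will miss edge cases that the real test suite covers.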

Basically, I do find it rather foolish not to have some sort of backend-specific handling for Perl, because it's a nightmare of its own, let alone foisting that nightmare onto every other package (which is by design completely incompatible)

I would also prefer to delegate version decoding, normalising and comparison directly to Perl's version.pm where possible, but I understand if that is not an option.

But either way, there's a fair bit of technical reading that needs to happen here:

http://www.dagolden.com/index.php/369/version-numbers-should-be-boring/

And the test suites for Perl's version.pm hopefully show it's not as obvious as one might think.

https://v1.metacpan.org/source/JPEACOCK/version-0.9917/t/10_lyon.t
https://v1.metacpan.org/source/JPEACOCK/version-0.9917/t/04strict_lax.t
https://v1.metacpan.org/source/JPEACOCK/version-0.9917/t/coretests.pm

There's also the recommendation from upstream that, instead of burdening downstreams with the warts of versions in Perl, they should normalise upstream versions into a scheme that is consistent with their vendor's version comparisons.

Subsequently, I have a reference implementation that we use in Gentoo that uses Perl's version.pm to generate equivalent Gentoo compatible versions in the x.yyy.zzz form:

https://v1.metacpan.org/pod/distribution/Gentoo-PerlMod-Version/lib/Gentoo/PerlMod/Version.pm

And it also sports a CLI utility for convenience:

https://v1.metacpan.org/pod/distribution/Gentoo-PerlMod-Version/bin/gentoo-perlmod-version.pl

Hopefully my test suite will show how surprising Perl versions can be: https://v1.metacpan.org/source/KENTNL/Gentoo-PerlMod-Version-0.8.0/t/01_basic.t

jeremycline commented 7 years ago

I've started on this in #448. Rather than per-backend, it's per-project. Please let me know what you think!

ncoghlan commented 7 years ago

Could this be per-ecosystem rather than per-project?

I don't think we really want to encourage people defining their own ad hoc version ordering schemes, and ecosystems typically define a "how to order versions" mechanism in addition to defining a shared package namespace.

ncoghlan commented 7 years ago

I think that may also address the concerns @pypingou raises above - by default, everything would still assume RPM-style comparisons, but ecosystems could optionally define two hooks:

how to order versions according to the upstream rules
how to generate an RPM-style version that sorts the same way the upstream ones do

(the latter is technically out of scope for Anitya itself, but would make sense if these features were eventually moved out to a helper library that could also be used by package generation tools like pyp2rpm)
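A minimal sketch of the two hooks, assuming a hypothetical `Ecosystem` base class (this is not Anitya's actual plugin API). The PyPI subclass only understands a small subset of PEP 440 (`N.N...` with an optional `aN`, `bN` or `rcN` suffix), and the RPM translation relies on RPM's tilde, which sorts before the bare version.

```python
import re

# Hypothetical names throughout; only a tiny subset of PEP 440 is recognised.
_PEP440_SUBSET = re.compile(r"^([0-9]+(?:\.[0-9]+)*)(?:(a|b|rc)([0-9]+))?$")


class Ecosystem:
    """Default behaviour: no upstream ordering, versions passed through unchanged."""

    name = "generic"

    def version_key(self, version: str):
        """Hook 1: upstream ordering. None means "use the RPM-style comparison"."""
        return None

    def rpm_version(self, version: str) -> str:
        """Hook 2: an RPM-style string that sorts like the upstream version."""
        return version


class PyPIEcosystem(Ecosystem):
    name = "pypi"
    _PRE_RANK = {"a": 0, "b": 1, "rc": 2}

    def _parse(self, version: str):
        match = _PEP440_SUBSET.match(version)
        if match is None:
            raise ValueError(f"unsupported version for this sketch: {version!r}")
        release = [int(p) for p in match.group(1).split(".")]
        return release, match.group(2), match.group(3)

    def version_key(self, version: str):
        release, pre, pre_num = self._parse(version)
        while len(release) > 1 and release[-1] == 0:
            release.pop()  # 1.10 and 1.10.0 are the same release
        if pre is None:
            return (tuple(release), 3, 0)  # a final release sorts after its pre-releases
        return (tuple(release), self._PRE_RANK[pre], int(pre_num))

    def rpm_version(self, version: str) -> str:
        release, pre, pre_num = self._parse(version)
        base = ".".join(str(p) for p in release)
        return base if pre is None else f"{base}~{pre}{pre_num}"


pypi = PyPIEcosystem()
print(sorted(["1.10", "1.10b1", "1.9"], key=pypi.version_key))  # ['1.9', '1.10b1', '1.10']
print(pypi.rpm_version("1.10b1"))  # 1.10~b1
```

Ecosystems that do not override either hook keep today's behaviour, which matches the "RPM-style by default" idea.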

kentfredric commented 7 years ago

Could this be per-ecosystem rather than per-project?

"Backends" are essentially "Ecosystems" at present, just expressed in terms of the mechanism by which the code is accessed, e.g. if it's sourced from the "cpan" backend, then it's Perl; if it's from the pypi backend, it's Python.

The problem here though is there's no clear association between backend and ecosystem for all languages, so anything that distributes primarily through github is vague.

Hence, doing it at the project level means you still have that variability.

But that means you can still save a lot of effort by making certain backends have a default version scheme, which is the best option for backends like CPAN, while letting it be changed when it's wrong, which is a good compromise for the diaspora of arbitrary backends.

And it's not "encouraging" everyone to have their own ad hoc versioning schemes, because instead of being a free-form field, it's a drop-down where you pick the scheme from a list of known ones.

( It could never work as a field anyway, because CPAN versioning scheme requires dozens of lines of code to even implement a basic implementation of , and is certainly outside the scope of anything that could ever be done with a regex )

how to generate an RPM-style version that sorts the same way the upstream ones do

That's sort of begging for an ability for vendors to submit their own Python module that handles the translation for that vendor.

A vendor function could take [backend, identity, version, vendor-identity] and compose them into output version-identity + normalised version strings.

( This is again something very necessary with CPAN, esp. for Gentoo, because of the aforementioned reasons where we use normalised versions everywhere, and it's confusing for outsiders to look at Anitya and see us tracking different versions than you )

ncoghlan commented 7 years ago

@kentfredric Backends and ecosystems are already modelled separately: https://github.com/release-monitoring/anitya/tree/master/anitya/lib/ecosystems

While there's a bit of hackery [1] to avoid exposing that in the web UI for the time being (by having backends define a "default ecosystem"), the distinction is already visible in the fact that:

the database enforces per-ecosystem uniqueness of project names
the REST API allows project information to be looked up by ecosystem (which I just noticed was still missing from the live API docs: https://github.com/release-monitoring/anitya/pull/451 )

However, the "direct from VCS tag" release model is an interesting one, as is the question of tracking version schemes for the "custom" backend, so you're probably right that it makes sense to model version schemes as their own kind of entity. That would then enable:

ecosystems (and by extension, backends) declaring a default version scheme
picking a version scheme explicitly when using a backend that doesn't declare a default ecosystem
downstreams (once they're modelled as more than just text strings) having a declared version scheme (rather than assuming RPM style versioning, which is correct for Fedora and derivatives, but not for everyone else)

Allowing downstream plugins to also define translations from upstream version schemes to downstream ones would then be a problem for later.

[1] https://github.com/release-monitoring/anitya/issues/400 is a problem with that hackery when the backend is changed for a project, since it keeps its previous ecosystem setting in that case

jeremycline commented 7 years ago

Could this be per-ecosystem rather than per-project?

I don't think we really want to encourage people defining their own ad hoc version ordering schemes, and ecosystems typically define a "how to order versions" mechanism in addition to defining a shared package namespace.

What about ecosystems define a default version scheme, but it can be overridden at a per-project level? No matter what we do, there will be crazy projects in ecosystems that don't play ball with the version scheme, right? The way I implemented it in #448 is the ProjectVersion table grew a type column and there's a dropdown in the UI based on the supported types.
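One way to picture "ecosystem default, per-project override" is the lookup below. All names here (`Project`, `effective_scheme`, the scheme identifiers and the default mapping) are hypothetical and do not reflect the schema in #448.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical scheme names and defaults, for illustration only.
KNOWN_SCHEMES = {"rpm", "pep440", "perl", "semver"}
ECOSYSTEM_DEFAULT_SCHEME = {"pypi": "pep440", "cpan": "perl"}
FALLBACK_SCHEME = "rpm"


@dataclass
class Project:
    name: str
    ecosystem: Optional[str] = None       # e.g. None for a "custom" backend
    version_scheme: Optional[str] = None  # per-project override, picked from a fixed list


def effective_scheme(project: Project) -> str:
    """Project override first, then the ecosystem default, then RPM-style."""
    if project.version_scheme is not None:
        if project.version_scheme not in KNOWN_SCHEMES:
            raise ValueError(f"unknown version scheme: {project.version_scheme!r}")
        return project.version_scheme
    return ECOSYSTEM_DEFAULT_SCHEME.get(project.ecosystem, FALLBACK_SCHEME)


print(effective_scheme(Project("Django", ecosystem="pypi")))                          # pep440
print(effective_scheme(Project("oddball", ecosystem="pypi", version_scheme="rpm")))   # rpm
print(effective_scheme(Project("from-a-git-tag")))                                    # rpm
```

Because the override is validated against a closed set, this stays a "pick from a list" choice rather than a free-form field.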

I should also say that I'd really like Anitya to not be so dependent on correctly ordering the version. We get it wrong plenty today, and we'll get it wrong sometimes no matter what we do. Right now we only record an upstream version when it's determined to be "newest" (by some finite set of sorting algorithms), but I think we should record all upstream versions we discover*. Sorting then becomes a thing in the UI, but getting it wrong doesn't mean we fail to announce a new version. This would be nice because plenty of projects have an LTS release or similar and maintainers might want to hear about new LTS releases and the latest-and-greatest from master.

All that's to say I think we can't be correct for every project. Instead, I'd like to provide sane defaults that mostly work and give the user plenty of tools to improve the situation for individual projects.

* There are plenty of opportunities here to cater to individual projects - optionally ignoring pre-releases if the version scheme supports it, ignoring tags/releases that aren't parsable by the version scheme, etc.
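A rough sketch of that kind of per-project filtering, assuming a deliberately tiny version grammar (digits and dots, optional leading "v", optional `aN`/`bN`/`rcN` suffix). The function names are made up for illustration.

```python
import re
from typing import List, Optional

# Tiny, assumed grammar: optional "v", dotted integers, optional pre-release suffix.
_TAG = re.compile(r"v?[0-9]+(?:\.[0-9]+)*((?:a|b|rc)[0-9]+)?")


def is_prerelease(tag: str) -> Optional[bool]:
    """True/False for parsable tags, None when the scheme cannot parse the tag."""
    match = _TAG.fullmatch(tag)
    if match is None:
        return None
    return match.group(1) is not None


def select_versions(tags: List[str], include_prereleases: bool = False) -> List[str]:
    """Keep every discovered tag the scheme understands, optionally dropping pre-releases."""
    kept = []
    for tag in tags:
        pre = is_prerelease(tag)
        if pre is None:
            continue  # not parsable by this scheme: ignore rather than mis-sort
        if pre and not include_prereleases:
            continue
        kept.append(tag)
    return kept


found = ["1.10b1", "1.10", "nightly", "v1.9"]
print(select_versions(found))                            # ['1.10', 'v1.9']
print(select_versions(found, include_prereleases=True))  # ['1.10b1', '1.10', 'v1.9']
```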

kentfredric commented 7 years ago

Sorting then becomes a thing in the UI, but getting it wrong doesn't mean we fail to announce a new version.

That becomes a question of how it maps to the fedmesg and fmn stuff. People are probably going to want a way to filter "Only the new releases because we don't care about LTS stuff" or want a way to discriminate LTS-releases from "fresh releases" in the data.

And that necessitates some sort of server-side disambiguation.

kentfredric commented 7 years ago

There are plenty of opportunities here to cater to individual projects - optionally ignoring pre-releases if the version scheme supports it, ignoring tags/releases that aren't parsable by the version scheme, etc.

Yeah, you're possibly going to want version-scheme specific ways of identifying the "class" of a version release, be it a "stable" release or a "Development" release. ( For instance, CPAN versions inject "_" in the versions or use a '-TRIAL' suffix to indicate a development release, but the current source Anitya uses for the versions pre-excludes these versions, partly for other complicated reasons around authority and indexing )

ncoghlan commented 7 years ago

The trick @bkabrda came up with for the project that prompted this RFE was to report two fields in API responses: "version" (i.e. the version this response is about) and "latest_version" (i.e., the latest version the server knows about).

For Anitya, that would be enough to provide basic categorisation of "latest releases" vs "maintenance releases" as:

a "latest release" is a never-before-seen release that has a higher version than the previous latest known version ("version" and "latest_version" are identical in announcement message)
a "maintenance release" is a never-before-seen release that has a lower version than the current latest known version ("version" and "latest_version" differ in announcement message)

That would then enable the following approach to announcing more releases without breaking backwards compatibility for existing consumers:

keep the current reporting policy of "latest non-development releases" for the existing anitya.project.version.update topic. All messages in this channel would have "version" and "latest_version" the same.
add a new topic anitya.project.version.maintenance_update. All messages in this channel would have differing "version" and "latest_version" fields.
if backends start gaining the ability to report pre-releases in addition to stable releases, add a new topic anitya.project.version.prerelease_update. All messages in this channel would also have differing "version" and "latest_version" fields, but in this case it would be due to "latest_version" only tracking releases, not pre-releases.
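The categorisation and topic choice above could look roughly like this. The topic names come from the list above; the message shape, the function name `build_message`, and the use of integer tuples as already-parsed versions are assumptions for the sketch.

```python
from typing import Optional, Tuple

Version = Tuple[int, ...]  # assumed: versions already parsed into comparable tuples


def build_message(version: Version, latest: Optional[Version], prerelease: bool = False) -> dict:
    """Pick a topic for a never-before-seen version (de-duplication is the caller's job)."""
    if prerelease:
        # "latest_version" only tracks releases, so it is left untouched here.
        topic = "anitya.project.version.prerelease_update"
        new_latest = latest
    elif latest is None or version > latest:
        topic = "anitya.project.version.update"
        new_latest = version
    else:
        topic = "anitya.project.version.maintenance_update"
        new_latest = latest
    return {"topic": topic, "version": version, "latest_version": new_latest}


# Version numbers below are made up for illustration.
print(build_message((4, 0, 2), latest=(4, 0, 1))["topic"])  # anitya.project.version.update
print(build_message((3, 2, 9), latest=(4, 0, 2))["topic"])  # anitya.project.version.maintenance_update
```

Existing consumers of anitya.project.version.update would keep seeing only messages where the two fields match.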

jeremycline commented 7 years ago

That becomes a question how it maps to the fedmesg and fmn stuff. People are probably going to want a way to filter "Only the new releases because we don't care about LTS stuff" or want a way to discriminate LTS-releases from "fresh releases" in the data.

Yup, I definitely don't want people getting spammed with stuff they don't want. One thing I've got bouncing around in my head, though, is how to get Anitya/the-new-hotness ready for a world with multiple supported streams in Fedora. For example, if we know a project follows semantic versioning we can open dist-git pull requests against the correct branch when a Z release happens.

jeremycline commented 7 years ago

@pypingou @ncoghlan I've been looking at how to make ecosystems define a default version scheme, and I'm curious about the history of the Backend and Ecosystem database model. I'm not sure I see the value of having tables for either one. The fact that they're there is making this change much more complicated than I think it needs to be since I'm really just trying to make sure the database table matches a couple of Python classes we've explicitly defined.

pypingou commented 7 years ago

So the backends were basically a way to categorize the regexes. Instead of having each project write the same regex, we would store them per backend and allow people to use them; they were basically pre-approved/pre-set regexes used to retrieve the versions.

When we moved from cnucnu to anitya, we were able to start using the APIs of the hosting platforms, so the plugin system, instead of just relying on different regexes, started to do a little more and query the API to retrieve the versions.

The idea of ecosystem on the other hand is rather recent and introduced by @ncoghlan and I'm not entirely sure we started using it for its purpose.

ncoghlan commented 7 years ago

For ecosystems, the main benefit the database currently provides is to power the "lookup by name in ecosystem" API: https://release-monitoring.org/api/by_ecosystem/pypi/requests

While that's currently handled as a foreign key lookup, nothing immediately comes to mind that couldn't work just as well with a simple string field holding ecosystem names.

Having the info about default backends and version comparison schemes just living client-side in the plugins would likely be easier in many cases, since we wouldn't need to worry about schema updates and data migrations when we change it.

ncoghlan commented 7 years ago

As far as the "Why do Ecosystems have a table at all?" question, the real answer is "Because Backends do". I wasn't especially familiar with the Anitya code base when I added the by_ecosystem query API, so I didn't question the plugin+DB table approach, I just copied it :)

jeremycline commented 7 years ago

Okay then, I opted to spend some time removing them rather than writing a migration for them.

ncoghlan commented 7 years ago

Looking at the Travis failures on #459, it seems to me that this RFE could also be a good opportunity to remove the hard dependency on the distro-specific RPM bindings, and add test skips to the backends that actually need it.
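A sketch of what such a skip could look like with the standard unittest module. The test class and test name are hypothetical; `rpm.labelCompare` is the comparison function from the RPM Python bindings, and the expected result mirrors the rpmdev-vercmp output quoted at the top of this issue.

```python
import unittest

try:
    import rpm  # distro-specific bindings, often unavailable in CI environments
except ImportError:
    rpm = None


@unittest.skipUnless(rpm is not None, "RPM bindings are not installed")
class RpmComparisonTests(unittest.TestCase):
    """Only runs where the rpm module can be imported; skipped elsewhere."""

    def test_beta_sorts_after_release(self):
        # labelCompare takes two (epoch, version, release) tuples.
        result = rpm.labelCompare(("0", "1.10b1", "1"), ("0", "1.10", "1"))
        self.assertEqual(result, 1)
```

With the import guarded like this, the rest of the test suite can run without the bindings.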

ncoghlan commented 7 years ago

I filed https://github.com/release-monitoring/anitya/issues/460 to cover getting the tests passing in Travis.

jeremycline commented 7 years ago

@ncoghlan yup, I added the Travis stuff anticipating #448 would solve the RPM problem (since it removes the dependency completely), but I like the look of #460; I can later pull it into #448 so that RPM becomes a completely optional dependency.

kentfredric commented 6 years ago

This is just an example I stumbled over today of it handling versions wrong for CPAN:

https://release-monitoring.org/project/7606/
https://metacpan.org/release/CBOR-XS

Monitor thinks 1.41 is the latest ( 2016-02-25 ), when the latest is 1.7 ( 2017-06-27 )

This is "correct" as far as Perl versioning is concerned, but grossly surprising for anyone who interprets a version as dotted-decimal instead of a float.
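The two readings of these exact version strings can be reproduced in a few lines of Python (variable names are illustrative only):

```python
from decimal import Decimal

versions = ["1.41", "1.7"]

# Reading 1: dot-separated integers, as an RPM-style comparison sees it: (1, 41) > (1, 7).
latest_dotted = max(versions, key=lambda v: tuple(int(p) for p in v.split(".")))

# Reading 2: a decimal number, as Perl treats single-dot versions: 1.41 < 1.70.
latest_decimal = max(versions, key=Decimal)

print(latest_dotted)   # 1.41
print(latest_decimal)  # 1.7
```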

carlwgeorge commented 6 years ago

This would be nice because plenty of projects have an LTS release or similar and maintainers might want to hear about new LTS releases and the latest-and-greatest from master.

One thing I've got bouncing around in my head, though, is how to get Anitya/the-new-hotness ready for a world with multiple supported streams in Fedora.

That world is now, with the recent release of Fedora 28 Server with modularity. In addition to tracking the streams for modularity, there are also third party repos such as the IUS project that contain packages tied to an upstream branch (i.e. redis32u will always be redis 3.2.x). I would love to use anitya to track releases for IUS, but it's not feasible until it can notify me of new releases that are lower than the absolute latest.
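For the "tied to an upstream branch" case, the missing piece is picking the newest version within a stream rather than the absolute latest. A minimal sketch, assuming purely numeric dotted versions; the function name and the version numbers are made up for illustration.

```python
from typing import List, Optional


def latest_in_stream(versions: List[str], stream: str) -> Optional[str]:
    """Return the newest version whose leading components equal `stream`, or None."""
    prefix = tuple(int(p) for p in stream.split("."))
    candidates = []
    for version in versions:
        key = tuple(int(p) for p in version.split("."))
        if key[:len(prefix)] == prefix:
            candidates.append((key, version))
    return max(candidates)[1] if candidates else None


seen = ["3.2.11", "4.0.9", "3.2.12", "3.0.7"]
print(latest_in_stream(seen, "3.2"))  # 3.2.12
print(latest_in_stream(seen, "4.0"))  # 4.0.9
```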

Since that's not the purpose of this issue, is there another issue to track discussion around this feature?

EDIT: I dug a little deeper, and found #269.

fedora-infra / anitya

RFE: Backend based version comparison #332