conda / conda-lock

Lightweight lockfile for conda environments
https://conda.github.io/conda-lock/
Other
458 stars 102 forks source link

Alphabetic ordering of packages #491

Closed baszalmstra closed 7 months ago

baszalmstra commented 10 months ago

Checklist

What is the idea?

I noticed that conda-lock.yml files are topologically sorted.

I was wondering why this was chosen over alphabetic sorting. It feels to me that alphabetically sorting the packages makes the format more stable compared to topological sorting. If a package version changes and it adds or removes a dependency this might change the order of packages in the format resulting in a large chance when viewing the diff.

Why is this needed?

It would be ideal that small changes of the lock file would also result in small changes in a diff/patch. Large changes cause reviews to skip reviewing lock-file changes while the small chance might be significant.

Since the dependencies of packages are also stored, the topological sorting of packages can relatively easily be reconstructed from the file itself (Mamba already does this, rattler doesn't need it).

What should happen?

Instead of sorting the packages in the file topologically we sort them by name, by platform, by version. I think this will result in the smallest possible diff when:

Which I think are the most common causes of change.

Additional Context

Would love to hear your thoughts! :)

ismail commented 10 months ago

The main problem with the current setup is that when you add a new package the lock package is shuffled, this makes git diff unusable. Usually, the diff is so big that GitHub won't show it unless you manually load it. Of course, even if you load it manually, making sense of the changes is not easy at all. I'd say there is a usability angle to this, but also a security angle.

With an alphabetically sorted setup, it's a breeze to audit the lock file changes. I hope you consider and accept this proposal.

maresb commented 10 months ago

I suspect that this is an artifact of the old explicit format, where the package URLs are listed without the dependencies. And because the dependency info is not included with the explicit format, the only way to convey the correct installation order is to topologically sort the dependencies beforehand.

With the new format, since topological sorting is fast and easy, I see no reason to do it beforehand. (@mariusvniekerk, please correct me in case I'm missing something.) I'd be happy to consider a PR. (I think it should be really trivial.)

FelixSchwarz commented 9 months ago

I think this was implemented in 70642a832ea4aaf9e24b8ff40ff65ffb35f3b157 with a few related tickets: #181, #184, #170

However I also would like to see alphabetical sorting - this would make reviewing diffs much easier.

maresb commented 9 months ago

Thanks a lot @FelixSchwarz for looking through the history. It really seems like toposort was chosen simply to fix the package order.

It seems to me like as long as we ensure that explicit lockfiles are toposorted, then there should be no problem with sorting the unified YAML format alphabetically.

Also, when toposort is computed breadth-first like here, then we could easily track the toposort_depth as a nonnegative integer. Then we could trivially recreate the current hybrid topological/lexicographical ordering by sorting on the key (toposort_depth, name). (To be explicit, toposort_depth==0 for packages with no dependencies, toposort_depth==1 for packages which only depend on packages with no dependencies, and so on.)

mariusvniekerk commented 9 months ago

So one thing that is done with the sorting is that it explicity uses the same topographical sort as conda so that the behavior for ties in topographical orderiung is the same as for conda. That was largely why this was done. The format does likely contain enough information to perform this sort on-demand now

mariusvniekerk commented 9 months ago

It should be relatively simple to perform alphabetic (maybe (platform, manager, package)) ordering on serializing the Lockfile to disk and to just call https://github.com/conda/conda-lock/blob/f64b74ec91c57218bd11692b8e5cda3611d64649/conda_lock/lockfile/v2prelim/models.py#L74-L75 after loading the Lockfile from disk

maresb commented 8 months ago

We should agree on an ordering of the keys.

I actually had in mind (package, manager, platform), the idea being that if one package changes, then it will typically change similarly for each platform, and then you'd see those similar changes grouped together.

Putting platform first is more in agreement with how conda-lock works internally, but in terms of reviewability of the lockfile it'd mean that a change to a package would be spread across several locations (one for each platform) in the lockfile. And I think this issue is about increasing reviewability.

baszalmstra commented 8 months ago

With categories, is it possible to have multiple versions of the same package? Then we also have to sort by category/version/build.

maresb commented 8 months ago

@baszalmstra, nope, there's a single solution having fixed versions over all categories. Then categories allow for selecting a subset of packages within that single solution.

maresb commented 7 months ago

Thanks @baszalmstra for the excellent suggestion! This is now available in v2.5.0.