Closed baszalmstra closed 7 months ago
The main problem with the current setup is that when you add a new package the lock package is shuffled, this makes git diff
unusable. Usually, the diff is so big that GitHub won't show it unless you manually load it. Of course, even if you load it manually, making sense of the changes is not easy at all. I'd say there is a usability angle to this, but also a security angle.
With an alphabetically sorted setup, it's a breeze to audit the lock file changes. I hope you consider and accept this proposal.
I suspect that this is an artifact of the old explicit format, where the package URLs are listed without the dependencies. And because the dependency info is not included with the explicit format, the only way to convey the correct installation order is to topologically sort the dependencies beforehand.
With the new format, since topological sorting is fast and easy, I see no reason to do it beforehand. (@mariusvniekerk, please correct me in case I'm missing something.) I'd be happy to consider a PR. (I think it should be really trivial.)
I think this was implemented in 70642a832ea4aaf9e24b8ff40ff65ffb35f3b157 with a few related tickets: #181, #184, #170
However I also would like to see alphabetical sorting - this would make reviewing diffs much easier.
Thanks a lot @FelixSchwarz for looking through the history. It really seems like toposort was chosen simply to fix the package order.
It seems to me like as long as we ensure that explicit lockfiles are toposorted, then there should be no problem with sorting the unified YAML format alphabetically.
Also, when toposort is computed breadth-first like here, then we could easily track the toposort_depth
as a nonnegative integer. Then we could trivially recreate the current hybrid topological/lexicographical ordering by sorting on the key (toposort_depth, name)
. (To be explicit, toposort_depth==0
for packages with no dependencies, toposort_depth==1
for packages which only depend on packages with no dependencies, and so on.)
So one thing that is done with the sorting is that it explicity uses the same topographical sort as conda
so that the behavior for ties in topographical orderiung is the same as for conda. That was largely why this was done. The format does likely contain enough information to perform this sort on-demand now
It should be relatively simple to perform alphabetic (maybe (platform, manager, package)) ordering on serializing the Lockfile to disk and to just call https://github.com/conda/conda-lock/blob/f64b74ec91c57218bd11692b8e5cda3611d64649/conda_lock/lockfile/v2prelim/models.py#L74-L75 after loading the Lockfile from disk
We should agree on an ordering of the keys.
I actually had in mind (package, manager, platform)
, the idea being that if one package changes, then it will typically change similarly for each platform, and then you'd see those similar changes grouped together.
Putting platform
first is more in agreement with how conda-lock
works internally, but in terms of reviewability of the lockfile it'd mean that a change to a package would be spread across several locations (one for each platform) in the lockfile. And I think this issue is about increasing reviewability.
With categories, is it possible to have multiple versions of the same package? Then we also have to sort by category/version/build.
@baszalmstra, nope, there's a single solution having fixed versions over all categories. Then categories allow for selecting a subset of packages within that single solution.
Thanks @baszalmstra for the excellent suggestion! This is now available in v2.5.0.
Checklist
What is the idea?
I noticed that
conda-lock.yml
files are topologically sorted.I was wondering why this was chosen over alphabetic sorting. It feels to me that alphabetically sorting the packages makes the format more stable compared to topological sorting. If a package version changes and it adds or removes a dependency this might change the order of packages in the format resulting in a large chance when viewing the diff.
Why is this needed?
It would be ideal that small changes of the lock file would also result in small changes in a diff/patch. Large changes cause reviews to skip reviewing lock-file changes while the small chance might be significant.
Since the dependencies of packages are also stored, the topological sorting of packages can relatively easily be reconstructed from the file itself (Mamba already does this, rattler doesn't need it).
What should happen?
Instead of sorting the packages in the file topologically we sort them by name, by platform, by version. I think this will result in the smallest possible diff when:
Which I think are the most common causes of change.
Additional Context
Would love to hear your thoughts! :)