conda / conda-lock

Lightweight lockfile for conda environments
https://conda.github.io/conda-lock/

Remove the `content_hash`s? #432

Open baszalmstra opened 1 year ago

baszalmstra commented 1 year ago

What is the idea?

While working on porting the conda-lock format to Rust, I'm having a tough time implementing the `content_hash`. The implementation relies heavily on serializing the model to JSON, which is hard to make precisely compatible with conda-lock.

In the broader sense, I wonder what the use of content_hash is. Is it to verify the integrity of the file?

Why is this needed?

To make integrating the conda-lock file format in other languages much easier.

What should happen?

Basically, drop the content_hash field from the format.

Additional Context

No response

maresb commented 1 year ago

Ya, I think it's mainly to ensure that the file hasn't been hand-modified. Personally, I vaguely think it's good, but I don't have strong feelings. I think @mariusvniekerk feels a bit more strongly.

See https://github.com/mamba-org/mamba/issues/1209#issuecomment-944521460 for more context

baszalmstra commented 1 year ago

I think that discussion is more about the input specification than about the content of the lock file itself. The `content_hash` is currently a hash of the `LockedDependency` entries, if I understand correctly.

Additionally, I don't think including a hash of the input provides much value. First of all, you can easily check whether the contents of the lock file are "valid" for a given input: simply check, for each input matchspec, whether there is a locked package in the lockfile that matches it. You can even validate that the entire lock file is consistent by also traversing the dependencies of the locked packages.

Secondly, hashing only the input specification is not enough to create a reproducible conda-lock file, since the repodata, the micromamba version, and so on are not included.
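The first check above can be sketched without any conda APIs at all. In this toy version, a "matchspec" is reduced to a name plus an optional exact version (real matchspecs support builds, channels, and version ranges), and the data shapes are assumptions rather than conda-lock's actual model:

```python
# Toy sketch: verify that every input spec is matched by some locked package.
# "Spec" here is a simplified stand-in for a real conda MatchSpec.

def spec_matches(spec: dict, pkg: dict) -> bool:
    """spec = {"name": ..., "version": optional exact version}."""
    if spec["name"] != pkg["name"]:
        return False
    wanted = spec.get("version")
    return wanted is None or wanted == pkg["version"]

def lockfile_satisfies(specs: list[dict], locked: list[dict]) -> bool:
    """True if each input spec has at least one matching locked package."""
    return all(any(spec_matches(s, p) for p in locked) for s in specs)

locked = [
    {"name": "python", "version": "3.11.4"},
    {"name": "numpy", "version": "1.26.0"},
]
specs = [{"name": "python"}, {"name": "numpy", "version": "1.26.0"}]
print(lockfile_satisfies(specs, locked))                  # True
print(lockfile_satisfies([{"name": "scipy"}], locked))    # False
```

No hashing and no solver are involved; the check is purely semantic, which is the point being argued.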

mariusvniekerk commented 1 year ago

The input_hash comes somewhat from a time before mamba where recomputing a lock was very expensive and this was a cheap way to know whether to even bother attempting to recompute it.

This is also handy in CI workflows that make git commits, to detect whether the input spec for a lock has changed since a prior attempt.
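That cheap "did the input change?" check can be sketched as hashing the spec file and comparing against a recorded value. This is a stand-in only: conda-lock's real input hashing normalizes the parsed spec rather than hashing raw bytes, and the file names here are invented for the demo:

```python
import hashlib
import pathlib
import tempfile

def input_hash(spec_path: pathlib.Path) -> str:
    """sha256 of the raw spec file (toy stand-in for real input hashing)."""
    return hashlib.sha256(spec_path.read_bytes()).hexdigest()

def needs_relock(spec_path: pathlib.Path, stored_hash: str) -> bool:
    """Cheap pre-solver check: only bother relocking when the input changed."""
    return input_hash(spec_path) != stored_hash

# Demo with a throwaway spec file.
spec = pathlib.Path(tempfile.mkdtemp()) / "environment.yml"
spec.write_text("dependencies:\n  - numpy\n")
stored = input_hash(spec)            # recorded at lock time
unchanged = not needs_relock(spec, stored)
spec.write_text("dependencies:\n  - numpy\n  - scipy\n")
changed = needs_relock(spec, stored)
print(unchanged, changed)  # True True
```

The appeal in the pre-mamba era was exactly this: one hash comparison instead of an expensive solve.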

baszalmstra commented 1 year ago

@mariusvniekerk Are you talking about the `input_hash` or the `content_hash`? If I read it correctly, the `content_hash` is not based on the input but on the actual locked-down packages, so simply computing the sha256 of the entire file would likely yield similar results.

The input_hash comes somewhat from a time before mamba where recomputing a lock was very expensive and this was a cheap way to know whether to even bother attempting to recompute it.

You could also very cheaply do the same thing by checking if the input already matches the locked-down packages. This can all be done in Python without ever touching a solver.

My concern is that computing hashes is a fragile thing. If you add or remove a field, or format it slightly differently, the hashes will not match. This makes the format very hard to reproduce in a different environment or in another language, and hard to iterate on. I think semantic checks are a more stable approach than comparing hashes.
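The fragility being described is easy to demonstrate with plain `json` and `hashlib` (not conda-lock's actual hashing code): two serializations of the same data, differing only in key order, hash differently even though a semantic comparison sees them as equal.

```python
import hashlib
import json

data_a = {"name": "numpy", "version": "1.26.0"}
data_b = {"version": "1.26.0", "name": "numpy"}  # same content, reordered

def h(obj, **dumps_kwargs) -> str:
    """Hash a JSON serialization of obj."""
    return hashlib.sha256(json.dumps(obj, **dumps_kwargs).encode()).hexdigest()

print(data_a == data_b)        # True: semantically equal
print(h(data_a) == h(data_b))  # False: key order changed the serialization
# Hashes only agree under careful canonicalization (sort_keys, exact
# separators, ...), which every reimplementation must reproduce bit-for-bit.
print(h(data_a, sort_keys=True) == h(data_b, sort_keys=True))  # True
```

Any divergence in canonicalization between implementations, even whitespace, breaks hash compatibility, while a semantic check is unaffected.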

I would prefer to remove this rather complex piece from the lock file altogether in favor of keeping the file format simple and easily exchangeable with other implementations. WDYT?

baszalmstra commented 1 year ago

Pulling the discussion of #449 in here.

@AlbertDeFusco

I've looked at the arguments in https://github.com/conda/conda-lock/issues/432. As the maintainer of conda-project I find value in having the hash in the conda-lock file. The use case is that a user creates a project with locked dependencies. When they re-visit the project several years later conda project install will check that the hash is still valid and install the packages as they were rather than re-lock before installing. But if the environment.yml had changed the user will be shown a warning.

Given the information in the lock file, one can very easily verify whether or not the specs in an environment.yml are still satisfied by the locked packages in a lock file. Even without the hash in the lock file, people can still, years later, install from the lock file without having to re-lock any of the dependencies. I may be wrong here, but I think this covers most use cases.

The only thing the content-hash provides is knowing whether the environment.yml matches exactly the original environment.yml it was created from. But in that case, doesn't it make more sense to just include the specs themselves, instead of an unstable hash?

If even one thing changes in the years after the lock file was created, and you use an updated conda/mamba/whatever install, the content-hash will not match anyway!

I'm very curious about everyone's thoughts!

AlbertDeFusco commented 1 year ago

Actually, I have given what you say some thought and I think there would be value in retaining the "original requested spec" for packages listed in the lockfile. I've been meaning to make a PR for it.

maresb commented 1 year ago

Even though I already linked it at the top, I think it may be worth repeating how much this reminds me of https://github.com/mamba-org/mamba/issues/1209#issuecomment-944822828

maresb commented 1 year ago

This question has been kicking around in the back of my head, and I finally had an idea that may be worth sharing.

If the idea of `content_hash` is to be able to determine whether or not the lockfile is a lie, then I think that using a hash for this purpose is a suboptimal approach.

What we really want is a function `is_consistent(environment_spec, lockspec) -> bool`, the purpose being to determine whether we need to relock in cases where we are reluctant to do so. An implementation like the pseudocode

return compute_hash(environment_spec) == lockspec.content_hash

is only an approximation to the is_consistent function, since with such a solution there are always equivalent but unequal environment specs with different content hashes.

It seems to me that the "right" way to do this would be something like the following pseudocode:

# check if lockspec satisfies environment_spec
if not environment_spec.is_satisfied_by(lockspec):
    return False

# check that lockspec is minimal
for pkg in lockspec:
    if pkg.is_root_node_in_dependency_dag:
        if environment_spec.is_satisfied_by(lockspec.minus(pkg)):
            return False

# lockspec satisfies environment_spec and is minimal.
return True

TANGENTIAL NOTE: I think it would be really helpful for the next lockfile version to contain a field to indicate which dependencies are non-transitive, and perhaps also the corresponding spec from the environment spec that they fulfill.

SUBNOTE: On the other hand, this might be too much information: if a lockfile is already consistent with the environment spec, but the spec has since changed (in a compatible way), then including the spec in the lockfile would mean the lockfile should probably be updated even though the locked packages are the same. (An example I have in mind is using a conda-lock.yml with a Dockerfile, where we want to avoid busting the cache.)

baszalmstra commented 1 year ago

@maresb Yes that is exactly my point! Thanks for sharing!

This is precisely what we implemented in pixi. We check whether the lockfile is minimal by walking the dependency graph from the specs through the locked packages in the lock file. If we don't visit a package in the lock file, we know it's surplus and we mark the lock file as inconsistent.

If the spec is updated but still compatible with the lockfile, we don't update the lock file.

If relocking is required, we resolve the environment while treating the previous lockfile as the "installed" packages, which ensures lockfiles are minimally updated. This simply removes surplus packages without changing anything else.

We can do all of this without the content-hash.
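The graph walk described above can be sketched as a reachability check: start from the locked packages that match the input specs, traverse their dependencies, and flag anything never visited as surplus. The data shapes are toy assumptions, not pixi's actual types:

```python
from collections import deque

def surplus_packages(spec_names: set[str],
                     lockspec: dict[str, list[str]]) -> set[str]:
    """Walk the dependency graph from the requested specs; any locked
    package never visited is surplus, so the lockfile is inconsistent."""
    visited: set[str] = set()
    queue = deque(name for name in spec_names if name in lockspec)
    while queue:
        pkg = queue.popleft()
        if pkg in visited:
            continue
        visited.add(pkg)
        queue.extend(lockspec.get(pkg, []))  # enqueue dependencies
    return set(lockspec) - visited

lock = {"numpy": ["python"], "python": [], "leftover": []}
print(surplus_packages({"numpy"}, lock))  # {'leftover'}
print(surplus_packages({"numpy", "leftover"}, lock))  # set()
```

Combined with the matchspec-satisfaction check, this reproduces everything the content-hash was meant to guard, purely from the lockfile's own contents.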