Validate that a conda-lock file "agrees" with its sources without triggering a solve

MattF-NSIDC commented 11 months ago

Checklist

[X] I added a descriptive title
[X] I searched open requests and couldn't find a duplicate

What is the idea?

I'd like to be able to validate that a lock file has been created from the current source specifiction's contents. So for example, if someone changes environment.yml and opens a PR, my CI could very quickly catch that the lockfile wasn't updated by re-calculating a hash and comparing it to content_hash in the lockfile.

conda-lock lock --check-input-hash is very close to what I want to do, with the downside that instead of failing immediately if the hash validation fails, it will solve the environment and update the lockfile. What about a --check-input-hash-without-solve or similar flag for this behavior?

Why is this needed?

This helps projects automatically catch and quickly prompt (by e.g. failing CI job or bot comment) contributors who have forgotten to lock the environment after changing it. This repo2docker-based project, for example, expects users to manually update their lockfile or use /condalock in a PR comment to trigger a bot to update the lock file, but they have to remember to do that!

Detecting that environment.yml has changed isn't enough. Sure, if conda-lock.yml also hasn't changed, we can be relatively assured that the user forgot to update the lock-file. But if conda-lock.yml has changed, how do we know the version of environment.yml that's being committed is the same content that generated conda-lock.yml? The contributor could have updated environment.yml, locked it, then changed environment.yml again before committing.

Using pre-commit with --check-input-hash is a great solution for people like me who are going to clone the repo and work on their local machines, but changing environment.yml in the GitHub interface is also a common workflow for some projects/users.

What should happen?

On running some command like conda-lock ensure-sources or conda-lock lock --check-input-hash-without-solve:

Calculate a new content_hash from sources in conda-lock.yml
Does it match content_hash in conda-lock.yml?
If yes, exit 0, if no exit non-zero

No solve or write to a file should occur at any time.

This should also work for "explicit" lockfiles which contain an input_hash in the header.

Additional Context

I apologize if this has been asked already! Tried to find it and came up empty.

MattF-NSIDC commented 11 months ago

This feels like a "good first issue". I'd like to take a stab at it this weekend :)

maresb commented 11 months ago

Hi @MattF-NSIDC, great to hear from you again! This seems fairly straightforward, and I have no objection, but be warned that content_hash may disappear soon. Please see the discussion in #432.

MattF-NSIDC commented 11 months ago

Hey Ben, likewise :)

Do you expect that change will come with a version increment of the unified lock file format? I also think it would be a valuable to retain support for this validation functionality by enabling conda-lock to continue to output and operate on version 1 unified lockfiles!

maresb commented 11 months ago

Yes, we will definitely have a version increment and a transition period.

MattF-NSIDC commented 11 months ago

Wonderful :)

Do you think this suggestion (your pseudocode) are in-scope for a new conda-lock feature? This would address the same problem, but better IMO. Your point about equivalent but unequal specifications was a good one.

maresb commented 11 months ago

That would absolutely be in-scope for a new conda-lock feature! Would you be interested in tackling this? (I'd really like to do it myself, but I'm really time-constrained at the moment, so I'd be really grateful for any help!)

MattF-NSIDC commented 11 months ago

I'm going to try and give it a shot this weekend, will let you know whether I'm able to work it in or not! You may see a PR from my personal account @mfisher87 (I continue to regret having two GitHub accounts!)

maresb commented 11 months ago

That's wonderful!!!

If along the way you find tangential things that you want to improve / document, please be bold and don't hold back.

For code review, I greatly prefer lots of smaller commits rather than one monolith, if you can manage it.

MattF-NSIDC commented 11 months ago

For code review, I greatly prefer lots of smaller commits rather than one monolith, if you can manage it.

No problem :)

mfisher87 commented 11 months ago

I've been using this feature as a learning opportunity for some friends who are new to Python and open-source (and for myself to learn about the conda-lock code). We started with implementing the "check input hashes and exit" approach because it was a quicker path to victory, and didn't quite nail it down in our limited session this weekend. We'll probably get it next weekend and move on to the "check for agreement" approach. I was thinking we could submit them each as separate pull requests, and reject one of them if needed.

If someone else wants to implement any of this, we won't be offended if you don't wait for us. We're just in it for learning :)

conda / conda-lock