conda / conda-lock

Lightweight lockfile for conda environments
https://conda.github.io/conda-lock/

Reconstruct RepodataRecord from lock file #433

Open baszalmstra opened 1 year ago

baszalmstra commented 1 year ago

What is the idea?

While working on rattler I ran into a situation where I wanted to update lock files incrementally.

To do so, I need to pass the "currently installed packages" to the solver. The solver prioritizes the "installed" variants of a package over others, which nudges it toward keeping the installed package variants.
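To make the "nudging" concrete, here is a minimal sketch of one way to put installed variants first among the candidates. It is not how rattler's or conda's solver is actually implemented; `Candidate` and `prefer_installed` are hypothetical names used only for illustration.

```rust
/// Hypothetical, simplified candidate. A real solver works with full package records.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Candidate {
    pub name: String,
    pub version: String,
    pub build: String,
}

/// Order candidates so that any variant that is already installed is tried first.
/// `sort_by_key` is a stable sort, so the relative order of the remaining
/// candidates is preserved.
pub fn prefer_installed(mut candidates: Vec<Candidate>, installed: &[Candidate]) -> Vec<Candidate> {
    // `false` sorts before `true`, so installed variants move to the front.
    candidates.sort_by_key(|c| !installed.contains(c));
    candidates
}
```

A solver that tries candidates in this order will only move away from the locked variant when the other constraints force it to.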

As far as I understand it, conda-lock facilitates incremental lock file updates by creating a fake environment with fake "reconstructed" conda-meta files. In rattler (and in conda too), the files in the conda-meta folder are represented as PrefixRecords. A PrefixRecord is a superset of a RepoDataRecord, which is in turn a superset of a PackageRecord.
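The superset relationship can be pictured as nested structs. This is a trimmed-down sketch: the type names come from the text above, but the selection of fields is an assumption and the real types carry many more.

```rust
/// Trimmed-down package metadata (field selection is an assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct PackageRecord {
    pub name: String,
    pub version: String,
    pub build: String,
    pub build_number: u64,
}

/// A package record plus information about where it came from.
#[derive(Debug, Clone, PartialEq)]
pub struct RepoDataRecord {
    pub package_record: PackageRecord,
    pub file_name: String,
    pub url: String,
    pub channel: String,
}

/// A repodata record plus details about how it was installed into a prefix.
#[derive(Debug, Clone, PartialEq)]
pub struct PrefixRecord {
    pub repodata_record: RepoDataRecord,
    pub files: Vec<String>,
}

impl PrefixRecord {
    /// Reach the innermost record through the nesting.
    pub fn package_record(&self) -> &PackageRecord {
        &self.repodata_record.package_record
    }
}
```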

To be able to do a "perfect" incremental lock file update, we would ideally reconstruct the complete RepoDataRecord from the conda-lock file and pass that to the solver. In a regular conda update this information is typically read from the conda-meta directory. Complete reconstruction is important because a locked package may no longer be available in the repodata.json (for whatever reason), and the lock file should remain valid in that case, even when only parts of it are updated. It's also important because, according to MatchSpec, dependencies can match on any of the RepoDataRecord fields.
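A small sketch of why missing fields matter for matching. This is not the real MatchSpec implementation: `SimpleSpec`, `SimpleRecord`, and `matches` are hypothetical, and the rule that a constrained field missing from the record counts as a mismatch is an assumption made for the example.

```rust
/// Hypothetical spec that may constrain fields beyond the package name.
pub struct SimpleSpec {
    pub name: String,
    pub build_number: Option<u64>,
    pub license: Option<String>,
}

/// Hypothetical record as reconstructed from a lock file; `None` means the
/// lock file did not store that field.
pub struct SimpleRecord {
    pub name: String,
    pub build_number: Option<u64>,
    pub license: Option<String>,
}

/// Returns true when every field constrained by the spec is present in the
/// record and equal to the requested value.
pub fn matches(spec: &SimpleSpec, record: &SimpleRecord) -> bool {
    if spec.name != record.name {
        return false;
    }
    if let Some(want) = spec.build_number {
        if record.build_number != Some(want) {
            return false;
        }
    }
    if let Some(want) = &spec.license {
        if record.license.as_ref() != Some(want) {
            return false;
        }
    }
    true
}
```

Under these assumptions, a record that lost its `build_number` during reconstruction can never satisfy a spec that pins the build number, even though the originally locked package did.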

The issue I ran into is that the current conda-lock file format does not easily allow proper reconstruction of RepoDataRecords. The current models have a single definition of a LockedDependency for both pip and conda packages. I propose we implement this differently, as a union of a PipLockedDependency (about which I don't know enough to describe what it should look like) and a CondaLockedDependency, which would allow complete reconstruction of a RepoDataRecord. I believe that micromamba, mamba, and conda expose enough information to do so.
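In Rust terms, the proposed union could look roughly like the enum below. The two variant type names come from the proposal; the fields, in particular those of `PipLockedDependency`, are placeholders and not a worked-out design.

```rust
/// Conda side: carries what is needed to rebuild a RepoDataRecord
/// (only a few illustrative fields are shown).
#[derive(Debug, Clone, PartialEq)]
pub struct CondaLockedDependency {
    pub name: String,
    pub version: String,
    pub subdir: Option<String>,
    pub build_number: Option<u64>,
}

/// Pip side: placeholder fields only.
#[derive(Debug, Clone, PartialEq)]
pub struct PipLockedDependency {
    pub name: String,
    pub version: String,
}

/// A locked dependency is exactly one of the two kinds.
#[derive(Debug, Clone, PartialEq)]
pub enum LockedDependency {
    Conda(CondaLockedDependency),
    Pip(PipLockedDependency),
}

impl LockedDependency {
    /// Fields shared by both kinds stay reachable without matching at the call site.
    pub fn name(&self) -> &str {
        match self {
            LockedDependency::Conda(c) => &c.name,
            LockedDependency::Pip(p) => &p.name,
        }
    }

    /// Conda-only data is available only when the dependency is a conda package.
    pub fn as_conda(&self) -> Option<&CondaLockedDependency> {
        match self {
            LockedDependency::Conda(c) => Some(c),
            LockedDependency::Pip(_) => None,
        }
    }
}
```

With this shape, code that needs conda-specific fields has to go through `as_conda()`, so "is this conda or pip?" is answered by the type rather than by convention.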

Currently, things that are hard to reproduce are:

Why is this needed?

As explained in "what is the idea?", this is needed to be able to do proper incremental lock file updates.

What should happen?

No response

Additional Context

No response

baszalmstra commented 1 year ago

In rattler we added some additional fields to LockedDependency to be able to completely reconstruct RepoDataRecords from conda-lock files.

The fields we added are:

/// Experimental: architecture field
pub arch: Option<String>,

/// Experimental: the subdir where the package can be found
pub subdir: Option<String>,

/// Experimental: conda build number of the package
pub build_number: Option<u64>,

/// Experimental: see: [Constrains](crate::repo_data::PackageRecord::constrains)
pub constrains: Vec<String>,

/// Experimental: see: [Features](crate::repo_data::PackageRecord::features)
pub features: Option<String>,

/// Experimental: see: [Track features](crate::repo_data::PackageRecord::track_features)
pub track_features: Vec<String>,

/// Experimental: the specific license of the package
pub license: Option<String>,

/// Experimental: the license family of the package
pub license_family: Option<String>,

/// Experimental: If this package is independent of architecture this field specifies in what way. See
/// [`NoArchType`] for more information.
pub noarch: NoArchType,

/// Experimental: The size of the package archive in bytes
pub size: Option<u64>,

/// Experimental: The date this entry was created.
pub timestamp: Option<chrono::DateTime<chrono::Utc>>,

We also discovered another important reason to do so. When rattler (and micromamba) create an environment from a lock file without reading additional repodata, all the information stored in the conda-meta/ folder is retrieved from the conda-lock file. However, since some information is missing (like licenses), some tools fail to work properly when using environments installed from lock files.

I propose we add the same fields in conda-lock! :)

maresb commented 1 year ago

Yes, I've been wanting to implement something like this for a while. For me, one of the most confusing parts of the conda-lock codebase is understanding when a dependency is conda, pip, or either.

I've been especially fond for quite some time of the idea of including the timestamp data, since the maximum over timestamps gives an approximate but stable last-locked time.
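As a sketch of the timestamp idea, assuming each locked package carries an optional timestamp in seconds since the Unix epoch (the unit and the function name `approximate_last_locked` are assumptions for illustration):

```rust
/// Returns the newest package timestamp, or `None` if no package has one.
/// The result depends only on the locked packages, so it stays the same
/// until the lock file content actually changes.
pub fn approximate_last_locked(timestamps: &[Option<u64>]) -> Option<u64> {
    // `flatten` skips packages without a timestamp.
    timestamps.iter().flatten().copied().max()
}
```

The value is approximate because it is the upload time of the newest locked package, not the moment the lock was computed.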

baszalmstra commented 1 year ago

To me, it makes sense to have two alternative data structures (LockedCondaDependency and LockedPipDependency). Conda and Python have relatively different fields and mixing the two seems complicated.