Reconstruct RepoDataRecord from lock file

baszalmstra commented 1 year ago

Checklist

[X] I added a descriptive title
[X] I searched open requests and couldn't find a duplicate

What is the idea?

While working on rattler I ran into a situation where I wanted to update lock files incrementally.

To do so, I need to pass the "currently installed packages" to the solver. The solver prioritizes the "installed" variants of a package over others, which nudges it toward keeping the installed package variants.

As far as I understand it, conda-lock facilitates incremental lock file updates by creating a fake environment with fake "reconstructed" conda-meta files. In rattler (and in conda too), the files in the conda-meta folder are represented as PrefixRecords. A PrefixRecord is a superset of a RepoDataRecord, which is in turn a superset of a PackageRecord.

PackageRecord contains data read from repodata.json files.
RepoDataRecord contains the same data as PackageRecord, extended with information about the channel (the URL of the package, the filename, and a string representation of the channel).
PrefixRecord contains the same data as RepoDataRecord but additionally includes information about how the package was installed.
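
To make the nesting concrete, here is a minimal Rust sketch of the three record types. The field selection is heavily simplified and the struct layouts are an assumption for illustration, not rattler's actual definitions; in particular, `files` is a hypothetical stand-in for the install-time data.

```rust
/// Simplified stand-in for the data read from repodata.json.
#[derive(Debug, Clone, PartialEq)]
pub struct PackageRecord {
    pub name: String,
    pub version: String,
    pub build: String,
    pub build_number: u64,
}

/// A PackageRecord plus information about where the package came from.
#[derive(Debug, Clone, PartialEq)]
pub struct RepoDataRecord {
    pub package_record: PackageRecord,
    pub url: String,
    pub file_name: String,
    pub channel: String,
}

/// A RepoDataRecord plus information about how the package was installed.
#[derive(Debug, Clone, PartialEq)]
pub struct PrefixRecord {
    pub repodata_record: RepoDataRecord,
    /// Hypothetical example of install-time data: files written into the prefix.
    pub files: Vec<String>,
}

impl PrefixRecord {
    /// Narrow a PrefixRecord down to the RepoDataRecord it wraps.
    pub fn as_repodata_record(&self) -> &RepoDataRecord {
        &self.repodata_record
    }
}
```

Because each level wraps the previous one, anything that has a PrefixRecord (for example from conda-meta) can hand the solver the inner RepoDataRecord without losing information.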

To be able to do a "perfect" incremental lock file update we would ideally completely reconstruct the information of the RepoDataRecord from the conda-lock file and pass that to the solver. In a regular conda update this information is typically read from the conda-meta directory. Complete reconstruction is important because if a locked package is no longer available in the repodata.json (for whatever reason), the lock file remains valid, even when updating parts of the lock file. It's also important because a MatchSpec in a dependency can match on any of the RepoDataRecord fields.

The issue I run into is that the current conda-lock file format does not easily allow the proper reconstruction of RepoDataRecords. The current models have a single definition of a LockedDependency for both pip and conda packages. I propose we implement this differently, as a union of a PipLockedDependency (about which I don't know enough to describe what it should look like) and a CondaLockedDependency that allows the complete reconstruction of a RepoDataRecord. I believe that micromamba, mamba, and conda expose enough information to do so.
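
As a rough sketch of what such a union could look like in Rust: the type names follow the proposal above, but the fields shown and the helper functions `as_conda` and `installed_conda_packages` are assumptions made for illustration, not an existing conda-lock or rattler API.

```rust
/// Conda-specific locked entry; only a few illustrative fields are shown.
#[derive(Debug, Clone, PartialEq)]
pub struct CondaLockedDependency {
    pub name: String,
    pub version: String,
    pub url: String,
    pub build_number: Option<u64>,
    pub subdir: Option<String>,
}

/// Pip-specific locked entry; its real shape is left open in the proposal.
#[derive(Debug, Clone, PartialEq)]
pub struct PipLockedDependency {
    pub name: String,
    pub version: String,
    pub url: String,
}

/// A locked dependency is exactly one of the two kinds.
#[derive(Debug, Clone, PartialEq)]
pub enum LockedDependency {
    Conda(CondaLockedDependency),
    Pip(PipLockedDependency),
}

impl LockedDependency {
    /// Returns the conda entry, or None for a pip entry.
    pub fn as_conda(&self) -> Option<&CondaLockedDependency> {
        match self {
            LockedDependency::Conda(dep) => Some(dep),
            LockedDependency::Pip(_) => None,
        }
    }
}

/// Collect the conda entries, e.g. to pass them to a solver as "installed".
pub fn installed_conda_packages(locked: &[LockedDependency]) -> Vec<&CondaLockedDependency> {
    locked.iter().filter_map(|dep| dep.as_conda()).collect()
}
```

With a tagged union, code that only cares about conda packages never has to guess whether a conda-only field is meaningful for a given entry.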

Currently, things that are hard to reproduce are:

The channel from which the package was installed (could potentially be derived from the URL though)
The arch and subdir field
The build_number field (very important)
constrains
features and track_features
license and license_family
noarch
size
timestamp
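
For the first two items, a sketch of how the channel and subdir might be derived from the package URL. It assumes the URL has the shape `<channel>/<subdir>/<filename>`; the function name `split_package_url`, the `UrlParts` type, and the example URL are hypothetical, and URLs that don't follow this layout would produce wrong results. That fragility is one reason to store the fields explicitly instead.

```rust
/// The pieces of a package URL, assuming `<channel>/<subdir>/<filename>`.
#[derive(Debug, Clone, PartialEq)]
pub struct UrlParts {
    pub channel: String,
    pub subdir: String,
    pub file_name: String,
}

/// Split a package URL on its last two slashes.
/// Returns None if the URL has fewer than two slashes or an empty segment.
pub fn split_package_url(url: &str) -> Option<UrlParts> {
    // The last path segment is the archive filename.
    let (rest, file_name) = url.rsplit_once('/')?;
    // The segment before it is assumed to be the subdir (e.g. "linux-64").
    let (channel, subdir) = rest.rsplit_once('/')?;
    if channel.is_empty() || subdir.is_empty() || file_name.is_empty() {
        return None;
    }
    Some(UrlParts {
        channel: channel.to_string(),
        subdir: subdir.to_string(),
        file_name: file_name.to_string(),
    })
}
```

For a URL such as `https://conda.anaconda.org/conda-forge/linux-64/foo-1.0-h123_0.tar.bz2` this yields the channel `https://conda.anaconda.org/conda-forge`, the subdir `linux-64`, and the filename. Fields like build_number, license, or timestamp cannot be recovered this way.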

Why is this needed?

As explained in "what is the idea?", this is needed to be able to do proper incremental lock file updates.

What should happen?

No response

Additional Context

No response

baszalmstra commented 1 year ago

In rattler we added some additional fields to LockedDependency to be able to completely reconstruct RepoDataRecords from conda-lock files.

The fields we added are:

/// Experimental: architecture field
pub arch: Option<String>,

/// Experimental: the subdir where the package can be found
pub subdir: Option<String>,

/// Experimental: conda build number of the package
pub build_number: Option<u64>,

/// Experimental: see: [Constrains](crate::repo_data::PackageRecord::constrains)
pub constrains: Vec<String>,

/// Experimental: see: [Features](crate::repo_data::PackageRecord::features)
pub features: Option<String>,

/// Experimental: see: [Track features](crate::repo_data::PackageRecord::track_features)
pub track_features: Vec<String>,

/// Experimental: the specific license of the package
pub license: Option<String>,

/// Experimental: the license family of the package
pub license_family: Option<String>,

/// Experimental: If this package is independent of architecture this field specifies in what way. See
/// [`NoArchType`] for more information.
pub noarch: NoArchType,

/// Experimental: The size of the package archive in bytes
pub size: Option<u64>,

/// Experimental: The date this entry was created.
pub timestamp: Option<chrono::DateTime<chrono::Utc>>,
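
To illustrate how optional fields like these could feed a reconstruction step, here is a small self-contained sketch. The types `LockedCondaPackage` and `ReconstructedRecord` and the function `reconstruct` are hypothetical and much smaller than the real structs, and the choice of which fields are required (here subdir and build_number) is an assumption, not rattler's behavior.

```rust
/// Hypothetical, trimmed-down locked entry with some of the optional fields.
#[derive(Debug, Clone, Default)]
pub struct LockedCondaPackage {
    pub name: String,
    pub version: String,
    pub build: String,
    pub subdir: Option<String>,
    pub build_number: Option<u64>,
    pub license: Option<String>,
    pub size: Option<u64>,
}

/// Hypothetical, trimmed-down record that a solver could consume.
#[derive(Debug, Clone, PartialEq)]
pub struct ReconstructedRecord {
    pub name: String,
    pub version: String,
    pub build: String,
    pub subdir: String,
    pub build_number: u64,
    pub license: Option<String>,
    pub size: Option<u64>,
}

/// Rebuild a record, reporting which required field is missing.
pub fn reconstruct(locked: &LockedCondaPackage) -> Result<ReconstructedRecord, String> {
    let subdir = locked
        .subdir
        .clone()
        .ok_or_else(|| format!("{}: missing subdir", locked.name))?;
    let build_number = locked
        .build_number
        .ok_or_else(|| format!("{}: missing build_number", locked.name))?;
    Ok(ReconstructedRecord {
        name: locked.name.clone(),
        version: locked.version.clone(),
        build: locked.build.clone(),
        subdir,
        build_number,
        // Purely informational fields stay optional.
        license: locked.license.clone(),
        size: locked.size,
    })
}
```

A lock file written before the extra fields existed would fail here with a clear message instead of silently producing an incomplete record.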

We also discovered another important reason to do so. When rattler (and micromamba) create an environment from a lock file without reading additional repodata, all the information that is stored in the conda-meta/ folder is retrieved from the conda-lock file. However, since some information is missing (like licenses), some tools fail to work properly when using environments installed from lock files.

I propose we add the same fields in conda-lock! :)

maresb commented 1 year ago

Yes, I've been wanting to implement something like this for a while. One of the most confusing parts for me of the conda-lock codebase is understanding when a dependency is conda or pip or either.

I've been especially fond for quite some time of the idea of including the timestamp data, since the maximum over timestamps gives an approximate but stable last-locked time.
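
A minimal sketch of that idea, assuming timestamps are stored as optional integers (for example milliseconds since the Unix epoch); the function name `approximate_last_locked` is made up for illustration.

```rust
/// Approximate "last locked" time: the newest timestamp among all packages.
/// Packages without a timestamp are ignored; returns None if none has one.
pub fn approximate_last_locked(timestamps: &[Option<u64>]) -> Option<u64> {
    timestamps.iter().copied().flatten().max()
}
```

The result is stable because it depends only on the locked packages themselves, not on when the lock command happened to run.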

baszalmstra commented 1 year ago

To me, it makes sense to have two alternative data structures (LockedCondaDependency and LockedPipDependency). Conda and Python have relatively different fields and mixing the two seems complicated.
