conda / ceps

Conda Enhancement Proposals
Creative Commons Zero v1.0 Universal

initial cep for repodata state #46

Open wolfv opened 1 year ago

wolfv commented 1 year ago

Created a quick CEP for the new repodata state format (cc @dholth as we had some discussions about this. Happy to list you as author!).

I'm also happy to make changes to the spec. My hope is just that mamba and conda can both share the same format.

dholth commented 1 year ago

One thing about this is that when you start using alternative formats (.zst, .jlap), the remote headers (last-modified, etag, cache-control) come from the alternate file.

dholth commented 1 year ago

An example of the current jlap branch's .state.json.

have is the nominal hash (what the hash of the original repodata.json was according to jlap)

have_hash is the actual hash on disk since we don't serialize with exactly the same sorting, formatting as conda-index. Could be used instead of mtime (if file on disk doesn't match have_hash, then it doesn't correspond to this state.json)

jlap includes too many headers from the jlap request, an intermediate hash iv corresponding to pos bytes in the file, and the last line of the jlap file.

{
 "_url": "https://repo.anaconda.com/pkgs/main/osx-arm64/repodata.json",
 "_mod": "Fri, 06 Jan 2023 20:09:20 GMT",
 "_etag": "W/\"0134da379063a17831ef4ed73d3489dd\"",
 "_cache_control": "public, max-age=30",
 "have": "e45656091705b1be55b72a3b48068520cc3309560ca0d37fb96f2b2ea559c81f",
 "have_hash": "e45656091705b1be55b72a3b48068520cc3309560ca0d37fb96f2b2ea559c81f",
 "mtime": 1673274986.4032788,
 "jlap": {
  "headers": {
   "date": "Mon, 09 Jan 2023 14:36:25 GMT",
   "content-type": "text/plain",
   "transfer-encoding": "chunked",
   "connection": "keep-alive",
   "x-amz-id-2": "81qDgERSlA/bEpQQeL/YBn3BniAaB37uUkbD5ZySYC/h9JWb+8Sbg1ik70ufAvNtzTeGHqiwZHI=",
   "x-amz-request-id": "Q1HHT9KCY69TQXQX",
   "last-modified": "Fri, 06 Jan 2023 20:09:20 GMT",
   "x-amz-version-id": "BaggXYx0RmtOxe4B6PIl3IDX_G8Ryt5X",
   "etag": "W/\"0134da379063a17831ef4ed73d3489dd\"",
   "cf-cache-status": "MISS",
   "expires": "Mon, 09 Jan 2023 14:36:55 GMT",
   "cache-control": "public, max-age=30",
   "set-cookie": "__cf_bm=YfYijeZ0Y8xc_XBZebA1UA1bX9uz47v67b3guqZFfY0-1673274985-0-Ab9B1pIhPfM0fQkGk5rTS9A5vvc3tPD37jnV+pmXSI2C82sgdxdKaBCB3zhj4wQ6P1yVmaYRpioaARDTmt6H5s0=; path=/; expires=Mon, 09-Jan-23 15:06:25 GMT; domain=.anaconda.com; HttpOnly; Secure; SameSite=None",
   "vary": "Accept-Encoding",
   "server": "cloudflare",
   "cf-ray": "786de7b3ffe9244d-ATL",
   "content-encoding": "gzip"
  },
  "iv": "2f598f0587410d455c8370bef19759fd7a25b4aad4d65c3b3b1be7c7422a938c",
  "pos": 1714976,
  "footer": {
   "url": "repodata.json",
   "latest": "e45656091705b1be55b72a3b48068520cc3309560ca0d37fb96f2b2ea559c81f"
  }
 }
}
dholth commented 1 year ago

In the existing format "_url": "https://conda.anaconda.org/conda-forge/osx-arm64", doesn't contain the filename. May want to continue doing that especially since different repodata variants matter. Or ignore the field.

wolfv commented 1 year ago

Any chance you have time to make a PR against my branch with your change suggestions? Or I can also try to give you edit rights, if you want :)

dholth commented 1 year ago

@wolfv do you mean https://github.com/wolfv/ceps/pull/1

dholth commented 1 year ago

If we are going to play with nanoseconds, let's go ahead and replace all timestamps (except those that are web server headers) with those numbers. e.g. last_checked.

We will quickly release a conda with the main last_modified, cache_control, etag state but it will take us a few more releases to get to "last checked zstd"

Do we standardize how environment locking works?

wolfv commented 1 year ago

environment locking as in conda-lock or as in filesystem lockfiles to prevent overwriting things?

wolfv commented 1 year ago

My reasoning for the different formats is that for the mtime checking it is "precise" since we actually want to match the file on disk.

For the "last time checked zst" we just want a timestamp so we can check that it's been more than 2 weeks and it doesn't need to be precise. We had this function around to create a RFC3339 string representation so I just used that. We could also use nanoseconds but this one doesn't need to be precise.
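
A minimal sketch of the "last time checked zst" test described above, assuming the state stores a `has_zst` entry with `last_checked` (an RFC 3339 string) and `value` keys. The function names and the exact two-week constant are illustrative assumptions, not taken from mamba or conda.

```python
from datetime import datetime, timedelta, timezone

# Assumption: re-probe for repodata.json.zst after two weeks, as mentioned above.
RECHECK_AFTER = timedelta(weeks=2)


def parse_rfc3339(text):
    """Parse a UTC RFC 3339 timestamp such as '2023-01-27T02:09:53Z'."""
    # Older versions of datetime.fromisoformat() reject a trailing 'Z',
    # so replace it with an explicit UTC offset first.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def should_recheck_zst(has_zst, now=None):
    """Return True when the server should be probed for a .zst file again."""
    # A missing entry means the check has never been recorded.
    if not has_zst or "last_checked" not in has_zst:
        return True
    now = now or datetime.now(timezone.utc)
    # Second-level precision is plenty for a two-week threshold.
    return now - parse_rfc3339(has_zst["last_checked"]) > RECHECK_AFTER
```

Because only a coarse age comparison is needed here, the string timestamp loses nothing compared with a nanosecond counter.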

dholth commented 1 year ago

I tried out micromamba's January release, and it downloads repodata.json.zst very quickly. Producing a state file included below.

In Python it is easier to store nanoseconds as a single number, time.time_ns().bit_length() is only 61 bits today.

In [5]: datetime.datetime.fromtimestamp(2**64//1e9)
Out[5]: datetime.datetime(2554, 7, 21, 19, 34, 33)

In the jlap branch I store "jlap_unavailable" as a timestamp, assuming you check alternative formats in a known order of preference unless you know they are 404's.

{
    "cache_control": "public, max-age=30",
    "etag": "\"a9c77cc4c9b1a947375d53326f1604de\"",
    "file_mtime": {
        "nanoseconds": 960132000,
        "seconds": 1674678293
    },
    "file_size": 6974249,
    "has_zst": {
        "last_checked": "2023-01-27T02:09:53Z",
        "value": true
    },
    "mod": "Wed, 25 Jan 2023 19:54:35 GMT",
    "url": "https://repo.anaconda.com/pkgs/main/osx-arm64/repodata.json.zst"
}
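
As a rough illustration of the two timestamp representations discussed here, the following sketch converts between the `file_mtime` object with `seconds` and `nanoseconds` shown above and a single integer count of nanoseconds. The helper names are hypothetical.

```python
import os

NS_PER_SECOND = 1_000_000_000


def split_ns(mtime_ns):
    """Turn a single nanosecond count into a seconds/nanoseconds pair."""
    seconds, nanoseconds = divmod(mtime_ns, NS_PER_SECOND)
    return {"seconds": seconds, "nanoseconds": nanoseconds}


def join_ns(file_mtime):
    """Turn a seconds/nanoseconds pair back into one integer."""
    return file_mtime["seconds"] * NS_PER_SECOND + file_mtime["nanoseconds"]


def file_mtime_ns(path):
    """Read the modification time of a file as integer nanoseconds."""
    # st_mtime_ns avoids the rounding that the float st_mtime can introduce.
    return os.stat(path).st_mtime_ns
```

Either form carries the same information; the single integer is simply more convenient to compare against `os.stat(...).st_mtime_ns` in Python.
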
dholth commented 1 year ago

Am working on a "lock byte 21" implementation like mamba; where we lock the .state.json before doing anything else (even before reading or stat'ing cached repodata.json), and .state.json is the only lockfile. Then we always try to keep the lock for as short a time as possible, e.g. download repodata.json to a temp file, then stat it, then move it on top of the cache filename. This locking is only for the integrity of the repodata.json cache, not per-directory locking to prevent package cache overwrites etc.

On Windows, it might be more appropriate to lock and overwrite repodata.json, instead of the unix style of atomically moving a tempfile on top of the desired file (on Windows, you cannot move a file on top of an existing file; you have to delete the existing file first)
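
A Unix-only sketch of the "lock byte 21" idea, assuming a POSIX byte-range lock taken on the .state.json file itself. The helper name and the one-byte lock length are assumptions; this is not the conda or mamba implementation, and Windows would need a different API.

```python
import fcntl
import os
from contextlib import contextmanager

LOCK_OFFSET = 21  # byte position named in the comment above
LOCK_LENGTH = 1   # assumption: lock a single byte


@contextmanager
def locked_state_file(path):
    """Open the state file and hold an exclusive lock on one byte of it."""
    # O_CREAT lets the first caller create the file without truncating it.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+b") as fh:
        # lockf(fd, cmd, len, start): blocks until the byte range is free.
        # POSIX allows locking a range beyond the current end of file.
        fcntl.lockf(fh, fcntl.LOCK_EX, LOCK_LENGTH, LOCK_OFFSET)
        try:
            yield fh
        finally:
            fh.flush()
            fcntl.lockf(fh, fcntl.LOCK_UN, LOCK_LENGTH, LOCK_OFFSET)
```

The lock should wrap only the short read-check-replace sequence, so that a slow download does not block other processes.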

wolfv commented 1 year ago

you might also run into issues with atomic renames if your temporary file is on a different fs. /tmp is often a different fs :) So I'd stat it after moving.

dholth commented 1 year ago

The temporary file is in the same cache folder.
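
The write-then-move sequence can be sketched as follows, assuming the temporary file is created in the same directory as the cache file so that the rename never crosses filesystems. The function name is hypothetical.

```python
import os
import tempfile


def replace_cached_file(cache_path, data):
    """Write data to a temp file next to cache_path, stat it, then move it."""
    cache_dir = os.path.dirname(os.path.abspath(cache_path))
    # Same directory as the target, hence the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # A rename keeps size and mtime, so the values recorded here
        # still describe the file after the move.
        stat_result = os.stat(tmp_path)
        # os.replace is documented to overwrite an existing destination.
        os.replace(tmp_path, cache_path)
    except BaseException:
        # Do not leave stray temp files in the cache folder on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return stat_result
```

The returned `stat_result` supplies the size and `st_mtime_ns` values that would then be written into the state file.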

baszalmstra commented 1 year ago

@dholth Does your implementation also produce a <hash>.lock file (like mamba) or does the <hash>.state.json file also function as a lock file? I guess that would also make a lot of sense. Would you be able to formalize that in the CEP @wolfv ?

> In the existing format "_url": "https://conda.anaconda.org/conda-forge/osx-arm64", doesn't contain the filename. May want to continue doing that especially since different repodata variants matter. Or ignore the field.

I agree with this. I would keep it though to ensure hash collisions don't form an issue. I also like that I can find the original URL from the hash. Can we maybe formalize that in the CEP @wolfv ?

> Have implemented (nicer) non-underscored names in a conda branch.

@dholth What did you end up calling these? I especially think the mod could just be named last_modified, similar to the HTTP header from where it comes. Also, shouldn't a few of these fields be optional, given that not all HTTP servers return these headers. WDYT @wolfv ?

dholth commented 1 year ago

For example, I've been working in this branch; the link should take you to conda/gateways/repodata/__init__.py with code to handle the ISO timestamps. The RepodataState class should clearly show the current format.

From my reading of the mamba code, it locks a certain byte in the repodata.json / repodata.state.json files. I didn't notice it creating a .lock file although that is maybe a more old-school technique? and necessary for locking a complete directory - I haven't attempted to lock complete directories in my branch. So far I am only trying to maintain the integrity of the repodata.json cache and don't do anything to prevent e.g. parallel conda's downloading it twice, they simply won't corrupt the cache. Feedback & PR's against PR's welcome.

dholth commented 1 year ago

@baszalmstra our implementation assumes all the keys are optional, same as if the file was missing. We also treat the keys as missing if state.json doesn't match the cached repodata.json file. We could formally add that to the CEP.
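
A sketch of the "all keys optional" behaviour, assuming a flat state layout with the field names that appear in this thread (`url`, `etag`, `mod`, `cache_control`, `mtime_ns`, `size`). The `CacheState` class and `load_state` function are hypothetical and do not reproduce conda's RepodataState.

```python
import json
import os
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CacheState:
    # Every key is optional; None means "not recorded".
    url: Optional[str] = None
    etag: Optional[str] = None
    mod: Optional[str] = None
    cache_control: Optional[str] = None
    mtime_ns: Optional[int] = None
    size: Optional[int] = None


def load_state(state_path, repodata_path):
    """Load the state, or return an empty one if it cannot be trusted."""
    try:
        with open(state_path, encoding="utf-8") as fh:
            raw = json.load(fh)
        stat = os.stat(repodata_path)
    except (OSError, ValueError):
        # A missing or unreadable file is treated like an empty state.
        return CacheState()
    if not isinstance(raw, dict):
        return CacheState()
    # State that does not describe the cached file counts as missing.
    if raw.get("mtime_ns") != stat.st_mtime_ns or raw.get("size") != stat.st_size:
        return CacheState()
    known = {f.name: raw[f.name] for f in fields(CacheState) if f.name in raw}
    return CacheState(**known)
```

Unknown keys are ignored in this sketch, so a newer writer would not break an older reader.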

baszalmstra commented 1 year ago

Our state data structure in rattler looks like this.

You may note that besides checking the timestamp and the size we also check the repodata.json against a blake2 hash. If the timestamp and the size won't match but the blake2 hash does still match, we consider the data to be up-to-date.
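
The check described above can be sketched as a cheap timestamp-and-size comparison with a hash fallback. This is a Python illustration, not rattler's code: the `blake2_hash` key name and the choice of the BLAKE2b variant with a 32-byte digest are assumptions.

```python
import hashlib
import os


def blake2_256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks with BLAKE2b truncated to 256 bits."""
    digest = hashlib.blake2b(digest_size=32)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_is_current(state, repodata_path):
    """Return True if the cached repodata.json still matches the state."""
    try:
        stat = os.stat(repodata_path)
    except OSError:
        return False
    # Fast path: timestamp and size both match.
    if state.get("mtime_ns") == stat.st_mtime_ns and state.get("size") == stat.st_size:
        return True
    # Slow path: the metadata changed, but the content may be identical.
    expected = state.get("blake2_hash")
    return expected is not None and expected == blake2_256_of_file(repodata_path)
```

The fallback only costs a full read of the file in the uncommon case where the metadata disagrees.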

I like the idea of having all keys be optional. Currently, the mtime_ns, size, and url in our implementation are not optional.

We also create an extra lockfile (.lock) that guards both the repodata.json and the state.json file. Judging from its observed behavior, I think this is also what mamba does. But looking at the code, mamba creates several lock files on several different files throughout the process.

dholth commented 1 year ago

I was preparing to add the hash as well, at least for jlap. I have two hash fields: NOMINAL_HASH = "nominal_hash" and ON_DISK_HASH = "actual_hash". nominal_hash is the hash according to the .jlap file, and actual_hash is the hash after we json.dumps the updated data.

dholth commented 1 year ago

I assume mamba is also trying to lock whole directories.

The development conda implementation also does the trick of writing the new repodata.json to a new file, stat'ing the temporary name, and moving it on top of the old file.

We are always using BLAKE2(256) which are the same length as sha-256 hashes, but "by default" blake2 produces an overkill 512-bit hash.
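
To illustrate the digest lengths mentioned above with Python's standard `hashlib`: the sketch assumes the BLAKE2b variant, and the input bytes are made up.

```python
import hashlib

data = b"example repodata bytes"

# With no digest_size argument, blake2b produces 64 bytes (512 bits).
default_digest = hashlib.blake2b(data).hexdigest()

# digest_size=32 yields 256 bits, the same length as a SHA-256 digest.
short_digest = hashlib.blake2b(data, digest_size=32).hexdigest()
sha256_digest = hashlib.sha256(data).hexdigest()

# Hex strings use two characters per byte.
print(len(default_digest), len(short_digest), len(sha256_digest))  # 128 64 64
```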

baszalmstra commented 1 year ago

> The development conda implementation also does the trick of writing the new repodata.json to a new file, stat'ing the temporary name, and moving it on top of the old file.

Yeah Rattler does the same. (On windows we use some Win32 API to achieve this).

> We are always using BLAKE2(256) which are the same length as sha-256 hashes, but "by default" blake2 produces an overkill 512-bit hash.

Still, it might be nice to include the algorithm used in either the key or value. It doesn't add overhead but makes it easier for others to deduce what's going on. We could also do "on_disk_hash" : "blake2:blabla..."?
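
A small sketch of the value-prefix variant suggested above, assuming an `algorithm:hexdigest` layout. The helper names are hypothetical, and mapping the `blake2` label to BLAKE2b with a 32-byte digest is an assumption.

```python
import hashlib


def format_prefixed_hash(data):
    """Produce a value such as 'blake2:<64 hex characters>'."""
    # Assumption: 'blake2' here means BLAKE2b with a 256-bit digest.
    return "blake2:" + hashlib.blake2b(data, digest_size=32).hexdigest()


def parse_prefixed_hash(value):
    """Split 'algorithm:hexdigest' into its two parts."""
    algorithm, sep, hex_digest = value.partition(":")
    if not sep or not algorithm or not hex_digest:
        raise ValueError("expected 'algorithm:hexdigest', got %r" % (value,))
    # bytes.fromhex raises ValueError if the digest is not valid hex.
    bytes.fromhex(hex_digest)
    return algorithm, hex_digest
```

A reader that does not recognise the algorithm label can then treat the hash as missing instead of comparing incompatible digests.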

dholth commented 1 year ago

For anything crypto-adjacent I'd let the version # fix the exact hash used.

baszalmstra commented 1 year ago

In that case, since we can just change keys when changing versions anyway, let's name the key something with blake2. At least then it's clear from reading the file. You also have this in the repodata with sha2 or md5.

dholth commented 1 year ago

Parametrize important key names https://github.com/conda/conda/pull/12461/files#diff-813ca3bd61f56355fb3ea7c560d18b892d42d62a07fa0e78666dcfa50c5fda13R64