conda-incubator / conda-store

Data science environments, for collaboration. ✨
https://conda.store
BSD 3-Clause "New" or "Revised" License
145 stars 50 forks source link

ENH - Shorten hash used in environment path #611

Closed asmeurer closed 11 months ago

asmeurer commented 1 year ago

Context

Environment paths currently use a hash of the environment plus a timestamp plus the environment name. For example 99108419ad0fd922fdeff9bbc434b58d41f68e3f923a83f6a7ab19568463bc82-20230915-201619-848782-1-test for an environment named "test".

This path is very long (90 characters plus the environment name) and causes issues on Windows where the default filesystem limits paths to 260 characters, and some packages can have long paths. See https://github.com/conda-incubator/conda-store/issues/588. For example, notebook has a path with 122 characters (https://github.com/conda-incubator/conda-store/issues/588#issuecomment-1747541889), so 122 + 90 is 212 characters, leaving only 48 characters for the root of wherever the conda-store state directory is stored plus the environment name.

Long paths can also cause issues on other platforms for some packages that are built to only be able to be installed in paths up to 255 characters. See https://github.com/conda-incubator/conda-store/issues/588#issuecomment-1751556643

Value and/or benefit

On Windows, this can be fixed by setting a registry key, but this requires administrator privileges, so some deployments might not be able to do it.

Anything else?

See https://github.com/conda-incubator/conda-store/issues/588 for more technical details on the Windows path length problem.

My suggestion would be to use a shorter hash. We can incorporate the timestamp into the hash, since it is already stored in the database. However, we would need to make sure this is done in a way that is backwards compatible.

jaimergp commented 1 year ago

I'd consider using a different encoding for the hash. Our current base16 could be safely extended to base32 and hence we can afford a shorter hash. The date could be a shorter (yet less readable) timestamp.

If we want readability, I'd keep a series of symlinks to the hashed path. If you want to map back, you can always query the database. Convenience shouldn't trump usability.

nkaretnikov commented 1 year ago

[UPD: Nevermind, see my messages below. This is a conda limitation on all platforms, we cannot solve this just by using the extended length prefix.]

@jaimergp Not sure where base32 comes from, it's sha256 + some other stuff:

@property
def build_key(self):
    """A conda environment build is a function of the sha256 of the
    environment.yaml along with the time that the package was built.

    The last two parts of the build key are to assist finding the
    record in the database.

    The build key should be a key that allows for the environment
    build to be easily identified and found in the database.
    """
    datetime_format = "%Y%m%d-%H%M%S-%f"
    return f"{self.specification.sha256}-{self.scheduled_on.strftime(datetime_format)}-{self.id}-{self.specification.name}"

In https://github.com/conda-incubator/conda-store/issues/588, Aaron shows this example:

ERROR:root:[WinError 206] The filename or extension is too long: 'C:\\Users\\Aaron\\Quansight\\conda-store\\conda-store-server\\conda-store-state\\default\\99108419ad0fd922fdeff9bbc434b58d41f68e3f923a83f6a7ab19568463bc82-20230915-201619-848782-1-test\\Lib\\site-packages\\sympy\\parsing\\autolev\\test-examples\\pydy-example-repo'

Yes, our build_key contains redundant information, state directories could also use shorter names. However, even if we somehow had a small build_key that cannot exceed a certain size, part of the issue is the internal package directory structure, which we don't control. Yes, we can take a look at all existing packages and infer the current max directory path length, but it might still break one day.

IMO, a better solution, which shouldn't require changing much and won't affect other platforms is using the extended-length path prefix "\\?\" (32,767 chars max), as documented here:

https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation

IIUC, this will cause Windows to use different APIs, which don't have this legacy path limit. Also, based on what've read, the support for this should go back to Windows 7, but it needs to be verified. The only issue I can think of is not being able to browse these paths using Explorer, but I think that was only a problem on XP, and it was updated to use different APIs on Windows 7, but again, this needs to be verified. Also, it's not something that should regularly happen as these are internal. Finally, these extended paths are supposed to be absolute (cannot use . or ..), but this doesn't seem like a problem for us.

A number of projects seem to be using this approach:

So I'm going to explore the extended-length path prefix "\\?\" first. [UPD: Nope, we'll need to change the build_key - see below.]

Not going to explore a different build_key for now, but it raises a number of questions if we decide to change it:

Requirements for new build_key if we decide to change it:

nkaretnikov commented 1 year ago

I forgot to mention advantages of the extended path prefix "\\?\":

Going to write a PR and report if anything breaks. [UPD: Going to shorten the build_key instead.]

nkaretnikov commented 1 year ago

OK, nevermind. Jaime told me this is needed because of point 1 here: https://github.com/conda-incubator/conda-store/issues/588#issuecomment-1788703931. So there's no other way than to shorten the build_key.

conda-build will use a default prefix placeholder of 255 characters. This means that the prefix base path (e.g. ~/.local/conda-store/envs/env-name-1234abcd) always needs to be under 255 characters. There should be a check that validates the configuration and ensures that base installation path + intermediate directories + default length of the hashed environment <= 255. This is for all platforms.

Repro: https://github.com/conda-incubator/conda-store/issues/588#issuecomment-1751556643

nkaretnikov commented 1 year ago

@jaimergp I've thought more about this in the context of the conda prefix. Let's look at Aaron's example, which exceeded the limit:

/private/var/folders/wc/dppcpmxs1tlb36nqcw853wkm0000gn/T/pytest-of-aaronmeurer/pytest-12/test_conda_store_register_envi0/conda-store-state/pytest-namespace/35f604188f69ceb5d9e3fae2c93ffd48c2971d192f21c776e3ebfed7e1196868-20231005-205956-556312-1-pytest-name

Summary: there are many things that are non-constant and can exceed the prefix length.

I think the long-term solution should be similar to what Nix uses: https://nixos.org/guides/nix-pills/nix-store-paths. The ideal scheme should be: ROOT + constant value (hash). We can instruct users to have a shorter ROOT and we'll know exactly how long that ROOT can be to not exceed the prefix (determined by the hash length). The main question here is how to select this hash. Can it be just the hash of the conda lockfile? IIUC, lockfiles are supposed to be deterministic. This will also allow for sharing when multiple users end up building the same env. However, we need to be careful about access to this shared resource (such as on deletion). I'll think more about this, but I welcome comments on this idea.

For now, I'll implement the truncated hash + timestamp idea, but it's a bandaid rather than a real solution. We should use something else long-term.

UPD: More info on Nix hashes: https://nixos.wiki/wiki/Nix_Hash

asmeurer commented 1 year ago

Conda itself would need to be updated to use \\?\. And anyway it could always happen that some other package gets installed that needs to access paths in the install prefix and doesn't use this API. The more robust fix is to use something shorter for the hash+timestamp.

nkaretnikov commented 12 months ago

Status update: submitted a WIP PR but need to address review feedback: https://github.com/conda-incubator/conda-store/pull/652.

nkaretnikov commented 11 months ago

Opened a separate issue for Nix-like hashes: https://github.com/conda-incubator/conda-store/issues/678