JuliaLang / Pkg.jl

Pkg - Package manager for the Julia programming language
https://pkgdocs.julialang.org

Sequester mutable state outside of package directories #796

Closed staticfloat closed 4 years ago

staticfloat commented 6 years ago

I think it would be desirable to have packages use something similar to Pkg.package_state_dir(@__MODULE__) as the default location where e.g. binaries, datasets, etc. should be stored. This would be constructed as an overall per-environment state directory (overridable by an environment variable or environment config key perhaps), that then has hashed subdirectories similar to ~/.julia/packages but explicitly including information to disambiguate Julia OS, arch, calling ABI, GCC ABI and package options. This has multiple benefits;

tkf commented 6 years ago

when run from a Julia with a particular <arch>-<os>-<calling_abi>-<gcc_abi> (increasingly inaccurately named) triplet, the result of Pkg.package_state_dir(@__MODULE__) could mutate accordingly.

It would be nice if Pkg.package_state_dir(@__MODULE__) depends on package options https://github.com/JuliaLang/Juleps/issues/38 or at least designed in such a way that it can depend on arbitrary key-value pairs of strings.

I would prefer that there is no Pkg.build() and instead each package is responsible for checking the existence of files within __init__()

Would it work with Julia packages which require external packages at precompile time? Those packages may need to do some precompile-time metaprogramming to, e.g., define structs depending on the C ABI of the external package (PyCall does it).

staticfloat commented 6 years ago

It would be nice if Pkg.package_state_dir(@__MODULE__) depends on package options JuliaLang/Juleps#38 or at least designed in such a way that it can depend on arbitrary key-value pairs of strings.

That's a very good thought; I'm going to add it on to the top issue. Probably the whole dict of options gets hashed and mixed in with the other elements determining the storage location.

Would it work with Julia packages which require external packages at precompile time? Those packages may need to do some precompile-time metaprogramming to, e.g., define structs depending on the C ABI of the external package (PyCall does it).

Dependent packages should be loaded by the time you __init__() a package. __init__() is run after precompilation; it's a runtime function.
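As a minimal sketch of that split, assuming a hypothetical module DemoPkg and a hypothetical marker file: top-level constants are evaluated with the module body (at precompile time for a real package), while __init__() runs each time the module is loaded, so checks for on-disk state can live there.

```julia
module DemoPkg

# Evaluated once when the module body is evaluated; for a real package
# this happens at precompile time and the value is stored in the cache.
const DEFINED_AT = "module evaluation"

# Hypothetical location for this package's mutable state; a Ref so that
# it can be filled in at load time rather than frozen into the cache.
const STATE_DIR = Ref{String}("")

function __init__()
    # Runs every time the module is loaded, after its dependencies have
    # been loaded, so runtime checks such as "does my data exist?" go here.
    STATE_DIR[] = mktempdir()
    marker = joinpath(STATE_DIR[], "installed")
    isfile(marker) || touch(marker)
end

end # module

println(isfile(joinpath(DemoPkg.STATE_DIR[], "installed")))
```

The temporary directory only stands in for whatever location a real state-directory API would return.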

tkf commented 6 years ago

the whole dict of options gets hashed

My thoughts exactly!

Dependent packages should be loaded by the time you __init__() a package.

I guess I mistakenly assumed that external libraries would somehow be loaded during __init__(), since you are talking about getting rid of Pkg.build().

staticfloat commented 6 years ago

Ah, I see what you mean. You want to be sure that e.g. libPython is loadable before PyCall.jl runs its __init__() method. Yes, this should be handled by the next step of BinaryBuilder work that I'm doing; essentially separating binary dependencies out into their own packages (we're calling them .jll packages) so they would be fully initialized by the time __init__() begins for any dependent packages (e.g. PyCall.jl depends on python.jll, so by the time __init__() gets called within PyCall.jl the python.jll package has had the opportunity to download, install and dlopen() its libPython).

tkf commented 6 years ago

I found https://github.com/JuliaPackaging/BinaryBuilder.jl/wiki/Roadmap after writing the post below. It looks like PyCall.jl can just do something like using LibPython.jll: libpython to get the handle. So I guess the following pattern is supported.


Can I access information about the python.jll package at precompile time of PyCall.jl? PyCall.jl needs to define struct layouts depending on the Python version. For example:

struct PyDateTime_CAPI
    # type objects:
    DateType::PyPtr
    DateTimeType::PyPtr
    TimeType::PyPtr
    DeltaType::PyPtr
    TZInfoType::PyPtr

    @static if pyversion >= v"3.7"
        TimeZone_UTC::PyPtr
    end

    ... and so on ...
end

--- https://github.com/JuliaPy/PyCall.jl/blob/fb88f4d0df66fd2ce1bc4dc862611c355be0e50d/src/pydates.jl#L12-L35

where pyversion is obtained by calling the Python C API at precompile time:

const pyversion = vparse(split(Py_GetVersion(libpy_handle))[1])

--- https://github.com/JuliaPy/PyCall.jl/blob/fb88f4d0df66fd2ce1bc4dc862611c355be0e50d/src/startup.jl#L85

PyCall.jl also inspects libpython with hassym at precompile time.
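To make the constraint concrete, here is a minimal sketch in which a constant stands in for the version queried from libpython. The names fake_pyversion and DemoCAPI are hypothetical, not taken from PyCall. The point is that the condition is evaluated while the struct definition is expanded, so the value must already be known at that moment.

```julia
# Stand-in for a value discovered by querying an external library while
# the module is being precompiled (PyCall obtains it from libpython).
const fake_pyversion = v"3.7.1"

struct DemoCAPI
    DateType::Ptr{Cvoid}
    DeltaType::Ptr{Cvoid}
    # The condition is evaluated while the struct definition is being
    # expanded, so the field list is fixed before any __init__() runs.
    @static if fake_pyversion >= v"3.7"
        TimeZone_UTC::Ptr{Cvoid}
    end
end

println(fieldnames(DemoCAPI))
```

Anything that is only available once __init__() has run therefore arrives too late to influence such a layout.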

staticfloat commented 6 years ago

Yes, that’s right.

stevengj commented 6 years ago

Note that BinaryBuilder will probably never be a reasonable option for PyCall. You use Python for the ecosystem, not just libpython, and so we need to have access to a full-featured Python distro like Anaconda.

But we still need persistent per-package options, e.g. PyCall should be able to remember which Python you configured it to use. (Currently, in Julia 1.0, it forgets your configuration every time you update PyCall because Pkg fetches a fresh directory.) Package options should go into the Project.toml, probably?

staticfloat commented 6 years ago

I'm going to respond with https://github.com/JuliaLang/Pkg.jl/issues/777 in mind here:

On the one hand, this issue explicitly does not want to share state between different versions of packages, because the intended use case is for different .jll package versions to be potentially completely different binary versions. On the other hand, it would be really nice to avoid needing to download and install stuff twice.

Oh if only we had some way of uniquely identifying the content we want to download/store, and we could use that unique identifier to key us into a directory! Oh wait, that's Stefan's content-addressable filesystem idea. So, new API idea that might satisfy everyone here: just pass a hash to Pkg.package_state_dir(), and what you use to build that hash determines the lifecycle/sharing of your data. Examples (I'm using hash() here as pseudo-code; not anything concrete):

I think it makes a lot of sense to "deduplicate" based on a hash that the user passes to package_state_dir().
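A minimal sketch of how the choice of key decides what is shared, assuming a hypothetical state_dir helper in place of the proposed Pkg.package_state_dir() and a temporary directory as the root. Base.hash is used purely for illustration; its values are not guaranteed to be stable across Julia versions, so a real implementation would want a stable hash.

```julia
# Hypothetical stand-in for the proposed Pkg.package_state_dir(); `root` is
# an assumed parent directory, and Base.hash is used only for illustration.
state_dir(root::AbstractString, key) = joinpath(root, string(hash(key), base=16))

root = mktempdir()

# Keyed on the package name only: every version shares one directory.
shared_v1 = state_dir(root, "Foo")
shared_v2 = state_dir(root, "Foo")

# Keyed on name and version: each version gets its own directory.
per_v1 = state_dir(root, ("Foo", v"1.0.0"))
per_v2 = state_dir(root, ("Foo", v"2.0.0"))

# Keyed on a content identifier (for example a tarball checksum string):
# any package asking for the same content lands in the same place.
by_content = state_dir(root, "sha256:<checksum of the downloaded file>")

println(shared_v1 == shared_v2)
println(per_v1 == per_v2)
```

The caller, not the package manager, decides the lifecycle: a broader key means more sharing, a narrower key means more isolation.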

tkf commented 6 years ago

You use Python for the ecosystem, not just libpython

@stevengj If you want only Python packages, I think BinaryBuilder could be a reasonable option to install the python command and libpython. Once the python command is installed, reproducible Python environments can be constructed using Pipenv (which can already be done if https://github.com/JuliaPy/PyCall.jl/pull/578 is merged). Pipenv is much closer to Pkg3 in design. You don't need to treat mutable state yourself, and all the data needed to reproduce the Python environment is in two text files (actually JSON and TOML).

However, this is only for Python packages available from PyPI. For example, you can't install Node.js from PyPI (which is required for installing JupyterLab extensions). But this can probably be covered by BinaryBuilder directly?

Pkg.package_state_dir(hash(basename(pathof(@__MODULE__)))): Scratch space that is shared across all installations of this package.

@staticfloat Yeah, that's what I was thinking when connecting this to #777. Maybe it could be Pkg.package_state_dir(hash(Base.PkgId(@__MODULE__))) but the idea is essentially the same.

staticfloat commented 6 years ago

Base.PkgId(@__MODULE__)

Yes, that is clearly superior. :)

staticfloat commented 6 years ago

Ah, I forgot another benefit of this; right now we get shared package state when you dev Foo within the default environment from two different Julia versions. While the resolver will check to make sure that the Julia code is marked as satisfying all constraints, this can cause serious problems with two different versions of Julia built with two different versions of GCC.

So it's important to not only allow the user to specify what package_state_dir() is keyed on, but also make it easy to (and probably default to) key off of Julia ABI stuff.

Updated API proposal:

Pkg.package_state_dir(things_to_be_hashed...; include_version::Bool = true, include_ABI::Bool = true)

Where things_to_be_hashed gets intelligently combined through a hash function, and the flags signify inclusion of information about Julia's version and ABI. These would be true by default, but if set to false then a package could be shared across Julia versions (within the same environment). Fleshing this out a little bit more, hashing should be fine with the UInt64 based hashes we use with hash() in Base (to get a 1-in-a-million chance of a collision, you need to have 6 million packages installed), so we could define this as something similar to:

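function package_state_dir(things_to_be_hashed...; include_version::Bool = true, include_ABI::Bool = true)
    # Fold every caller-supplied key into one running hash
    h = UInt64(0)
    for t in things_to_be_hashed
        h = hash(t, h)
    end
    # Optionally separate state by Julia version
    if include_version
        h = hash(Base.VERSION, h)
    end
    # Optionally separate state by platform/ABI triplet
    if include_ABI
        # We would perhaps want to integrate this logic into Pkg
        h = hash(BinaryProvider.triplet(BinaryProvider.platform_key_abi()), h)
    end

    # Pkg.data_dir() stands for the per-environment state root proposed above
    return joinpath(Pkg.data_dir(), string(h, base=16))
end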
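
The collision figure quoted above can be sanity-checked with the birthday approximation, under the assumption that hash values are spread uniformly over all UInt64 values. The helper names below are made up for this sketch.

```julia
# Birthday approximation: with n items hashed uniformly into N buckets,
# the collision probability is roughly n^2 / (2N) while that value is small.
collision_probability(n, N) = n^2 / (2N)

# Number of items at which the approximate collision probability reaches p.
items_for_probability(p, N) = sqrt(2 * N * p)

N = 2.0^64                      # number of distinct UInt64 hash values
n = items_for_probability(1e-6, N)

println(round(Int, n))
println(collision_probability(6.0e6, N))
```

Under that assumption the threshold comes out slightly above six million directories, which agrees with the estimate in the comment.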

tkf commented 6 years ago

@staticfloat Actually, using Pkg.package_state_dir for both BinaryProvider and (say) Conda would make it hard to:

instead of nuking ~/.julia entirely (as some still do to try and fix stale state problems) they could instead nuke ~/.julia/package_state

because re-installing conda takes more time than downloading some binaries. Also, current Conda.jl has no mechanism for re-creating the same environment (at the moment).

It's probably better to have two kinds of state directories like XDG_DATA_HOME (default: ~/.local/share) and XDG_CACHE_HOME (default: ~/.cache) (and maybe also something similar to /var/ for e.g., *.log and *.jl.mem). For example, call them ~/.julia/data and ~/.julia/cache (where ~/.julia would be replaced by DEPOT_PATH[1] in the real case). The distinction is that wiping out ~/.julia/cache is safe in the sense that ]instantiate (or something) brings back an equivalent environment, while there is no such guarantee for ~/.julia/data. The directory ~/.julia/data is for, e.g., highly mutable data like Conda.jl's and login authentication data for GitHub integration. The directory ~/.julia/cache would be useful for BinaryProvider and also something like InstantiateFromURL.
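
A rough sketch of that split, assuming two hypothetical helpers, cache_dir and data_dir, neither of which exists in Pkg. The package names are also made up, and the depot is a temporary directory so the example does not touch a real ~/.julia.

```julia
# Hypothetical helpers; neither function exists in Pkg. `depot` defaults to
# the first entry of DEPOT_PATH, which is usually ~/.julia.
cache_dir(pkg::AbstractString; depot::AbstractString = first(DEPOT_PATH)) =
    joinpath(depot, "cache", pkg)
data_dir(pkg::AbstractString; depot::AbstractString = first(DEPOT_PATH)) =
    joinpath(depot, "data", pkg)

depot = mktempdir()             # throwaway depot so nothing real is touched

# Regenerable content (downloaded binaries) goes under cache/ ...
bin = mkpath(cache_dir("SomeBinaries"; depot = depot))
# ... while state that cannot be rebuilt (tokens, settings) goes under data/.
auth = mkpath(data_dir("SomeAuth"; depot = depot))
write(joinpath(auth, "token"), "secret")

# Clearing stale state only removes the cache tree.
rm(joinpath(depot, "cache"); recursive = true)

println(isdir(bin))
println(isfile(joinpath(auth, "token")))
```

With this layout, "nuke the stale state" becomes a single rm on the cache tree, and the data tree is left alone.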

I don't know if discussing "~/.julia/data" here is preferred, but it's probably better to keep in mind that there may be other kinds of data/state directories when deciding the name under ~/.julia.

On the other hand, the above specification may sound overcomplicated (especially considering XDG compliance was rejected before). That's why I suggested #777; Pkg.jl could be agnostic about what each package does and just provide a scratch space for it. Each package can then implement its own state/data strategy like package_state_dir.

But since BinaryProvider should be working closely with Pkg, that may not be optimal here. So maybe just forget about making this a public API and expose it to JuliaPackaging as a semi-public API?

staticfloat commented 6 years ago

because re-installing conda takes more time than downloading some binaries.

I don't think this is a good reason to make "clearing state" not clear the Conda installation data. It seems to me that Conda.jl installed packages should be treated exactly the same way as BinaryProvider-downloaded packages; I don't see a clear difference between them.

tkf commented 6 years ago

Right, that was not appropriate reasoning. What I was trying to point out was that there is information/data more important than external libraries. In the case of Conda.jl, that would be the version numbers and package origins. Although there is no direct easy way, a conda environment can have something like Project.toml/Manifest.toml (the complication was that there is no easy way to do this in conda ATM). It would be a bad idea to store such information in the same directory where the binaries are stored. Another such example is authentication information: e.g., GitHub.jl can store an authentication token in some directory, but you wouldn't want to re-authenticate just because you cleaned the directory for the external libraries.

fredrikekre commented 5 years ago

Stefan's notes from triage:

Mutable state in packages

  • @staticfloat wants a way to generate artifacts outside of packages
  • Let packages generate/access a workspace
  • Workspace keyed by package UUID
  • Packages often want a scratch space

Examples:

  • big squashfs images need patching to match current user
    • so caching of expensive work
  • Conda.jl wants to persist across versions
    • might be better to share between versions

“Lifecycled caches”: ~/.julia/caches

@StefanKarpinski: what if the workspace is a project?

  • or maybe a project with an Artifacts.toml file
  • the actual data goes in there as artifacts

Do we want levels of caching:

  • mutable workspace needs to be per-user
  • does a more system-wide cache make sense?

staticfloat commented 4 years ago

And with the official announcement of Scratch.jl, I think this can be closed. :)