Closed staticfloat closed 4 years ago
It would be nice if Pkg.package_state_dir(@__MODULE__) depended on package options (https://github.com/JuliaLang/Juleps/issues/38), or were at least designed in such a way that it can depend on arbitrary key-value pairs of strings.
I would prefer that there is no Pkg.build() and instead each package is responsible for checking the existence of files within __init__()
Would it work with Julia packages that require external packages at precompile time? Those packages may need to do some precompile-time metaprogramming to, e.g., define structs depending on the C ABI of the external package (PyCall does this).
It would be nice if Pkg.package_state_dir(@__MODULE__) depended on package options (JuliaLang/Juleps#38), or were at least designed in such a way that it can depend on arbitrary key-value pairs of strings.
That's a very good thought; I'm going to add it on to the top issue. Probably the whole dict of options gets hashed and mixed in with the other elements determining the storage location.
Would it work with Julia packages that require external packages at precompile time? Those packages may need to do some precompile-time metaprogramming to, e.g., define structs depending on the C ABI of the external package (PyCall does this).
Dependent packages should be loaded by the time you __init__() a package. __init__() is run after precompilation; it's a runtime function.
the whole dict of options gets hashed
My thoughts exactly!
Dependent packages should be loaded by the time you __init__() a package.
I guess I mistakenly assumed that external libraries are somehow loaded during __init__(), since you are talking about getting rid of Pkg.build().
Ah, I see what you mean. You want to be sure that e.g. libPython is loadable before PyCall.jl runs its __init__() method. Yes, this should be handled by the next step of BinaryBuilder work that I'm doing: essentially separating binary dependencies out into their own packages (we're calling them .jll packages) so they would be fully initialized by the time __init__() begins for any dependent packages (e.g. PyCall.jl depends on python.jll, so by the time __init__() gets called within PyCall.jl, the python.jll package has had the opportunity to download, install and dlopen() its libPython).
I found https://github.com/JuliaPackaging/BinaryBuilder.jl/wiki/Roadmap after writing the post below. It looks like PyCall.jl can just do something like using LibPython.jll: libpython to get the handle. So I guess the following pattern is supported.
Can I access information about the python.jll package at precompile time of PyCall.jl? PyCall.jl needs to define struct layouts depending on the Python version. For example:
struct PyDateTime_CAPI
    # type objects:
    DateType::PyPtr
    DateTimeType::PyPtr
    TimeType::PyPtr
    DeltaType::PyPtr
    TZInfoType::PyPtr
    @static if pyversion >= v"3.7"
        TimeZone_UTC::PyPtr
    end
    # ... and so on ...
end
where pyversion is obtained by calling the Python C API at precompile time:
const pyversion = vparse(split(Py_GetVersion(libpy_handle))[1])
PyCall.jl also inspects libpython with hassym at precompile time.
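To make the precompile-time dependency concrete, here is a minimal sketch of a version-dependent struct layout. It assumes a hard-coded `pyversion` and a `PyPtr` alias as stand-ins; in PyCall both come from the real libpython, and the struct name `DateTimeCAPISketch` is made up for illustration.

```julia
# Hypothetical stand-ins: in PyCall, pyversion is obtained by querying
# libpython; here it is hard-coded so the sketch runs on its own.
const PyPtr = Ptr{Cvoid}
const pyversion = v"3.7.0"

struct DateTimeCAPISketch
    DateType::PyPtr
    DateTimeType::PyPtr
    # @static evaluates the condition at macro-expansion time, so the field
    # list is fixed when the struct is defined, i.e. during precompilation.
    @static if pyversion >= v"3.7"
        TimeZone_UTC::PyPtr
    end
end

fieldnames(DateTimeCAPISketch)
```

Because the layout is decided when the struct is defined, whatever supplies `pyversion` has to be usable while the package is being precompiled, not only when `__init__()` runs.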
Yes, that’s right.
On Thu, Oct 11, 2018 at 09:26 Takafumi Arakaki wrote: Can I access information about python.jll package at precompile time of PyCall.jl?
Note that BinaryBuilder will probably never be a reasonable option for PyCall. You use Python for the ecosystem, not just libpython, and so we need to have access to a full-featured Python distro like Anaconda.
But we still need persistent per-package options, e.g. PyCall should be able to remember what python you configured it to use. (Currently, in Julia 1.0, it forgets your configuration every time you update PyCall because Pkg fetches a fresh directory.) Package options should go into the Project.toml, probably?
I'm going to respond with https://github.com/JuliaLang/Pkg.jl/issues/777 in mind here:
On the one hand, this issue explicitly does not want to share state between different versions of packages, because the intended use case is for different .jll package versions to be potentially completely different binary versions. On the other hand, it would be really nice to avoid needing to download and install stuff twice.
Oh if only we had some way of uniquely identifying the content we want to download/store, and we could use that unique identifier to key us into a directory! Oh wait, that's Stefan's content-addressable filesystem idea. So, new API idea that might satisfy everyone here: just pass a hash to Pkg.package_state_dir(), and what you use to build that hash determines the lifecycle/sharing of your data. Examples (I'm using hash() here as pseudo-code; not anything concrete):
- Pkg.package_state_dir(hash(basename(pathof(@__MODULE__)))): Scratch space that is shared across all installations of this package.
- Pkg.package_state_dir(hash(libfoo_tarball_hash, libbar_tarball_hash)): My .jll package requests space keyed off of the content hashes of the tarballs I'm going to extract into it.
- Pkg.package_state_dir(hash(basename(pathof(@__MODULE__)), version.major, version.minor)): SemVer-aware grouping.
- Pkg.package_state_dir(hash(basename(pathof(@__MODULE__)), options_dict)): Options-keyed directory.
I think it makes a lot of sense to "deduplicate" based on a hash that the user passes to package_state_dir().
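A rough sketch of how caller-chosen keys could determine sharing. Everything here is hypothetical: `STATE_ROOT` and `state_dir_sketch` are illustrative names, and a temporary directory stands in for wherever Pkg would actually keep state.

```julia
# Hypothetical root; a real implementation would live somewhere in the depot.
const STATE_ROOT = mktempdir()

# Fold every key into one hash and use its hex form as the directory name.
function state_dir_sketch(parts...)
    h = foldl((acc, p) -> hash(p, acc), parts; init = zero(UInt))
    dir = joinpath(STATE_ROOT, string(h, base = 16))
    mkpath(dir)
    return dir
end

shared  = state_dir_sketch("Foo")        # shared by every installation of Foo
semver  = state_dir_sketch("Foo", 1, 2)  # one directory per major.minor
options = state_dir_sketch("Foo", Dict("python" => "/usr/bin/python3"))
```

Callers that pass equal keys land in the same directory, and any difference in the keys gives a fresh one. One caveat: `Base.hash` values are not guaranteed to be stable across Julia versions, so a real implementation would probably want a hash with a stable definition.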
You use Python for the ecosystem, not just libpython
@stevengj If you want only Python packages, I think BinaryBuilder could be a reasonable option to install the python command and libpython. Once the python command is installed, a reproducible Python environment can be constructed using Pipenv (which can already be done if https://github.com/JuliaPy/PyCall.jl/pull/578 is merged). Pipenv is much closer to Pkg3 in design. You don't need to manage mutable state yourself, and all the data needed to reproduce the Python environment is in two text files (actually JSON and TOML).
However, this only covers Python packages available from PyPI. For example, you can't install Node.js from PyPI (which is required for installing JupyterLab extensions). But this can probably be covered by BinaryBuilder directly?
Pkg.package_state_dir(hash(basename(pathof(@__MODULE__)))): Scratch space that is shared across all installations of this package.
@staticfloat Yeah, that's what I was thinking when connecting this to #777. Maybe it could be Pkg.package_state_dir(hash(Base.PkgId(@__MODULE__))) but the idea is essentially the same.
Base.PkgId(@__MODULE__)
Yes, that is clearly superior. :)
Ah, I forgot another benefit of this; right now we get shared package state when you dev Foo within the default environment from two different Julia versions. While the resolver will check to make sure that the Julia code is marked as satisfying all constraints, this can cause serious problems with two different versions of Julia built with two different versions of GCC.
So it's important to not only allow the user to specify what the package_state_dir() is keyed on, but also make it easy to (and probably default to) key off of Julia ABI stuff.
Updated API proposal:
Pkg.package_state_dir(things_to_be_hashed...; include_version::Bool = true, include_ABI::Bool = true)
Where things_to_be_hashed gets intelligently combined through a hash function, and the flags signify inclusion of information about Julia's version and ABI. These would be true by default, but if set to false then a package's state directory could be shared across Julia versions (within the same environment). Fleshing this out a little bit more, hashing should be fine with the UInt64-based hashes we use with hash() in Base (to get a 1-in-a-million chance of a collision, you need to have 6 million packages installed), so we could define this as something similar to:
function package_state_dir(things_to_be_hashed...; include_version::Bool = true, include_ABI::Bool = true)
    h = UInt64(0)
    for t in things_to_be_hashed
        h = hash(t, h)
    end
    if include_version
        h = hash(Base.VERSION, h)
    end
    if include_ABI
        # We would perhaps want to integrate this logic into Pkg
        h = hash(BinaryProvider.triplet(BinaryProvider.platform_key_abi()), h)
    end
    return joinpath(Pkg.data_dir(), string(h, base=16))
end
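The "6 million packages" figure follows from the usual birthday approximation, which can be checked with a few lines. This assumes hash outputs behave like uniformly random 64-bit values; `collision_threshold` is a name made up for this sketch.

```julia
# Birthday approximation: with n random b-bit hashes,
# P(collision) ≈ n^2 / 2^(b + 1). Solving for n at a target probability p
# gives n ≈ sqrt(2^(b + 1) * p).
collision_threshold(p; bits = 64) = sqrt(2.0^(bits + 1) * p)

# Number of state directories at which the collision chance reaches 1e-6.
n = collision_threshold(1e-6)
```

For 64-bit hashes this comes out at a little over six million directories, consistent with the estimate above.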
@staticfloat Actually, using Pkg.package_state_dir for both BinaryProvider and (say) Conda would make it hard to do this:
instead of nuking ~/.julia entirely (as some still do to try and fix stale state problems) they could instead nuke ~/.julia/package_state
because re-installing conda takes more time than downloading some binaries. Also, Conda.jl currently has no mechanism for re-creating the same environment.
It's probably better to have two kinds of state directories, like XDG_DATA_HOME (default: ~/.local/share) and XDG_CACHE_HOME (default: ~/.cache) (and maybe also something similar to /var/ for, e.g., *.log and *.jl.mem). For example, call them ~/.julia/data and ~/.julia/cache (where ~/.julia would be replaced by DEPOT_PATH[1] in the real case). The distinction is that wiping out ~/.julia/cache is safe in the sense that ]instantiate (or something similar) brings you back to an equivalent environment, while there is no such guarantee for ~/.julia/data. The directory ~/.julia/data is for, e.g., highly mutable data like Conda.jl's and login authentication data for GitHub integration. The directory ~/.julia/cache would be useful for BinaryProvider and also something like InstantiateFromURL.
I don't know whether "~/.julia/data" should be discussed here, but when deciding on names under ~/.julia it's probably better to keep in mind that there may be other kinds of data/state directories.
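A minimal sketch of the two-directory split, assuming the hypothetical names `cache` and `data` under the first depot. Neither name is an existing Pkg convention, and the helper functions are made up for illustration.

```julia
# Hypothetical layout: "cache" and "data" are the names floated above,
# not directories that Pkg currently defines.
depot_root() = isempty(DEPOT_PATH) ? joinpath(homedir(), ".julia") : first(DEPOT_PATH)

# Regenerable content (downloaded binaries, build products): safe to delete.
cache_dir(pkg::AbstractString) = joinpath(depot_root(), "cache", pkg)

# Content that cannot be re-derived (auth tokens, user configuration).
data_dir(pkg::AbstractString) = joinpath(depot_root(), "data", pkg)
```

With such a split, wiping `cache_dir("Conda")` would only cost a re-download, while anything under `data_dir("GitHub")` would survive a cache reset.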
On the other hand, the above specification may sound overcomplicated (especially considering XDG compliance was rejected before). That's why I suggested #777; Pkg.jl could be agnostic about what each package does and just provide a scratch space for it. Each package can then implement its own state/data strategy, like package_state_dir.
But since BinaryProvider should be working closely with Pkg, that may not be optimal here. So maybe just forget about making this a public API and expose it to JuliaPackaging as a semi-public API?
because re-installing conda takes more time than downloading some binaries.
I don't think this is a good reason to make "clearing state" not clear the Conda installation data. It seems to me that Conda.jl installed packages should be treated exactly the same way as BinaryProvider-downloaded packages; I don't see a clear difference between them.
Right, that was not appropriate reasoning. What I was trying to point out was that there is information/data more important than external libraries. In the case of Conda.jl, that would be the version numbers and package origins. A conda environment can have something like Project.toml/Manifest.toml (the complication is that there is no easy way to do this in conda ATM). It would be a bad idea to store such information in the same directory where the binaries are stored. Another such example is authentication information: e.g., GitHub.jl can store an authentication token in some directory, but you wouldn't want to re-authenticate just because you cleaned the directory for the external libraries.
Stefan's notes from triage:
Mutable state in packages
- @staticfloat wants a way to generate artifacts outside of packages
- Let packages generate/access a workspace
- Workspace keyed by package UUID
- Packages often want a scratch space
Examples:
- big squashfs images need patching to match current user
- so caching of expensive work
- Conda.jl wants to persist across versions
- might be better to share between versions
“Lifecycled caches”: ~/.julia/caches
@StefanKarpinski: what if the workspace is a project?
- or maybe a project with an Artifacts.toml file
- the actual data goes in there as artifacts
Do we want levels of caching:
- mutable workspace needs to be per-user
- does a more system-wide cache make sense?
And with the official announcement of Scratch.jl, I think this can be closed. :)
I think it would be desirable to have packages use something similar to Pkg.package_state_dir(@__MODULE__) as the default location where e.g. binaries, datasets, etc. should be stored. This would be constructed as an overall per-environment state directory (overridable by an environment variable or environment config key perhaps), that then has hashed subdirectories similar to ~/.julia/packages but explicitly including information to disambiguate Julia OS, arch, calling ABI, GCC ABI and package options. This has multiple benefits:
- Packages become more "immutable". It would be lovely to be certain that the entire tree hash of a package directory inside of ~/.julia/packages never changes.
- Packages get automatically pushed toward greater relocatability. As recent experiments with PackageCompiler have shown, broad-spectrum usage of things like @__FILE__ and @__DIR__ should be discouraged anyway. A common use case is for creating a scratch space for binaries (e.g. <pkg dir>/deps/usr), but others exist (downloading datasets, generating Julia code, etc.). Forcing a runtime lookup based on @__MODULE__ is already what we need to do, so this would dovetail nicely.
- Pkg3 package resolution could technically be arch/OS agnostic. I'm imagining the nightmare scenario where Crazy Charlie has installed three copies of Julia 1.0, one with GCC 8 targeting x86_64, one with GCC 7 targeting x86_64 and one with GCC 6 targeting i686. Technically, we could share the actual Julia package directories, but when run from a Julia with a particular <arch>-<os>-<calling_abi>-<gcc_abi> (increasingly inaccurately named) triplet, the result of Pkg.package_state_dir(@__MODULE__) could mutate accordingly.
- The storage directory for mutable state could be decoupled from the storage for Julia code. Imagine a heterogeneous cluster with various CPUs and a shared package depot that is provided to all users; not only would the .ji files generated differ on each machine, but the downloaded binaries could differ as well (taking advantage of different ISAs). This separation between Julia code and build/run-time content would solve both the "must provide binaries that work on every platform globally" problem and the "I don't have permissions to modify packages placed in this depot" problem.
- This could make the "nuclear" state reset option a little easier for users; instead of nuking ~/.julia entirely (as some still do to try and fix stale state problems) they could instead nuke ~/.julia/package_state or whatever we default the location to. This would essentially cause a "rebuild" of every package, as if it were freshly installed, without doing more drastic things like losing the set of installed packages.
- This would allow for (in my mind) a cleaner workflow for managing package state than the current Pkg.build() system; I would prefer that there is no Pkg.build() and instead each package is responsible for checking the existence of files within __init__(); this can be done extremely quickly (e.g. isdir(joinpath(Pkg.package_state_dir(@__MODULE__), "usr"))) and should remove one more minor pain point in Pkg, the "This package was not properly installed, please run Pkg.build(<pkg name>)" error message.
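The `__init__()` check described in the last point could look roughly like the sketch below. `Pkg.package_state_dir` is only a proposal in this issue, so a temporary directory stands in for it; the module name `FooSketch` and the helper `install_binaries` are hypothetical.

```julia
module FooSketch

# Stand-in for the proposed Pkg.package_state_dir(@__MODULE__); filled in at
# load time because the location is only known at runtime.
const STATE_DIR = Ref{String}()

# Placeholder for downloading and unpacking binaries into the state directory.
install_binaries(prefix) = mkpath(joinpath(prefix, "usr"))

function __init__()
    STATE_DIR[] = mktempdir()
    # Cheap runtime check that takes the place of a separate Pkg.build() step.
    isdir(joinpath(STATE_DIR[], "usr")) || install_binaries(STATE_DIR[])
end

end # module
```

The install step only runs when the expected directory is missing, so a normal load pays for a single `isdir` call.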