Partial Artifacts - Githubissues

staticfloat commented 4 years ago

Problem statement

Artifacts are indivisible chunks of data. It would be useful to have the capability to slice them into unions of subtrees, to allow for flexible installation of partial artifacts. Motivating examples:

In JLL packages, there is a clear delineation between the data needed at (C/C++) compile time, and the data needed at runtime. Binaries and dynamic libraries are generally the only things needed at runtime, however when building a library that will link against it, the headers (and possibly even static libraries) are needed. If something goes wrong, debugging symbols can be very helpful. Having the ability to download only the pieces needed for the basic situation would be great. This use case could be fulfilled by simply generating a minimal, build and full (which includes debug files) artifact for each and every JLL package, but it would be wasteful of server storage and bandwidth, as if the debug information is stored in separate files, these artifacts are all strict subsets of eachother.
In distributed analysis, datasets are often split into multiple files, and each node may only need a small subset of files from the overall dataset. Having a single large artifact for the entire dataset is simple, but wasteful of time and bandwidth. Splitting the dataset into separate artifacts for each worker is inflexible, and splitting the dataset into thousands of small artifacts, allowing each worker to grab the ones it wants is bothersome.

Proposed solution

Artifacts are, in their most abstract sense, content trees that are identified by their treehash, typically with some extra information about where to download it from. A content tree contains within it many sub-content trees, and one can construct a very large number of trees that are unions of those sub-content trees. If the Pkg client can communicate to a Pkg server a desired union of subtrees, the Pkg server can generate a tarball for that union that has its own content treehash, and serve it to the client.

Specification of union trees

The Pkg client will request an artifact from the Pkg server, including with the request a set of sub-tree hashes. Each subtree hash must be a primal treehash, e.g. one that corresponds to a leaf or full subtree of the full content tree.

As an example, here is a file tree, with example file hashes at each node:

151e8ff64bf82449ba700f35800ccf4dd7fa6c6b Antic_jll
45979ab0ea4b5a6b75542451b1fa43157c7ed66d ├── include
fe0ea5902b21d3ce0e2ee321d34d97d848be6b60 │   ├── antic
0b8f4b014d8449c25d857966d7c9257bc47f97b4 │   │   ├── nf_elem.h
3b226dd64bf2c56ed76912182f7388fd3c28838d │   │   ├── nf.h
38d9b42383972a7a500861aa9079adb9f499e37c │   │   └── qfb.h
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 │   └── antic.h
2424fac4ebaedc111308e12465363e640cd1b7dd └── lib
c1d4423e4e089a0890f3f5d9677d7bbd56388423     ├── libantic.so -> libantic.so.0.0.1
c1d4423e4e089a0890f3f5d9677d7bbd56388423     ├── libantic.so.0 -> libantic.so.0.0.1
35938f6e1b7765b32cf6c2b014c24de3cc116b00     └── libantic.so.0.0.1

All hashes shown in the above diagram are primal treehashes; all unions must be expressed in terms of these primal treehashes. Therefore, it would be valid to request the union of:

3b226dd64bf2c56ed76912182f7388fd3c28838d
2424fac4ebaedc111308e12465363e640cd1b7dd

Which would result in the following union tree:

90a3a8c35da0eab2c30f33c699b42b3da8555263 Antic_jll
5981c69027c66fbbc08fab118231375795d5c7d7 ├── include
a10deb59bddaa1afe8247e38708073338abb16d1 │   └── antic
3b226dd64bf2c56ed76912182f7388fd3c28838d │       └── nf.h
2424fac4ebaedc111308e12465363e640cd1b7dd └── lib
c1d4423e4e089a0890f3f5d9677d7bbd56388423     ├── libantic.so -> libantic.so.0.0.1
c1d4423e4e089a0890f3f5d9677d7bbd56388423     ├── libantic.so.0 -> libantic.so.0.0.1
35938f6e1b7765b32cf6c2b014c24de3cc116b00     └── libantic.so.0.0.1

Note that we now have a bunch of new hashes on the left, but these new hashes are not primal hashes, and they cannot be used to request trees from the Pkg server. This limitation reduces the number of hashes that the Pkg server must know about to simply N instead of 2^N (where N is the number of files/directories within the full content tree).

Pkg Client Implementation

The Pkg client makes its request via a GET /artifact/$hash/partial request, where $hash is the root primal hash (e.g. 151e8ff64bf82449ba700f35800ccf4dd7fa6c6b in the example above, and the hash that is recorded in Artifacts.toml files today). The GET request includes a body of partial hashes, the response to this request is a tarball containing at least those hashes, or an error. On error, a Pkg client may attempt to download the full artifact.

When a Pkg client receives a tarball, it must verify it. In keeping with the rest of the Pkg server web of trust, we do not require sha256 verification of tarballs from Pkg servers (we only require strict tarball identity when downloading from unknown servers), but we always verify content tree hashes. When verifying a partial content tree hash, the Pkg client will only verify that the requested tree hashes are present tree; not that the overall content tree matches exactly. This enables two core behaviors:

The Pkg server is allowed to serve more than was asked for. This leaves open the option for Pkg Servers to serve an extremely similar but not identical partial artifact, in the event that it already has one cached and ready to serve. This can be used to, for instance, bound latency for users by starting to generate a partial artifact, and if it's taking too long, serve the next closest artifact on disk.
The Pkg client is able to "merge" multiple partial artifacts from the same root content tree. Taking the JLL example above, if a user has already installed the "full" variant of an artifact on disk, that variant would satisfy both the "minimal" and "build" variants, despite the fact that the content tree is not the same. Similarly, taking the distributed analysis example above, if a worker node has already downloaded dataset pieces A, C and D, and next needs to analyze pieces A and B, the presence of C and D within the same content tree should not interfere. We therefore make the assertion that, within a single artifact, extra pieces of data that still belong to the overall root content tree are never a problem.

In order to properly manage and verify all of this, metadata is needed. We will maintain a TOML file alongside the partial artifact directories, keeping track of what was installed by us and the primal hashes of all elements within. The organization of files on-disk will be:

~/.julia/artifacts/<root_hash>/...
~/.julia/artifacts/<root_hash>.toml

In such a manner, Pkg.gc() performs its same sweeping as before, only if it should remove a content tree because it is no longer reachable, it must now remove the corresponding metadata .toml file as well. When performing artifact installation operations, if a .toml file is present, it is first loaded and checked to see what pieces of the artifact must actually be downloaded and added into the artifacts/<root_hash>/ directory. If an artifact installation operation ever installs all elements into an artifacts/<root_hash>/ directory such that it is now complete, the .toml file is deleted.

Pkg Server implementation

When fetching a new resource from the storage server, a Pkg server would record its primal hashes in a .toml file that is stored alongside the tarball. Thereafter, when servicing partial artifact requests, it would load the .toml file, match the requested primal hashes to content path prefixes, then generate a new tarball using Tar.rewrite() with a predicate that includes only files that match the set of path prefixes requested. The resultant tarball would then be cached under its own content hash. Note that finding imperfect matches (e.g. already-cached tarballs that contain a superset of the requested hashes) in this organizational scheme is not easy, but this does not bother us. We leave room open in the protocol for the server to be clever/lazy, but we don't require it.

Misc. notes

Partial artifacts are only implementable through Pkg Server cooperation; if a Pkg client is unable to contact a Pkg server, it should fall back to simply fetching the entire artifact, unless a keyword argument force_partial is set to true. This prevents e.g. worker nodes on a cluster from falling back to fetching a multi-GB file when they only wanted a few 10's of MBs of tarballs. In this case, we assert that a cooperative Pkg server is willing to serve us a partial artifact, and we error out if this is not the case.
JLL packages may record a set of subtree hashes for each artifact, denoting the list of hashes that correspond to the minimal and build trees. When installing a package, by default, they would request the minimal set of hashes from the Pkg server. (If the Pkg server is unavailable, falling back to the full set is non-fatal; it's slightly larger but functionally it serves, with the possible exception of LLVM_jll, which is truly enormous when debug info is included 😉).
We already have plans to store some metadata alongside package installs, so-called "skeleton tarballs". If these metadata repositories contained hashes, we could use those instead of the extra .toml files above. The time spent hashing will already be performed in all cases, as we always verify tarball content hashes when downloading a new resource, so it would be a matter of extracting that information from our hashing routines, and storing the results in a tarball efficiently.

DilumAluthge commented 4 years ago

Partial artifacts are only implementable through Pkg Server cooperation; if a Pkg client is unable to contact a Pkg server, it should fall back to simply fetching the entire artifact, unless a keyword argument force_partial is set to true. This prevents e.g. worker nodes on a cluster from falling back to fetching a multi-GB file when they only wanted a few 10's of MBs of tarballs.

What happens if force_partial is set to true and the Pkg client is unable to contact a Pkg server? Presumably, we will throw a fatal error in that case?

staticfloat commented 4 years ago

Presumably, we will throw a fatal error in that case?

Correct, I have updated the comment to make that more clear.

johnnychen94 commented 4 years ago

Partial artifacts are only implementable through Pkg Server cooperation; if a Pkg client is unable to contact a Pkg server, it should fall back to simply fetching the entire artifact

The BFSU mirror is a pure static https server in storage protocol with no advanced Pkg protocol feature, IIUC this is still compatible, except that Pkg client would always download the full tarball since fetching partial tarballs would get a 404 code?

staticfloat commented 4 years ago

This is only proposing an alteration between the Pkg client and the Pkg server; the storage server is untouched by this. The Pkg server would always be downloading the full tarball from the storage server, then serving smaller chunks of it to Pkg clients.

johnnychen94 commented 4 years ago

Let me clarify it a bit since this kind of mirror isn't documented anywhere in Pkg.

The intersection of the storage protocol and pkg protocol becomes an https server with no advanced pkg protocol features. This makes it possible for a pkg client to directly connect to a storage server to download tarballs (the status quo); the BFSU mirror is used in such way: JULIA_PKG_SERVER=https://mirrors.bfsu.edu.cn/julia/static/ julia.

We built BFSU mirror this way because mirror site maintainers refuse the Pkg server solution; as they declared, there are real performance issues serving a mirror via reverse-proxy manner (e.g., Nexus) even if there is a caching mechanism. Check the server status and you'll find it's of very high bandwidth and IO pressure.

IIUC, advanced pkg server features such as diff, partial artifacts will be provided via HTTP request in form of $pkg_server/artifacts/$hash-$diff and $pkg_server/artifacts/$hash. "If what the pkg client connects to is a storage server, it would always receive a 404 code in such cases, and then immediately falls back to downloading the full tarball from the server." <-- I just want to make sure this routine still works. BFSU works great and it's currently the best solution for users in mainland China.

if a Pkg client is unable to contact a Pkg server

If this means receiving a 404 when fetching partial artifacts, then it's great because pkg client can't distinguish between whether it's a pkg server or storage server in this case. But if it means fetching and checking some data preserved only to pkg server(e.g., /stats), then it would break the current routine.

staticfloat commented 4 years ago

I just want to make sure this routine still works. BFSU works great and it's currently the best solution for users in mainland China.

While you can currently point a Pkg client at a static server and use it like this, (and this design will not change that) I do not think we will be able to always guarantee this. There's too much that we want to do with the Pkg servers to always guarantee that a static server will be sufficient. We won't go out of our way to break it, but I do think it likely that eventually the BFSU solution is going to stop working.

The BFSU could still be the backing storage server that other Pkg servers use, but that would require someone else to host a Pkg server.

I expect by the time we get that far, we will have done the paperwork to get proper Chinese servers.

c42f commented 4 years ago

Thanks @staticfloat for pointing me toward this issue.

In distributed analysis, datasets are often split into multiple files, and each node may only need a small subset of files from the overall dataset. Having a single large artifact for the entire dataset is simple, but wasteful of time and bandwidth. Splitting the dataset into separate artifacts for each worker is inflexible, and splitting the dataset into thousands of small artifacts, allowing each worker to grab the ones it wants is bothersome.

One thing I think is interesting and difficult is dynamic partitioning in distributed compute jobs where there is no natural partitioning of the data. For example, the input to a job may be a single huge file (eg, a batch of data from a sensor, or even a big CSV file with time series data). In these cases you would often like to read a portion of the data per compute node, but this partitioning is a property of the compute available, not a property of the data.

It's hard to cleanly deal with this in a model where the partitions are a property of the data set. A workaround is to have a compute job which runs before the main compute and partitions the data by creating a new partitioned dataset which the distributed compute can work on. But for large datasets copying data like this isn't very satisfying.

johnnychen94 commented 4 years ago

One thing that might be related, is to also support partial package under the same mechanisim.

Like Interpolations.jl (https://github.com/JuliaMath/Interpolations.jl/issues/372), many packages have bundled their documentation altogether in one big repo and release them. And this could contribute to a large part of the download size, although they're still relatively smaller than artifacts.

The ]up part could be largely solved by the diff feature, if implemented.

staticfloat commented 4 years ago

That Interpolations.jl is an interesting case, but remember that the client needs to know the subset of primal hashes it wants to request from the server before it does anything else; so you'd need to record the set of primal hashes you want a "partial package" to consist of within the Registry, which is a pretty big change. I think in general we're probably going to rely upon the diff feature for this, which we are still working on.

mkitti commented 3 years ago

I'm currently maintaining Interpolations.jl and am now following this issue.

JuliaLang / Pkg.jl

Partial Artifacts #1873

Problem statement

Proposed solution

Specification of union trees

Pkg Client Implementation

Pkg Server implementation

Misc. notes