Lazy artifact without unpacking (non-tarball)

jonas-schulze commented 2 years ago

I would like to "deliver" a .mat data set that I need during package testing as an artifact. The data set happens to be hosted already, though as is and not as a tar.gz. IIRC .mat files support compression on their own, so wrapping them in a tar.gz feels odd.

How do I declare a lazy artifact (containing only a single file) that doesn't need to be unpacked?

If that's not possible (yet), I would like to propose adding a new keyword unpack (default: true, which matches the current behavior) to the Artifacts.toml.

Somewhat related:

https://github.com/JuliaLang/Pkg.jl/issues/1467 (suggested to me that unpacking does not always happen)
https://github.com/JuliaLang/Pkg.jl/issues/1950 (would need the keyword as well)
https://discourse.julialang.org/t/creating-artifacts-toml-for-existing-tarball/33365

DilumAluthge commented 2 years ago

The DataDeps.jl package (https://github.com/oxinabox/DataDeps.jl) might be a good solution for your use case.

KristofferC commented 2 years ago

In theory we could look at the magic bytes to see if it is a gzipped file, otherwise, assume it is uncompressed..?
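
A minimal sketch of such a magic-byte check, assuming only that gzip streams begin with the two bytes 0x1f 0x8b. The helper name `looks_gzipped` is hypothetical and not part of Pkg.

```julia
# Hypothetical helper, not part of Pkg: guess whether a file is gzip-compressed
# by inspecting its first two bytes (gzip streams start with 0x1f 0x8b).
function looks_gzipped(path::AbstractString)
    open(path, "r") do io
        magic = read(io, 2)   # yields fewer than 2 bytes for very short files
        return magic == UInt8[0x1f, 0x8b]
    end
end
```

A file for which this returns false would be treated as an uncompressed payload under this idea; deciding what to do with it (tarball or plain file) is a separate question.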

StefanKarpinski commented 2 years ago

Another layer of compression shouldn't really hurt though, and you can use gzip -1 to minimize the effort.

jonas-schulze commented 2 years ago

Ref https://github.com/oxinabox/DataDeps.jl/issues/113

KristofferC commented 2 years ago

"Another layer of compression shouldn't really hurt though"

But then you need to rehost the files.

simonbyrne commented 1 year ago

Agreed, there are many dataset hosting providers which expect you to upload the file directly, rather than uploading a tarball wrapping a file.

StefanKarpinski commented 1 year ago

If we allow artifacts to be arbitrary container and non-container formats with arbitrary compression schemes, there's really a never-ending stream of features that would have to be added, which is not something I think it's acceptable to do with a feature like artifacts that's built into the package manager.

Consider something apparently simple like allowing artifacts to be just a single file. This seems straightforward enough: you just use the git blob hash of the file as its content address and put the file at the artifact path instead of an extracted artifact directory like we do currently. So the path to this file will be something like ~/.julia/artifacts/a01fab9ad601903eaa0290a41c6a796525313337. However, many use cases of files require that the file name have a correct extension and a reasonable file name like data.mat. The current answer to that is genuinely simple if not always convenient: the artifact is a directory containing the single file data.mat. If we're trying to support an artifact being a single file with this extension/file name requirement, we'd need to start adding features: in this case an option to say that the actual path to the artifact is inside the usual top-level location at ~/.julia/artifacts/a01fab9ad601903eaa0290a41c6a796525313337/data.mat. But then the artifact isn't actually content-addressed anymore: you need to know the content hash and the path inside of the directory, and if two different artifact files referenced the same content address with a different hash, then they could extract the data to a different location. So even something simple like "let an artifact be a single file" leads to a whole can of worms. The simplest option is just to require it to be a directory, which is what we've done.
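
To make the blob-versus-directory distinction concrete, here is a minimal sketch that computes both hashes by hand, assuming the standard git object layout (a "blob"/"tree" header followed by the payload). The function names are hypothetical, and the tree case only covers a directory with exactly one regular, non-executable file; Pkg's own tree hashing handles more than that.

```julia
using SHA

# Git blob hash: SHA-1 over the header "blob <size>\0" followed by the content.
git_blob_sha1(data::Vector{UInt8}) =
    sha1(vcat(Vector{UInt8}("blob $(length(data))\0"), data))

# Git tree hash of a directory holding exactly one regular, non-executable file.
# The single tree entry is "100644 <name>\0" followed by the raw 20-byte blob hash.
function git_tree_sha1_single_file(name::AbstractString, data::Vector{UInt8})
    entry = vcat(Vector{UInt8}("100644 $name\0"), git_blob_sha1(data))
    return sha1(vcat(Vector{UInt8}("tree $(length(entry))\0"), entry))
end

data = Vector{UInt8}("hello world\n")
println("blob:               ", bytes2hex(git_blob_sha1(data)))
println("tree (as data.mat): ", bytes2hex(git_tree_sha1_single_file("data.mat", data)))
println("tree (as data.txt): ", bytes2hex(git_tree_sha1_single_file("data.txt", data)))
```

The blob hash depends only on the bytes, while the tree hash also covers the file name, so the same content stored under two names yields two different directory hashes.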

Different compression and container formats are more reasonable, imo, since they only complicate the model of how to deliver an artifact, rather than complicating the model of what an artifact is. The main issue with that is that Pkg needs to be able to extract other container formats. Julia is shipped with the dependencies required to decompress and extract tarballs, but we don't really want to add more dependencies to Julia for every format someone happens to want to use. But we could have a plugin system where a download stanza specifies a registered package/function for handling the content of the download stanza, and then lets the package acquire the artifact content however one wants.

For example, we could support downloading a single file something like this:

[data_mat]
git-tree-sha1 = "83f7499f0e79ac39a1a34d3e6ac119f5389ee66d"

    [[data_mat.download]]
    plugin = "FileArtifacts"
    url = "https://example.com/path/to/data.mat"
    sha256 = "ab2332e1005836afb236bf8515adf1b0522b640a51c9b8a401d64e3f5fc4478c"

What this would do is use the package called FileArtifacts (which must appear in the Project.toml file of the package where the Artifacts.toml file lives) to download the data_mat artifact. It would do the following:

Download the URL https://example.com/path/to/data.mat
Check that the SHA256 hash of the file is ab23...4478c
Save the file as data.mat (derived from the URL) in an empty directory
Compute the git-tree-sha1 of the directory (not the file) and make sure it's 83f7...e66d
Install the artifact directory at ~/.julia/artifacts/83f7...e66d

The end result is that data.mat can be found at ~/.julia/artifacts/83f7...e66d/data.mat. People could implement artifact downloaders for zip files, different compression formats, etc.
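
The steps above can be sketched with existing functions from Pkg.Artifacts, Downloads, and SHA. This is not the proposed plugin mechanism (which does not exist), only an illustration of the same sequence done by hand; `install_file_artifact` and `install_url_artifact` are hypothetical names.

```julia
using Downloads, SHA
using Pkg.Artifacts: create_artifact, artifact_path, remove_artifact

# Steps 2-5 for a file that is already on disk. `name` is the file name the
# data should have inside the artifact directory (hypothetical helper).
function install_file_artifact(file::AbstractString, name::AbstractString,
                               expected_sha256::AbstractString,
                               expected_tree_sha1::Union{Nothing,AbstractString}=nothing)
    # Step 2: verify the SHA256 of the downloaded file.
    actual = bytes2hex(open(sha256, file))
    actual == expected_sha256 || error("SHA256 mismatch: got $actual")
    # Steps 3-5: create_artifact fills a scratch directory, computes its
    # git-tree-sha1, and installs it into the artifacts directory.
    tree = create_artifact() do dir
        cp(file, joinpath(dir, name))
    end
    # Step 4 check: compare against the hash declared in Artifacts.toml, if given.
    if expected_tree_sha1 !== nothing && tree != Base.SHA1(expected_tree_sha1)
        remove_artifact(tree)
        error("git-tree-sha1 mismatch: got $tree")
    end
    return joinpath(artifact_path(tree), name)
end

# Step 1 plus the rest: download, then install under the URL's base name.
function install_url_artifact(url::AbstractString, sha256_hex::AbstractString,
                              tree_sha1_hex=nothing)
    tmp = Downloads.download(url)
    try
        return install_file_artifact(tmp, basename(url), sha256_hex, tree_sha1_hex)
    finally
        rm(tmp; force=true)
    end
end
```

With the values from the stanza above, a call would look like `install_url_artifact("https://example.com/path/to/data.mat", "ab2332e1005836afb236bf8515adf1b0522b640a51c9b8a401d64e3f5fc4478c", "83f7499f0e79ac39a1a34d3e6ac119f5389ee66d")`, returning the path to data.mat inside the installed artifact directory.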

This is the way forward, but I'm not sure I really want to do this. Among other things, this would entail either not serving such artifacts through the package server system, or running arbitrary package code for artifact downloading in the package server system. Neither option is super appealing to me. We could maybe approve specific packages as "blessed" downloaders that we allow running on the package servers.

jonas-schulze commented 1 year ago

"But then the artifact isn't actually content-addressed anymore: you need to know the content hash and the path inside of the directory, [...]"

Isn't this exactly what is already required from a user's perspective? In order to access anything from an artifact, the user has to call joinpath(artifact"foo", "data.mat"). Here, artifact"foo" resolves to the content-addressed hash of the directory and data.mat is the object within it that the user is actually interested in.

"[...] and if two different artifact files referenced the same content address with a different hash, then they could extract the data to a different location."

I think I don't quite understand what you mean. If two artifact files (does this refer to "descriptors", i.e. artifact"foo" and artifact"bar"?) refer to the same content, they will by design resolve to the same hash, won't they?

The considerations you described sound more like implementation details to me -- no offense. All I am asking for is an option to skip a certain part of the download/registration/creation process of an artifact, namely archive inflation. I am not questioning what an artifact is. An artifact remains a single file before and during download (a compressed or uncompressed tarball, or an arbitrary file) which becomes a content-addressed directory. This doesn't change at all. And from a user's perspective it doesn't change either. The user shouldn't need to care how the content hash comes to be, because a user never gets in touch with it anyway. This is a detail hidden within artifact"foo", as it should be.

JuliaLang / Pkg.jl

Lazy artifact without unpacking (non-tarball) #2764