fredrikekre opened 3 years ago
Thoughts about API:
Would it be better to extract to an IO handle instead of producing a byte vector? After all, it's easy to convert an IO handle to a byte vector (just call read on it), but if the file is large enough that you don't want to extract it all into memory, then a byte-vector API leaves you stuck with no way to avoid that.
It might be more general to extract the contents of multiple files to an IO handle based on a predicate. There are times when you might just want to concatenate the contents of all the files (e.g. to search them). The pattern could be a function, a regex matching the path name, or just a string, in which case that must be exactly the path name.
If the path is a directory in the tarball, we could extract the contents of all files under that directory. Or we could error.
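The three proposed pattern kinds (function, regex, exact path string) could be normalized into a single predicate via dispatch. A hypothetical helper to illustrate the idea, not part of the PR:

```julia
# Hypothetical helper: turn each proposed pattern type into a predicate
# on tarball path names.
to_predicate(f::Function) = f                        # already a predicate
to_predicate(r::Regex) = path -> occursin(r, path)   # regex match on the path
to_predicate(s::AbstractString) = path -> path == s  # exact path name only

matches = to_predicate(r"\.txt$")
matches("dir/file.txt")            # true: path ends in .txt
to_predicate("file.txt")("file.txt")  # true: exact match only
```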
Would it be better to extract to an IO handle instead of producing a byte vector?
Yea, I thought about that too so I will update it.
It might be more general to extract the contents of multiple files to an IO handle based on a predicate.
Okay. You might risk reading more files than you actually intended if the predicate is not strict enough, though.
True but if it's a string it will only ever match one file, which seems safe enough. We could return the number of files matched. Another thought I had was invoking a callback for each tarball entry with the IO handle of the data stream. But that may be a bit too low-level an API even though it is maximally flexible.
You mean that if you give a str::String as a predicate it would only be allowed to match one thing? In

```
$ tree .
.
├── dir
│   ├── elif.txt
│   └── file.txt
└── file.txt

1 directory, 3 files
```

would predicate = "file.txt" only match /file.txt?
Since the result is written to an IO handle instead of being returned, perhaps the return value could be something like an array of matched files and byte counts:

```julia
ret = [
    ("file1", nbytes1),
    ("file2", nbytes2),
]
```

That could be useful if you want to split the returned stream yourself.
Edit: It should just return all the headers that correspond to the file....
To me, giving a single file to extract makes the most sense. If you want to extract multiple files, you just write a loop over the (possibly filtered) files and pass in the same io object?
Fair enough. Is there an urgent need for this? I'm on parental leave until at least next month.
No, there is no urgency.
Trying this out a bit in Pkg, it allocates 40GB to read all files in the registry, and most of the time is spent on allocating buffer vectors. There should be a few low-hanging fruits.
Edit: Lowest hanging fruit removed, next one seems to be allocations for all the headers.
This is what I came up with for use in Pkg: https://gist.github.com/KristofferC/b354b654e776f4910b03985830418723
Here it is in action in Pkg: https://github.com/JuliaLang/Pkg.jl/pull/2431/. Ended up not using anything from this PR.
After trying to use this for a while, I don't think this API works very well. The problem is that to extract a single file you still need to parse through all headers (or at least until you find the file). So I think there needs to be some intermediate object created that does this for you and caches it (basically https://github.com/JuliaLang/Pkg.jl/pull/2431/files#diff-81abdf925a6c6e19622f1607e7777ae3bd3caabc942b73a7b33cbc0b96a9edff).
So my reading of that particular Pkg application is that rather than having a function to extract a single file at a time (calling that repeatedly makes the whole thing O(n^2)), you really want a way to iterate through the entire tarball and populate an in-memory representation of its contents. I'd also rather avoid exposing special types like InMemoryFileSystem and InMemoryFile, and maybe have an API that converts the tarball into a dict or pair-vector of paths to byte vectors or something like that. Or even return a Vector{Pair{Path, SubArray{UInt8, 1}}}: we can extract the entire tarball into a single memory buffer and then create subvectors that refer to parts of that vector. The idea is that extracting the tarball into memory should be fast, and then all you're really doing is building a simple index structure that refers to that extracted memory.
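The single-buffer-plus-views idea could be sketched like this (the function name and the precomputed path ranges are hypothetical stand-ins for what a tarball scan would produce):

```julia
# Sketch: extract the whole tarball into one byte vector, then build
# path => SubArray views into it instead of copying each file's contents.
function index_buffer(entries::Vector{Tuple{String,UnitRange{Int}}}, buf::Vector{UInt8})
    index = Pair{String,SubArray{UInt8,1}}[]
    for (path, range) in entries
        # `view` creates a window into `buf` without copying any bytes
        push!(index, path => view(buf, range))
    end
    return index
end

# Toy usage: two "files" laid out back-to-back in a single buffer.
buf = Vector{UInt8}("hello world")
idx = index_buffer([("a.txt", 1:5), ("b.txt", 7:11)], buf)
String(idx[1].second)  # "hello" — materialized only when you ask for it
```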
The reason for the way it is done in Pkg is to avoid having to read data that isn't used. So that's why it just reads the headers and remembers position + size and only reads the data (and caches it) on demand.
Regarding the special types, they are not meant to be "public", it was just a way I saw it convenient to structure the internals for the Pkg use case.
When we decompress the tarball, we've already read all the data, so we've paid that cost already. We might as well keep it around in memory: the only cost of doing so is the size of a decompressed General tarball, which is currently 25MB, not so big these days. If we use the SubArray{UInt8,1} approach, we only need the original buffer where the tarball was extracted and we never need to copy it. I'll code it up and we can see how it performs.
When we decompress the tarball, we've already read all the data
The tarball is decompressed once (when the registry is downloaded). Not every time you read the registry.
The tarball is decompressed once (when the registry is downloaded). Not every time you read the registry.
I was planning to store it compressed to save disk space (yes, we use a ton of disk space, but every byte counts). If we're going to decompress it on disk, then we could use mmap, access it like an in-memory array and let the kernel decide when to page the contents in/out of memory. That seems easier and more efficient than seeking around the tarball I/O handle.
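The mmap alternative mentioned above can be sketched with the Mmap standard library; a throwaway temp file stands in for the real decompressed registry tarball here:

```julia
using Mmap

# Write a dummy 1024-byte "tarball" to a temp file so the sketch is
# self-contained.
path, io = mktemp()
write(io, zeros(UInt8, 1024))
close(io)

# Map the file into memory: `data` behaves like a plain byte vector, but
# the kernel pages contents in and out on demand, so only regions you
# actually touch become resident.
data = Mmap.mmap(path, Vector{UInt8})

# Reading a 512-byte tar header block at some offset is then just indexing.
header_block = @view data[1:512]
```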
Currently we redo all the work to load the registry on each Pkg operation, which made sense when it was possible for the registry contents to change out from under you with each command. But when the registry is a single, immutable file, I think it makes more sense to cache the registry data structure more persistently. If the registry changes, it will be a different file entirely. So I think it makes sense to:
We could use a WeakRef to store the registry cache so that it gets cleaned up only when there is GC pressure. We could either use the size+mtime+ctime+inode of the compressed registry file as a cache key and reload when that changes or just flush the cache when some Pkg operation in this process has actually updated the registry and not otherwise. Or both.
Currently we redo all the work to load the registry on each Pkg operation, which made sense when it was possible for the registry contents to change out from under you with each command
After https://github.com/JuliaLang/Pkg.jl/pull/2438 the registry retrieved from PkgServer is considered immutable and it is "cached" through Pkg operations. It caches things based on the git-tree-sha of the registry.
Cache the parsed TOML data as well
That's been done for a while with
It is useful for when the registry is a git repo since we don't cache the registry itself then, but for pkg server registries (with the new cache), it doesn't really do anything since it will never even attempt to read the same TOML file twice.
Regarding decompression/compression, I think I measured it at around 0.1 seconds to decompress, so perhaps it's fine to keep it decompressed.
Here is an example implementation + benchmark of keeping things compressed:

```julia
using Downloads, Tar, Pkg
using Pkg: PlatformEngines.exe7z

reg_url = "https://pkg.julialang.org/registry/23338594-aafe-5451-b93e-139f81909106/28c601505a512da4e5abebc75cd448d48b8dac9c"
reg_file = "registry.tar.gz"
Downloads.download(reg_url, reg_file)

function untar_memory(tar_gz::AbstractString)
    data = Dict{String, Vector{UInt8}}()
    buf = Vector{UInt8}(undef, Tar.DEFAULT_BUFFER_SIZE)
    io = IOBuffer()
    open(`$(exe7z()) x $tar_gz -so`) do tar
        Tar.read_tarball(x -> true, tar; buf=buf) do hdr, _
            if hdr.type == :file
                Tar.read_data(tar, io; size=hdr.size, buf=buf)
                data[hdr.path] = take!(io)
            end
        end
    end
    return data
end

using BenchmarkTools
@btime untar_memory(reg_file)
```
Takes about 100ms.
Just doing @btime read(`$(exe7z()) x $reg_file -so`) takes about 60 ms.
That seems like a good approach then, doesn't it? Storing the data as a Dict{String,String} might be a little conceptually simpler and might be just as fast, since String(take!(io)) is zero-copy.
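The zero-copy point can be seen in a small self-contained snippet: String(take!(io)) adopts the IOBuffer's underlying byte vector rather than copying it.

```julia
# take! hands over the IOBuffer's byte vector, and the String constructor
# takes ownership of that vector's storage instead of copying it.
io = IOBuffer()
write(io, "some file contents")
s = String(take!(io))
s  # "some file contents"
```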
I keep coming back to this seemingly simple feature and getting caught up in all the unfortunate corner cases.
What we want this to do is something equivalent to extracting the tarball somewhere temporary and then reading a single file out (or a set of files based on some criterion). Seems simple enough: read through the tarball, skipping over anything that isn't the file you're interested in; when you find a header for the file you want, copy its data to the given IO handle. And that simple strategy is what this PR implements.
So what's the problem? Unfortunately, tar files can have multiple entries for the same output path. If you're extracting the tarball, that's not an issue: you extract each entry in order, each overwriting the previous one, and the last one is what remains at the end. But what happens with this implementation? We end up sending the contents of every version of the file to the IO handle, not just the last one. That's very much not the desired behavior of sending the contents of the file that would have been left after extraction. To make sure you extract only the last version of a file to the IO handle, you either have to be able to seek back and forth in the input tarball or you need to buffer the entire contents of the file you're extracting, which could take a lot of memory. If you want to not tax your RAM, it might be preferable to extract a selected set of paths to a temporary location and then read what you're interested in from there ... which is exactly what Tar.extract with a predicate does.
There's also the symlink issue, which is alluded to in the PR as a TODO note. This entails similar problems: if the symlink comes before the entry you want to extract, you can extract by scanning forward only, but if the symlink comes later in the tarball, you have to either go back and find the target file or have it buffered. This situation is worse, though, since the target file could be any file in the tarball, so you'd either have to be able to seek or you'd have to buffer everything. Again, to avoid potentially blowing out memory, you'd want to extract to a temp location and then read the contents afterwards.
So in the case where your tarball is seekable, you can implement this efficiently by reading to the end, keeping track of where each file's data can be found in the tarball, and then extracting the correct path contents at the end. However, a seekable tarball is atypical, since tarballs are usually compressed: input is being read either from a command doing decompression or from a decompressor stream object, neither of which is seekable. In the non-seekable case, the best you can do is extract the entire tarball to a temporary directory and stream the contents of the paths you're interested in to the output IO stream.
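The seekable-tarball strategy above amounts to a one-pass index where later entries for the same path overwrite earlier ones, mirroring what extraction would leave on disk. A sketch with simplified stand-in types (a real implementation would record these from tar headers):

```julia
# Where one file's data lives inside the tarball.
struct Entry
    offset::Int   # byte position of the file's data
    size::Int     # length of the data in bytes
end

# Scan all headers once; for duplicate paths, the last entry wins,
# exactly as it would if the tarball were extracted to disk.
function build_index(headers::Vector{Tuple{String,Int,Int}})
    index = Dict{String,Entry}()
    for (path, offset, size) in headers
        index[path] = Entry(offset, size)
    end
    return index
end

# "a.txt" appears twice: only the second occurrence survives.
idx = build_index([("a.txt", 512, 10), ("b.txt", 1536, 4), ("a.txt", 2048, 10)])
idx["a.txt"].offset  # 2048 — seek here, read 10 bytes, done
```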
Given the complications above, a better API might be Tar.extract_dict(tarball), which extracts in-memory to a Dict{String, String} like what Pkg uses. At least we know there's one very concrete use case for that. Not that there aren't concrete use cases for Tar.extract_file; it's just so challenging to implement efficiently because of the way the TAR format works. It's really not a good format.
Any chance that this might be included in Tar.jl soon? Part of my work involves sensitive data on systems with mainly shared file systems [1], so extracting this data to a temporary location on the file system is a no-go. I modified @KristofferC 's gist (https://gist.github.com/KristofferC/b354b654e776f4910b03985830418723) to handle compression: https://gist.github.com/JBlaschke/ec0f4ece417bdfe1428a85c78c0804f7 and it works pretty well.
I ran some benchmarks where I inflated 187 MB to 1.2GB, analyzed a single file (~800MB) and moved onto the next archive. My application's ram footprint varied from ~3GB to ~7GB -- which is not great, but not terrible either.
I think I can safely say that the workflow: i) extract the entire archive to a Dict{String, String} (overwriting same keys in the order in which the files were archived); ii) take the relevant data from (i); iii) move on to the next archive, is fine by me.
[1]: now that I describe it, it sounds way cooler than it actually is.
Are you asking about the extract_dict functionality or the originally proposed extract_file functionality?
Are you asking about the extract_dict functionality or the originally proposed extract_file functionality?
@StefanKarpinski I am asking for any functionality that will let me extract a given file from a tar archive into an IO buffer. I see that there are quite a few corner cases that might make extract_file impractical. Since many tar archives are also compressed, and the tar format itself requires reading through the entire archive even to extract a single file, extract_dict seems like a reasonable compromise -- especially if it makes progress on this issue.
I've just come across this, and do let me know if I'm misinterpreting the situation, but it looks like the two main challenges are:
I may be missing something, but it seems to me that one potential approach is to have a Tar.scan(tarball) -> ScannedTarball method, where ScannedTarball is a struct that records the positions and ranges of every file in the tarball. It would effectively be a wrapper type around a Dict{String, Pair{Int, Header}} (or similar). From this, one could then get an IOStream for a particular file, or many files, correctly and efficiently, after paying the scan-the-whole-tarball cost once.
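A hypothetical sketch of that shape (all names invented here, with a simplified offset => size index instead of full Header metadata): once the one-time scan has run, reading a single file is just a seek plus a bounded read.

```julia
# Records where each file's data lives within a seekable tarball on disk.
struct ScannedTarball
    path::String                       # the tarball file itself
    index::Dict{String,Pair{Int,Int}}  # file path => (data offset => size)
end

# Read one file's bytes without touching the rest of the tarball.
function readfile(st::ScannedTarball, name::String)
    offset, size = st.index[name]
    open(st.path) do io
        seek(io, offset)
        read(io, size)
    end
end

# Demo with a temp file standing in for a scanned tarball: the "file"
# a.txt occupies 5 bytes starting at offset 4.
path, io = mktemp()
write(io, "XXXXhelloYYYY")
close(io)
st = ScannedTarball(path, Dict("a.txt" => (4 => 5)))
String(readfile(st, "a.txt"))  # "hello"
```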
I do like the API of extracting the tarball to a dict, mapping paths that are files or symlinks to files to their content, and just not including non-files. That's already a functionality that Pkg is using, and it seems generally useful. I'm not sure about extracting an index; as pointed out, since tarballs are usually compressed, you'd have to decompress everything up to the point of the file you want to extract anyway. One option that @staticfloat and I have been kicking around is to have a footer block that contains some magic and an index of where to find each object in the tarball: if the footer exists, you can use it; if it doesn't, we could generate it once and then use that index, which I guess is roughly the proposed API. With regard to compression, it's actually possible to chunk the compression so that you can jump to the start of a specific item in a tarball without having to decode any of the previous chunks. It's a bit tricky, but we might be able to implement this special format, which has the advantage of allowing efficient random-access file extraction while still remaining readable by the classic gzcat and tar commands.
The footer idea is interesting, and a quick search turns up some prior art:
Of course, this only works when you're dealing with a modified tarball. I'd think a progressive-enhancement approach may be a good fit here: look for such a footer if it exists, and if not, fall back on scanning the tarball, with the option to create the footer as you suggest. For speed, I imagine an analogue of --fast-read could be supported for lazy indexing when you know you're dealing with a "well-formed" tarball, by supporting a "partial scan" state where, if a file is not found in the current index and the scan is incomplete, the scan continues until either the file is found or the end of the tarball is reached.
This seems straightforward to me with an uncompressed tarball, but as you note, a compressed tarball is harder. Chunked compression would help, but I wonder if it's possible to create "decompression checkpoints" over the course of the scanning that could be jumped to (for the non-chunked case). A bit like an iterate-ish version of mark (with regard to returning the state instead of modifying the underlying object).
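The "partial scan" state could be sketched like this, hedged heavily: all names are hypothetical, and a plain vector of (path, offset) pairs stands in for headers that haven't been read from the tarball yet.

```julia
# Keep a cursor into the not-yet-scanned headers and only advance it
# when a lookup misses the index built so far.
mutable struct PartialIndex
    index::Dict{String,Int}             # paths indexed so far => data offset
    entries::Vector{Tuple{String,Int}}  # stand-in for unread headers
    cursor::Int
end

PartialIndex(entries) = PartialIndex(Dict{String,Int}(), entries, 1)

# Return the offset for `name`, scanning only as far as necessary.
function lookup!(p::PartialIndex, name::String)
    haskey(p.index, name) && return p.index[name]
    while p.cursor <= length(p.entries)
        path, off = p.entries[p.cursor]
        p.cursor += 1
        p.index[path] = off
        path == name && return off
    end
    return nothing  # scan complete, file not in the tarball
end

p = PartialIndex([("a.txt", 512), ("b.txt", 1536), ("c.txt", 2048)])
lookup!(p, "b.txt")  # 1536 — a.txt was indexed in passing, c.txt untouched
```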
I think there should be a (very) strong use case before we start putting non-standard stuff in the tarballs. The number of people in this issue over two years suggests to me that there isn't really such a strong case.
In any case, functionality like that in https://github.com/devsnd/tarindexer seems preferable to me where you have an external index since you don't mess with the tarball at all then.
For reference, the use case that's brought me to this issue is wanting to support being given a tarball in the form of an IO, and then being able to get individual files within it as IO without resorting to temp files, in a way that's not awfully slow when done repeatedly.
@KristofferC, the tarball would work completely normally with other tools. You can insert what are effectively comment blocks in tarballs that contain this additional information but don't produce any files. The gzip format also has ways to insert "comments", although they're a little sneaky. Zstd has an official way to do it, which is even better.