fredrikekre opened 3 years ago
Thoughts about API:
Would it be better to extract to an IO handle instead of producing a byte vector? After all, it's easy to convert an IO handle to a byte vector (just call read on it), but if the file is large enough that you don't want to extract it all into memory, then a byte-vector API leaves you stuck with no way to avoid that.
It might be more general to extract the contents of multiple files to an IO handle based on a predicate. There are times when you might just want to concatenate the contents of all the files (e.g. to search them). The pattern could be a function, a regex matching the path name, or just a string, in which case that must be exactly the path name.
If the path is a directory in the tarball, we could extract the contents of all files under that directory. Or we could error.
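The three proposed pattern kinds (function, regex, exact path string) could be normalized into a single predicate via dispatch. A hypothetical helper to illustrate the idea, not part of the PR:

```julia
# Hypothetical helper: turn each proposed pattern type into a predicate
# on tarball path names.
to_predicate(f::Function) = f                        # already a predicate
to_predicate(r::Regex) = path -> occursin(r, path)   # regex match on the path
to_predicate(s::AbstractString) = path -> path == s  # exact path name only

matches = to_predicate(r"\.txt$")
matches("dir/file.txt")            # true: path ends in .txt
to_predicate("file.txt")("file.txt")  # true: exact match only
```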
Would it be better to extract to an IO handle instead of producing a byte vector?
Yea, I thought about that too so I will update it.
It might be more general to extract the contents of multiple files to an IO handle based on a predicate.
Okay. You might risk reading more files than you actually intended if the predicate is not strict enough, though.
True but if it's a string it will only ever match one file, which seems safe enough. We could return the number of files matched. Another thought I had was invoking a callback for each tarball entry with the IO handle of the data stream. But that may be a bit too low-level an API even though it is maximally flexible.
You mean that if you give a str::String as a predicate it would only be allowed to match one thing? In

```
$ tree .
.
├── dir
│   ├── elif.txt
│   └── file.txt
└── file.txt

1 directory, 3 files
```

would predicate = "file.txt" only match /file.txt?
Since the result is written to an IO handle instead of being returned, perhaps the return value could be something like an array of matched files and byte counts:

```julia
ret = [
    ("file1", nbytes1),
    ("file2", nbytes2),
]
```

That could be useful if you want to split the returned stream yourself.
Edit: It should just return all the headers that correspond to the file....
To me, giving a single file to extract makes the most sense. If you want to extract multiple files, you just write a loop over the (possibly filtered) files and pass in the same io object?
Fair enough. Is there an urgent need for this? I'm on parental leave until at least next month.
No, there is no urgency.
Trying this out a bit in Pkg, it allocates 40GB to read all files in the registry, and most of the time is spent on allocating buffer vectors. There should be a few low-hanging fruits.
Edit: Lowest hanging fruit removed, next one seems to be allocations for all the headers.
This is what I came up with for use in Pkg: https://gist.github.com/KristofferC/b354b654e776f4910b03985830418723
Here it is in action in Pkg: https://github.com/JuliaLang/Pkg.jl/pull/2431/. Ended up not using anything from this PR.
After trying to use this for a while, I don't think this API works very well. The problem is that to extract a single file you still need to parse through all headers (or at least until you find the file). So I think there needs to be some intermediate object created that does this for you and caches it (basically https://github.com/JuliaLang/Pkg.jl/pull/2431/files#diff-81abdf925a6c6e19622f1607e7777ae3bd3caabc942b73a7b33cbc0b96a9edff).
So my reading of that particular Pkg application is that rather than having a function to extract a single file at a time (calling that repeatedly makes the whole thing O(n^2)), you really want a way to iterate through the entire tarball and populate an in-memory representation of its contents. I'd also rather avoid exposing special types like InMemoryFileSystem and InMemoryFile, and maybe have an API that converts the tarball into a dict or pair-vector of paths to byte vectors or something like that. Or even return a Vector{Pair{Path, SubArray{UInt8, 1}}}: we can extract the entire tarball into a single memory buffer and then create subvectors that refer to parts of that vector. The idea is that extracting the tarball into memory should be fast, and then all you're really doing is building a simple index structure that refers to that extracted memory.
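The single-buffer-plus-views idea could be sketched like this (the function name and the precomputed path ranges are hypothetical stand-ins for what a tarball scan would produce):

```julia
# Sketch: extract the whole tarball into one byte vector, then build
# path => SubArray views into it instead of copying each file's contents.
function index_buffer(entries::Vector{Tuple{String,UnitRange{Int}}}, buf::Vector{UInt8})
    index = Pair{String,SubArray{UInt8,1}}[]
    for (path, range) in entries
        # `view` creates a window into `buf` without copying any bytes
        push!(index, path => view(buf, range))
    end
    return index
end

# Toy usage: two "files" laid out back-to-back in a single buffer.
buf = Vector{UInt8}("hello world")
idx = index_buffer([("a.txt", 1:5), ("b.txt", 7:11)], buf)
String(idx[1].second)  # "hello" — materialized only when you ask for it
```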
The reason for the way it is done in Pkg is to avoid having to read data that isn't used. So that's why it just reads the headers and remembers position + size and only reads the data (and caches it) on demand.
Regarding the special types, they are not meant to be "public", it was just a way I saw it convenient to structure the internals for the Pkg use case.
When we decompress the tarball, we've already read all the data, so we've paid that cost already. We might as well keep it around in memory: the only cost of doing so is the size of a decompressed General tarball, which is currently 25MB, not so big these days. If we use the SubArray{UInt8,1} approach, we only need the original buffer where the tarball was extracted and we never need to copy it. I'll code it up and we can see how it performs.
When we decompress the tarball, we've already read all the data
The tarball is decompressed once (when the registry is downloaded). Not every time you read the registry.
The tarball is decompressed once (when the registry is downloaded). Not every time you read the registry.
I was planning to store it compressed to save disk space (yes, we use a ton of disk space, but every byte counts). If we're going to decompress it on disk, then we could use mmap, access it like an in-memory array and let the kernel decide when to page the contents in/out of memory. That seems easier and more efficient than seeking around the tarball I/O handle.
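The mmap alternative mentioned above can be sketched with the Mmap standard library; a throwaway temp file stands in for the real decompressed registry tarball here:

```julia
using Mmap

# Write a dummy 1024-byte "tarball" to a temp file so the sketch is
# self-contained.
path, io = mktemp()
write(io, zeros(UInt8, 1024))
close(io)

# Map the file into memory: `data` behaves like a plain byte vector, but
# the kernel pages contents in and out on demand, so only regions you
# actually touch become resident.
data = Mmap.mmap(path, Vector{UInt8})

# Reading a 512-byte tar header block at some offset is then just indexing.
header_block = @view data[1:512]
```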
Currently we redo all the work to load the registry on each Pkg operation, which made sense when it was possible for the registry contents to change out from under you with each command. But when the registry is a single, immutable file, I think it makes more sense to cache the registry data structure more persistently. If the registry changes, it will be a different file entirely. So I think it makes sense to:
We could use a WeakRef to store the registry cache so that it gets cleaned up only when there is GC pressure. We could either use the size+mtime+ctime+inode of the compressed registry file as a cache key and reload when that changes or just flush the cache when some Pkg operation in this process has actually updated the registry and not otherwise. Or both.
Currently we redo all the work to load the registry on each Pkg operation, which made sense when it was possible for the registry contents to change out from under you with each command
After https://github.com/JuliaLang/Pkg.jl/pull/2438 the registry retrieved from PkgServer is considered immutable and it is "cached" through Pkg operations. It caches things based on the git-tree-sha of the registry.
Cache the parsed TOML data as well
That's been done for a while with
It is useful for when the registry is a git repo since we don't cache the registry itself then, but for pkg server registries (with the new cache), it doesn't really do anything since it will never even attempt to read the same TOML file twice.
Regarding decompression/compression, I think I measured it at around 0.1 seconds to decompress, so perhaps it's fine to keep it decompressed.
Here is an example implementation + benchmark of keeping things compressed:

```julia
using Downloads, Tar, Pkg
using Pkg: PlatformEngines.exe7z

reg_url = "https://pkg.julialang.org/registry/23338594-aafe-5451-b93e-139f81909106/28c601505a512da4e5abebc75cd448d48b8dac9c"
reg_file = "registry.tar.gz"
Downloads.download(reg_url, reg_file)

function untar_memory(tar_gz::AbstractString)
    data = Dict{String, Vector{UInt8}}()
    buf = Vector{UInt8}(undef, Tar.DEFAULT_BUFFER_SIZE)
    io = IOBuffer()
    open(`$(exe7z()) x $tar_gz -so`) do tar
        Tar.read_tarball(x -> true, tar; buf=buf) do hdr, _
            if hdr.type == :file
                Tar.read_data(tar, io; size=hdr.size, buf=buf)
                data[hdr.path] = take!(io)
            end
        end
    end
    return data
end

using BenchmarkTools
@btime untar_memory(reg_file)
```
Takes about 100ms.
Just doing @btime read(`$(exe7z()) x $reg_file -so`) takes about 60 ms.
That seems like a good approach then, doesn't it? Storing the data as a Dict{String,String} might be a little conceptually simpler and might be just as fast, since String(take!(io)) is zero-copy.
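The zero-copy point can be seen in a small self-contained snippet: String(take!(io)) adopts the IOBuffer's underlying byte vector rather than copying it.

```julia
# take! hands over the IOBuffer's byte vector, and the String constructor
# takes ownership of that vector's storage instead of copying it.
io = IOBuffer()
write(io, "some file contents")
s = String(take!(io))
s  # "some file contents"
```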
I keep coming back to this seemingly simple feature and getting caught up in all the unfortunate corner cases.
What we want this to do is something equivalent to extracting the tarball somewhere temporary and then reading a single file out (or a set of files based on some criterion). Seems simple enough: read through the tarball, skipping over anything that isn't the file you're interested in; when you find a header for the file you want, copy its data to the given IO handle. And that simple strategy is what this PR implements.
So what's the problem? Unfortunately, tar files can have multiple entries for the same output path. If you're extracting the tarball, that's not an issue: you extract each entry in order, each overwriting the previous one, and the last one is what remains at the end. But what happens with this implementation? We end up sending the contents of every version of the file to the IO handle, not just the last one. That's very much not the desired behavior of sending the contents of the file that would have been left after extraction. To make sure you extract only the last version of a file to the IO handle, you either have to be able to seek back and forth in the input tarball or you need to buffer the entire contents of the file you're extracting, which could take a lot of memory. If you want to not tax your RAM, it might be preferable to extract a selected set of paths to a temporary location and then read what you're interested in from there ... which is exactly what Tar.extract with a predicate does.
There's also the symlink issue, which is alluded to in the PR as a TODO note. This entails similar problems: if the symlink comes before the entry you want to extract, you can extract by scanning forward only, but if the symlink comes later in the tarball, you have to either go back and find the target file or have it buffered. This situation is worse, though, since the target file could be any file in the tarball, so you'd either have to be able to seek or you'd have to buffer everything. Again, to avoid potentially blowing out memory, you'd want to extract to a temp location and then read the contents afterwards.
So in the case where your tarball is seekable, you can implement this efficiently by reading to the end, keeping track of where each file's data can be found in the tarball, and then extracting the correct path contents at the end. However, a seekable tarball is atypical, since tarballs are usually compressed: input is being read either from a command doing decompression or from a decompressor stream object, neither of which is seekable. In the non-seekable case, the best you can do is extract the entire tarball to a temporary directory and stream the contents of the paths you're interested in to the output IO stream.
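The seekable-tarball strategy above amounts to a one-pass index where later entries for the same path overwrite earlier ones, mirroring what extraction would leave on disk. A sketch with simplified stand-in types (a real implementation would record these from tar headers):

```julia
# Where one file's data lives inside the tarball.
struct Entry
    offset::Int   # byte position of the file's data
    size::Int     # length of the data in bytes
end

# Scan all headers once; for duplicate paths, the last entry wins,
# exactly as it would if the tarball were extracted to disk.
function build_index(headers::Vector{Tuple{String,Int,Int}})
    index = Dict{String,Entry}()
    for (path, offset, size) in headers
        index[path] = Entry(offset, size)
    end
    return index
end

# "a.txt" appears twice: only the second occurrence survives.
idx = build_index([("a.txt", 512, 10), ("b.txt", 1536, 4), ("a.txt", 2048, 10)])
idx["a.txt"].offset  # 2048 — seek here, read 10 bytes, done
```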
Given the complications above, a better API might be Tar.extract_dict(tarball), which extracts in-memory to a Dict{String, String} like what Pkg uses. At least we know there's one very concrete use case for that. Not that there aren't concrete use cases for Tar.extract_file; it's just so challenging to implement efficiently because of the way the TAR format works. It's really not a good format.
Any chance that this might be included in Tar.jl soon? Part of my work involves sensitive data on systems with mainly shared file systems [1], so extracting this data to a temporary location on the file system is a no-go. I modified @KristofferC 's gist (https://gist.github.com/KristofferC/b354b654e776f4910b03985830418723) to handle compression: https://gist.github.com/JBlaschke/ec0f4ece417bdfe1428a85c78c0804f7 and it works pretty well.
I ran some benchmarks where I inflated 187 MB to 1.2GB, analyzed a single file (~800MB) and moved onto the next archive. My application's ram footprint varied from ~3GB to ~7GB -- which is not great, but not terrible either.
I think I can safely say that the workflow: i) extract the entire archive to a Dict{String, String} (overwriting same keys in the order in which the files were archived); ii) take the relevant data from (i); iii) move on to the next archive, is fine by me.
[1]: now that I describe it, it sounds way cooler than it actually is.
Are you asking about the extract_dict functionality or the originally proposed extract_file functionality?
Are you asking about the extract_dict functionality or the originally proposed extract_file functionality?
@StefanKarpinski I am asking for any functionality that will let me extract a given file from a tar archive into an IO buffer. I see that there are quite a few corner cases that might make extract_file impractical. Since many tar archives are also compressed, and the tar format itself requires reading through the entire archive even to extract a single file, extract_dict seems like a reasonable compromise -- especially if it makes progress on this issue.
I've just come across this, and do let me know if I'm misinterpreting the situation, but it looks like the two main challenges are:
I may be missing something, but it seems to me that one potential approach is to have a Tar.scan(tarball) -> ScannedTarball method, where ScannedTarball is a struct that records the positions and ranges of every file in the tarball. It would effectively be a wrapper type around a Dict{String, Pair{Int, Header}} (or similar). From this, one could then get an IOStream for a particular file, or many files, correctly and efficiently, after paying the scan-the-whole-tarball cost once.
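A hypothetical sketch of that shape (all names invented here, with a simplified offset => size index instead of full Header metadata): once the one-time scan has run, reading a single file is just a seek plus a bounded read.

```julia
# Records where each file's data lives within a seekable tarball on disk.
struct ScannedTarball
    path::String                       # the tarball file itself
    index::Dict{String,Pair{Int,Int}}  # file path => (data offset => size)
end

# Read one file's bytes without touching the rest of the tarball.
function readfile(st::ScannedTarball, name::String)
    offset, size = st.index[name]
    open(st.path) do io
        seek(io, offset)
        read(io, size)
    end
end

# Demo with a temp file standing in for a scanned tarball: the "file"
# a.txt occupies 5 bytes starting at offset 4.
path, io = mktemp()
write(io, "XXXXhelloYYYY")
close(io)
st = ScannedTarball(path, Dict("a.txt" => (4 => 5)))
String(readfile(st, "a.txt"))  # "hello"
```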
I do like the API of extracting the tarball to a dict, mapping paths that are files or symlinks to files to their content, and just not including non-files. That's already a functionality that Pkg is using, and it seems generally useful. I'm not sure about extracting an index; as pointed out, since tarballs are usually compressed, you'd have to decompress everything up to the point of the file you want to extract anyway. One option that @staticfloat and I have been kicking around is to have a footer block that contains some magic and an index of where to find each object in the tarball: if the footer exists, you can use it; if it doesn't, we could generate it once and then use that index, which I guess is roughly the proposed API. With regard to compression, it's actually possible to chunk the compression so that you can jump to the start of a specific item in a tarball without having to decode any of the previous chunks. It's a bit tricky, but we might be able to implement this special format, which has the advantage of allowing efficient random-access file extraction while still remaining readable by the classic gzcat and tar commands.
The footer idea is interesting, and a quick search turns up some prior art:
Of course, this only works when you're dealing with a modified tarball. I'd think a progressive-enhancement approach may be a good fit here: look for such a footer if it exists, and if not, fall back on scanning the tarball, with the option to create the footer as you suggest. For speed, I imagine an analogue of --fast-read could be supported for lazy indexing when you know you're dealing with a "well-formed" tarball, by supporting a "partial scan" state where, if a file is not found in the current index and the scan is incomplete, the scan continues until either the file is found or the end of the tarball is reached.
This seems straightforward to me with an uncompressed tarball, but as you note, a compressed tarball is harder. Chunked compression would help, but I wonder if it's possible to create "decompression checkpoints" over the course of the scanning that could be jumped to (for the non-chunked case). A bit like an iterate-ish version of mark (with regard to returning the state instead of modifying the underlying object).
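The "partial scan" state could be sketched like this, hedged heavily: all names are hypothetical, and a plain vector of (path, offset) pairs stands in for headers that haven't been read from the tarball yet.

```julia
# Keep a cursor into the not-yet-scanned headers and only advance it
# when a lookup misses the index built so far.
mutable struct PartialIndex
    index::Dict{String,Int}             # paths indexed so far => data offset
    entries::Vector{Tuple{String,Int}}  # stand-in for unread headers
    cursor::Int
end

PartialIndex(entries) = PartialIndex(Dict{String,Int}(), entries, 1)

# Return the offset for `name`, scanning only as far as necessary.
function lookup!(p::PartialIndex, name::String)
    haskey(p.index, name) && return p.index[name]
    while p.cursor <= length(p.entries)
        path, off = p.entries[p.cursor]
        p.cursor += 1
        p.index[path] = off
        path == name && return off
    end
    return nothing  # scan complete, file not in the tarball
end

p = PartialIndex([("a.txt", 512), ("b.txt", 1536), ("c.txt", 2048)])
lookup!(p, "b.txt")  # 1536 — a.txt was indexed in passing, c.txt untouched
```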
I think there should be a (very) strong use case before we start putting non-standard stuff in the tarballs. The number of people in this issue over two years suggests to me that there isn't really such a strong case.
In any case, functionality like that in https://github.com/devsnd/tarindexer seems preferable to me where you have an external index since you don't mess with the tarball at all then.
For reference, the use case that's brought me to this issue is wanting to support being given a tarball in the form of an IO, and then being able to get individual files within it as IO without resorting to temp files, in a way that's not awfully slow when done repeatedly.
@KristofferC, the tarball would work completely normally with other tools. You can insert what are effectively comment blocks in tarballs that contain this additional information but don't produce any files. The gzip format also has ways to insert "comments", although they're a little sneaky. Zstd has an official way to do it, which is even better.