alexcrichton / tar-rs

Tar file reading/writing for Rust
https://docs.rs/tar
Apache License 2.0
625 stars 184 forks source link

Add support for unpacking using custom Unpacker API #355

Open alessandrod opened 7 months ago

alessandrod commented 7 months ago

Hello!

This PR proposes introducing a new Unpacker trait to the API, and new unpack_with methods to Archive and Entry. The aim is to offer users a way to use specialized fs and IO operations to speed up unpacking.

Background: I work on a program that unpacks a really large .tar.zst (60G compressed, 188G uncompressed) at launch. Using tar-rs main which uses regular fs ops and io::copy, it takes about 4m30s to unpack. Using io_uring on linux, with pre-allocated huge page buffers etc, it takes about 1m30s uncompressing at ~1.8GB/s which is the max zstd can do.

This PR introduces the API I need to plumb io_uring, without it being specific to io_uring.

cgwalters commented 4 months ago

The high level goal makes total sense. That said though, is your software already doing decompression in a separate thread from the one doing unpacking?

If you just implement this software in the obvious way of putting a zstd decoder reader in front of tar::Archive, then a lot of parallelism is lost versus e.g. creating a Unix pipe or equivalent and two threads.

I'm curious, is your program also processing multiple entries asynchronously? IOW when you have multiple regular files in a row, you're writing them to disk simultaneously?

This all said...this is a big PR, and it's not clear to me it's really a win versus just directly processing the entries externally.

Or maybe instead of this PR, what you could do is maintain e.g. a tokio-tar-uring crate (that's a whole other thing in this, there's also https://docs.rs/tokio-tar/latest/tokio_tar/ and many people doing stuff like this want to do async too).

alessandrod commented 4 months ago

Thanks for taking a look! :)

The high level goal makes total sense. That said though, is your software already doing decompression in a separate thread from the one doing unpacking?

If you just implement this software in the obvious way of putting a zstd decoder reader in front of tar::Archive, then a lot of parallelism is lost versus e.g. creating a Unix pipe or equivalent and two threads.

I'm curious, is your program also processing multiple entries asynchronously? IOW when you have multiple regular files in a row, you're writing them to disk simultaneously?

Yes I'm writing multiple entries asynchronously, and I'm doing a bunch of other stat/link etc syscalls asynchronously with io-uring too. In my case the tarball is large and contains many entries (~2 million files), so syscall and fs overhead are significant.

I want to do writes and all other syscalls as fast as possible.

I have profiled what I'm doing extensively, wrote a custom profiler for it and gave a talk about it at a rustau meetup a while ago if you're interested https://www.youtube.com/watch?v=JX0aVnpHomk.

This all said...this is a big PR, and it's not clear to me it's really a win versus just directly processing the entries externally.

Not entirely sure I understand what you mean? Externally how?

Or maybe instead of this PR, what you could do is maintain e.g. a tokio-tar-uring crate (that's a whole other thing in this, there's also https://docs.rs/tokio-tar/latest/tokio_tar/ and many people doing stuff like this want to do async too).

Sure I can fork, but I'd rather not fork if I can avoid it.

cgwalters commented 4 months ago

First, bear in mind I have no “official” maintainer role on this repository. I just commented because I have other outstanding PRs here and I try to help.

This all said...this is a big PR, and it's not clear to me it's really a win versus just directly processing the entries externally.

Not entirely sure I understand what you mean? Externally how?

Maybe I am misunderstanding but it seems like you are changing the unpacking code here to have it drive a virtual file system API effectively.

But you can just use the existing ability to read the data directly from the tar archive and have your own code do the unpacking, right? The cost here would just be needing to handle some of the “normalization” (especially security related) that the code here does. That said IMO since you are only targeting Linux you can use openat2 RESOLVE_BENEATH anyways.

Sure I can fork, but I'd rather not fork if I can avoid it.

Yeah, of course. That said maybe what we really need in the ecosystem is a “sans-io” tar crate.

alessandrod commented 4 months ago

First, bear in mind I have no “official” maintainer role on this repository. I just commented because I have other outstanding PRs here and I try to help.

I figured, all help is welcome :)

This all said...this is a big PR, and it's not clear to me it's really a win versus just directly processing the entries externally. > Not entirely sure I understand what you mean? Externally how? Maybe I am misunderstanding but it seems like you are changing the unpacking code here to have it drive a virtual file system API effectively. But you can just use the existing ability to read the data directly from the tar archive and have your own code do the unpacking, right? The cost here would just be needing to handle some of the “normalization” (especially security related) that the code here does. That said IMO since you are only targeting Linux you can use openat2 RESOLVE_BENEATH anyways.

It's not "just" normalization tho, unpack() does many things: https://github.com/alexcrichton/tar-rs/blob/main/src/entry.rs#L462

It preserves mtimes/ownerships handles tar extensions etc. Sure I can do it in my code, but I'm trying to avoid having to do exactly that and for everyone who wants to extract large tarballs efficiently to have to do it too.

Sure I can fork, but I'd rather not fork if I can avoid it. Yeah, of course. That said maybe what we really need in the ecosystem is a “sans-io” tar crate.

Yeah - isn't this PR a step in that direction? There's a default IO implementation, or you can bring your own via the Unpacker trait. I did call it unpack_with_io in an earlier revision 😂

My view is this: this crate is very popular and it does do IO today. It works well for reasonably large files. It doesn't work so well for very large files, and there's nothing a consumer of the API can do about it other than forking and becoming a tar expert. We could create another sans-io crate, but that wouldn't fix the issue that unpacking large files with this crate is not great, and that seems like a fixable thing not something that justifies a whole new crate.

If we had something like the Unpacker trait in this PR, then I would be happy to maintain the fast io-uring code in a separate crate. In this crate we could add docs to unpack() that say something like "The default implementation works for reasonably large files. If you have very large files, consider using unpack_with() with your own Unpacker or use the <to-be-created-tar-unpack-uring> crate on linux". Then an user of the crate could do something about large files without having to learn anything about tar itself.

Lastly, I concede that the PR is big, but it's mostly shuffling existing nested function definitions around. I can probably reduce the diff - I didn't spend too much time trying to make the perfect PR because I was seeking feedback on the general direction. And again, I do appreciate your feedback!

cgwalters commented 4 months ago

It preserves mtimes/ownerships handles tar extensions etc.

The default Entry reader is handling some tar extensions (e.g. at least GNU long links, etc.) unless you set raw mode.

As far as preserving mtimes and ownership - that's the thing, ultimately you as a consumer of this "custom unpacking api" would need to bridge between the mtimes provided and an io-uring or whatever other API anyways, right? It seems to me the main advantage of an "unpacker API" here is that the implementation would effectively be forced to handle all the tar attributes (or at least, explicitly ignore them).

As far as other tar extensions like xattrs; I think we should have a nice high level API to access them that's used by the unpacking code.

(This is very tangential but we have a different problem in that there's no API to build archives with some of those extensions)