JuliaIO / Tar.jl

TAR files: create, list, extract them in pure Julia
MIT License

optimization: use sendfile in create and extract #33

Open StefanKarpinski opened 4 years ago

StefanKarpinski commented 4 years ago

It would be faster to use sendfile or equivalent for the data transfer part of tarball creation and extraction instead of a user-space buffered read/write loop. Relevant code that should be optimized:

https://github.com/JuliaIO/Tar.jl/blob/b8bd833254b48428f1ce0bf4/src/create.jl#L225-L231
https://github.com/JuliaIO/Tar.jl/blob/b8bd833254b48428f1ce0bf4/src/extract.jl#L288-L293

Keno commented 4 years ago

Hah, I was complaining about this to Elliot a few days ago, since we're using Tar.jl for the rr traces and the tar'ing up step is too slow. For some numbers, on my benchmark Tar.jl uses 60% of one core in addition to 100% of gzip. Regular tar uses about 6%, which probably suggests that whatever buffer size Tar.jl currently uses is too small. As you mentioned, the correct thing to do is to splice the file directly into the output I/O stream to have Tar.jl CPU utilization be 0.

Keno commented 4 years ago

And indeed using a faster compressor, like zstd, this becomes a bottleneck with Tar.jl taking 28s vs about 2s for regular tar.

StefanKarpinski commented 4 years ago

Using a bigger buffer would be pretty easy—currently it's 512 bytes, which is very small. But it feels very unnecessary to use a buffer here at all. Do we have an API that exposes sendfile? The other issue is when Tar.jl is used with TranscodingStreams and CodecZlib, in which case the destination (for create) or source (for extract) is not a real file handle anyway and what we'd want ideally is a way to have TranscodingStreams send the data directly to the output stream.
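
For illustration, the "bigger buffer" option could look roughly like the sketch below. This is not Tar.jl's actual code: `copy_data` and its `buf_size` keyword are hypothetical names, and the 2 MiB default is an arbitrary assumption rather than a measured optimum. The data still passes through user space, so it only reduces the per-call overhead and does not replace a sendfile-style transfer.

```julia
# Hypothetical helper (not part of Tar.jl): copy exactly `size` bytes from
# `src` to `dst` through one reusable buffer of at most `buf_size` bytes.
function copy_data(dst::IO, src::IO, size::Integer; buf_size::Integer = 2 * 1024 * 1024)
    remaining = Int(size)
    # Never allocate more than needed, but keep at least one byte of buffer.
    buf = Vector{UInt8}(undef, Int(min(buf_size, max(remaining, 1))))
    while remaining > 0
        # Read at most one buffer's worth; `n` is the number of bytes actually read.
        n = readbytes!(src, buf, min(remaining, length(buf)))
        n == 0 && throw(EOFError())   # source ended before `size` bytes arrived
        write(dst, view(buf, 1:n))    # write only the bytes that were read
        remaining -= n
    end
    return Int(size)
end
```

With a 512-byte buffer, a 1 GiB file takes about two million read/write iterations; with a 2 MiB buffer it takes 512.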

StefanKarpinski commented 4 years ago

I would also be ok with not using TranscodingStreams in performance-sensitive situations, creating JLLs for gzip and co instead (for portability) and then using sendfile to send data to/from the external gzip process without needing to pass through Julia's user space at all.

giordano commented 4 years ago

> creating JLLs for gzip

I think I have it in a branch already; I just never opened the PR.

StefanKarpinski commented 4 years ago

Having those as external programs via JLL would be nice in any case because doing compression/decompression via a pipe is often both efficient and convenient.
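
A minimal sketch of the pipe approach follows, assuming a `gzip` executable is available on `PATH` (a JLL-provided binary would be used the same way). The helper name `create_tar_gz` is hypothetical and not part of Tar.jl. The tar stream still passes through Julia here; only the compression moves to the external process.

```julia
using Tar

# Hypothetical helper (not part of Tar.jl): stream the tarball for `dir`
# into an external `gzip` process whose stdout is redirected to `out_path`.
# Assumes a `gzip` executable can be found on PATH.
function create_tar_gz(dir::AbstractString, out_path::AbstractString)
    cmd = pipeline(`gzip -c`, stdout = out_path)
    # Opening the command in "w" mode gives an IO connected to gzip's stdin.
    open(cmd, "w") do gz
        Tar.create(dir, gz)  # Tar.jl writes the uncompressed tar stream to gzip
    end
    return out_path
end
```

The do-block form of `open` waits for the process to exit and throws if `gzip` fails, so a compression error is not silently ignored.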