nigeltao opened 3 years ago
https://github.com/google/fuse-archive is an alternative implementation with linear (not quadratic) complexity.
I did some benchmarks to visualize this problem. The setup is simple: create an archive containing a single file and measure how long it takes to read that file. Repeat for different file sizes, tools, and compression backends.
Make a `.tar` archive containing a single large file, `archivemount` it, and copy out the single element in the archive. Notice that the "MB/s" `dd` throughput slows down (in the interactive output, not copy/pasted above) as time goes on. It looks like there's some sort of super-linear algorithm involved, and the whole thing takes more than a minute. In comparison, a straight `tar` extraction takes less than a second.

Sprinkling some logging in the
`_ar_read` function in `archivemount.c` shows that the `dd` leads to multiple `_ar_read` calls. In the steady state, it reads `size = 128 KiB` each time, with the `offset` argument incrementing on each call.

Inside that
`_ar_read` function, if we take the `if (node->modified)` false branch, then for each call it:

- calls `archive_read_new`, even though it's the same archive every time.
- re-finds the `archive_entry` for the file. Again, this is once per call, even though it's conceptually the same `archive_entry` object used in previous calls.
- reads `offset` bytes from the start of the entry's contents into a 'trash' buffer, before finally copying out the payload that `_ar_read` is actually interested in.

The total number of bytes produced by
`archive_read_data` calls is therefore quadratic in the decompressed size of the archive entry. This is slow enough for `.tar` files but probably worse for `.tar.gz` files. The whole thing is reminiscent of the Shlemiel the Painter story.

There may be some complications if we're re-writing the archive, but when mounting read-only, we should be able to get much better performance.