JuliaIO / Tar.jl

TAR files: create, list, extract them in pure Julia
MIT License
79 stars 19 forks source link

Not recording non-empty directories is problematic #103

Closed maleadt closed 3 years ago

maleadt commented 3 years ago

I'm tarring a rootfs directory that looks as follows:

$ ls -la                                                                                                                                                         
total 100K
drwxr-xr-x  17 tim tim 4,0K apr 16 16:48 ./
drwxr-xr-x 320 tim tim  32K apr 16 19:42 ../
lrwxrwxrwx   1 tim tim    7 apr 16 16:48 bin -> usr/bin/
drwxr-xr-x   2 tim tim 4,0K apr 16 16:48 boot/
drwxr-xr-x   2 tim tim 4,0K apr 16 16:48 dev/
drwxr-xr-x  37 tim tim 4,0K apr 16 16:48 etc/
drwxr-xr-x   2 tim tim 4,0K apr 16 16:48 home/
lrwxrwxrwx   1 tim tim    7 apr 16 16:48 lib -> usr/lib/
lrwxrwxrwx   1 tim tim    9 apr 16 16:48 lib32 -> usr/lib32/
lrwxrwxrwx   1 tim tim    9 apr 16 16:48 lib64 -> usr/lib64/
lrwxrwxrwx   1 tim tim   10 apr 16 16:48 libx32 -> usr/libx32/
drwxr-xr-x   2 tim tim 4,0K apr 16 16:48 media/
drwxr-xr-x   2 tim tim 4,0K apr 16 16:48 mnt/
drwxr-xr-x   2 tim tim 4,0K apr 16 16:48 opt/
drwxr-xr-x   2 tim tim 4,0K apr 16 16:48 proc/
drwxr-xr-x   2 tim tim 4,0K apr 16 16:48 root/
drwxr-xr-x   4 tim tim 4,0K apr 16 16:48 run/
lrwxrwxrwx   1 tim tim    8 apr 16 16:48 sbin -> usr/sbin/
drwxr-xr-x   2 tim tim 4,0K apr 16 16:48 srv/
drwxr-xr-x   2 tim tim 4,0K apr 16 16:48 sys/
drwxr-xr-x   2 tim tim 4,0K apr 16 16:48 tmp/
drwxr-xr-x  13 tim tim 4,0K apr 16 16:48 usr/
drwxr-xr-x  11 tim tim 4,0K apr 16 16:48 var/

I'll be focusing on the usr directory, which contains more directories and files (i.e. it's not empty):

$ ls -la usr                                                                                                                                                     
total 60K
drwxr-xr-x 13 tim tim 4,0K apr 16 16:48 ./
drwxr-xr-x 17 tim tim 4,0K apr 16 16:48 ../
drwxr-xr-x  2 tim tim  12K apr 16 16:48 bin/
drwxr-xr-x  2 tim tim 4,0K apr 16 16:48 games/
drwxr-xr-x  2 tim tim 4,0K apr 16 16:48 include/
drwxr-xr-x 20 tim tim 4,0K apr 16 16:48 lib/
drwxr-xr-x  2 tim tim 4,0K apr 16 16:48 lib32/
drwxr-xr-x  2 tim tim 4,0K apr 16 16:48 lib64/
drwxr-xr-x  2 tim tim 4,0K apr 16 16:48 libx32/
drwxr-xr-x 10 tim tim 4,0K apr 16 16:48 local/
drwxr-xr-x  2 tim tim 4,0K apr 16 16:48 sbin/
drwxr-xr-x 42 tim tim 4,0K apr 16 16:48 share/
drwxr-xr-x  2 tim tim 4,0K apr 16 16:48 src/

If I create a tarball with Tar.jl and GNU tar, and then list the tarball's contents using tar -tvf, there's some differences. One difference is that Tar.jl's tarball doesn't list the usr and bin directories themselves, because it doesn't encode those directories' metadata.

GNU Tar:

drwxr-xr-x tim/tim           0 2021-04-16 16:48 ./usr/
drwxr-xr-x tim/tim           0 2021-04-16 16:48 ./usr/bin/
-r-xr-xr-x tim/tim        7786 2021-04-16 16:48 ./usr/bin/lcf

Tar.jl:

-rwxr-xr-x 0/0            7786 1970-01-01 01:00 usr/bin/lcf

As per the README, this is (currently) by design. That does introduce issues with some tools, though, so I figured I'd create an issue documenting those. One of those tools is Docker, whose docker import tool can read a rootfs tarball to create a layer. However, lacking an entry for those directories, the permissions on those directories are "wrong", essentially rendering the created image useless for non-root uses:

julia> open(`docker import -`, "w", stdout) do io
         Tar.create("/home/tim/Julia/depot/artifacts/e0c005e71308ffa5b7a858ce435c1083fbb8702d", io)
       end
sha256:5d377e83f9734b0e4f4558b301bce3b045952cada76ac4c145e779fee5e37765
Process(`docker import -`, ProcessExited(0))

shell> docker run 5d377e83f9734b0e4f4558b301bce3b045952cada76ac4c145e779fee5e37765 ls -la /
total 56
drwxr-xr-x   1 root root 4096 Apr 20 10:55 .
drwxr-xr-x   1 root root 4096 Apr 20 10:55 ..
-rwxr-xr-x   1 root root    0 Apr 20 10:55 .dockerenv
lrwxrwxrwx   1 root root    7 Jan  1  1970 bin -> usr/bin
drwxr-xr-x   2 root root 4096 Jan  1  1970 boot
drwxr-xr-x   5 root root  340 Apr 20 10:55 dev
drw-------   1 root root 4096 Apr 20 10:55 etc
drwxr-xr-x   2 root root 4096 Jan  1  1970 home
lrwxrwxrwx   1 root root    7 Jan  1  1970 lib -> usr/lib
lrwxrwxrwx   1 root root    9 Jan  1  1970 lib32 -> usr/lib32
lrwxrwxrwx   1 root root    9 Jan  1  1970 lib64 -> usr/lib64
lrwxrwxrwx   1 root root   10 Jan  1  1970 libx32 -> usr/libx32
drwxr-xr-x   2 root root 4096 Jan  1  1970 media
drwxr-xr-x   2 root root 4096 Jan  1  1970 mnt
drwxr-xr-x   2 root root 4096 Jan  1  1970 opt
dr-xr-xr-x 319 root root    0 Apr 20 10:55 proc
drw-------   2 root root 4096 Apr 20 10:41 root
drw-------   4 root root 4096 Apr 20 10:41 run
lrwxrwxrwx   1 root root    8 Jan  1  1970 sbin -> usr/sbin
drwxr-xr-x   2 root root 4096 Jan  1  1970 srv
dr-xr-xr-x  13 root root    0 Apr 20 10:55 sys
drwxr-xr-x   2 root root 4096 Jan  1  1970 tmp
drw-------  13 root root 4096 Apr 20 10:41 usr
drw-------  11 root root 4096 Apr 20 10:41 var

shell> docker run 5d377e83f9734b0e4f4558b301bce3b045952cada76ac4c145e779fee5e37765 ls -la /usr
total 60
drw------- 13 root root  4096 Apr 20 10:41 .
drwxr-xr-x  1 root root  4096 Apr 20 10:55 ..
drw-------  2 root root 12288 Apr 20 10:41 bin
drwxr-xr-x  2 root root  4096 Jan  1  1970 games
drwxr-xr-x  2 root root  4096 Jan  1  1970 include
drw------- 20 root root  4096 Apr 20 10:41 lib
drwxr-xr-x  2 root root  4096 Jan  1  1970 lib32
drw-------  2 root root  4096 Apr 20 10:41 lib64
drwxr-xr-x  2 root root  4096 Jan  1  1970 libx32
drw------- 10 root root  4096 Apr 20 10:41 local
drw-------  2 root root  4096 Apr 20 10:41 sbin
drw------- 42 root root  4096 Apr 20 10:41 share
drwxr-xr-x  2 root root  4096 Jan  1  1970 src

As you can see, the permissions of usr and usr/bin are too restrictive. At the same time, usr/src is fine because it's an empty directory which is part of the tarball. Maybe the 'only if empty' rule should be extended to all directories so that tools like docker import don't stumble on Tar.jl's tarballs?

Note that docker import is probably to blame here too, as unpacking Tar.jl's tarballs using GNU Tar does result in proper permissions (but with weird timestamps as only usr/src was recorded by Tar.jl with a zero timestamp):

$ ls -la                                                                                                                                                                                                                   
total 0
drwxr-xr-x 17 tim  tim   460 apr 20 12:40 ./
drwxrwxrwt 28 root root  900 apr 20 13:01 ../
lrwxrwxrwx  1 tim  tim     7 jan  1  1970 bin -> usr/bin/
drwxr-xr-x  2 tim  tim    40 jan  1  1970 boot/
drwxr-xr-x  2 tim  tim    40 jan  1  1970 dev/
drwxr-xr-x 37 tim  tim  1,7K apr 20 12:40 etc/
drwxr-xr-x  2 tim  tim    40 jan  1  1970 home/
lrwxrwxrwx  1 tim  tim     7 jan  1  1970 lib -> usr/lib/
lrwxrwxrwx  1 tim  tim     9 jan  1  1970 lib32 -> usr/lib32/
lrwxrwxrwx  1 tim  tim     9 jan  1  1970 lib64 -> usr/lib64/
lrwxrwxrwx  1 tim  tim    10 jan  1  1970 libx32 -> usr/libx32/
drwxr-xr-x  2 tim  tim    40 jan  1  1970 media/
drwxr-xr-x  2 tim  tim    40 jan  1  1970 mnt/
drwxr-xr-x  2 tim  tim    40 jan  1  1970 opt/
drwxr-xr-x  2 tim  tim    40 jan  1  1970 proc/
drwxr-xr-x  2 tim  tim    80 apr 20 12:40 root/
drwxr-xr-x  4 tim  tim   100 apr 20 12:40 run/
lrwxrwxrwx  1 tim  tim     8 jan  1  1970 sbin -> usr/sbin/
drwxr-xr-x  2 tim  tim    40 jan  1  1970 srv/
drwxr-xr-x  2 tim  tim    40 jan  1  1970 sys/
drwxr-xr-x  2 tim  tim    40 jan  1  1970 tmp/
drwxr-xr-x 13 tim  tim   260 apr 20 12:40 usr/
drwxr-xr-x 11 tim  tim   260 apr 20 12:40 var/

$ ls -la usr                                                                                                                                                                                                               
total 0
drwxr-xr-x 13 tim tim  260 apr 20 12:40 ./
drwxr-xr-x 17 tim tim  460 apr 20 12:40 ../
drwxr-xr-x  2 tim tim 8,5K apr 20 12:40 bin/
drwxr-xr-x  2 tim tim   40 jan  1  1970 games/
drwxr-xr-x  2 tim tim   40 jan  1  1970 include/
drwxr-xr-x 20 tim tim  440 apr 20 12:40 lib/
drwxr-xr-x  2 tim tim   40 jan  1  1970 lib32/
drwxr-xr-x  2 tim tim   60 apr 20 12:40 lib64/
drwxr-xr-x  2 tim tim   40 jan  1  1970 libx32/
drwxr-xr-x 10 tim tim  220 apr 20 12:40 local/
drwxr-xr-x  2 tim tim 2,5K apr 20 12:40 sbin/
drwxr-xr-x 42 tim tim  840 apr 20 12:40 share/
drwxr-xr-x  2 tim tim   40 jan  1  1970 src/
StefanKarpinski commented 3 years ago

This could be changed at the cost of two things:

  1. Slightly bigger tarballs (not really a big deal)
  2. Breaking reproducibility with past versions of Tar.create

I don't know if we're relying the tarballs that we generate being reproducible anywhere. cc @staticfloat?

StefanKarpinski commented 3 years ago

Oh, is there a umask set when you're doing this?

maleadt commented 3 years ago

Nope, just the default 022.

staticfloat commented 3 years ago

No, we don't need reproducibility yet. And if we did, we would likely check it through tree hashing, and not through sha hashing.

StefanKarpinski commented 3 years ago

Ok, I think we should go ahead and change this now, so I've made a PR which I'm going to merge.

StefanKarpinski commented 3 years ago

Oh, in the process I discovered a set of obscure bugs which I fixed: https://github.com/JuliaIO/Tar.jl/pull/105. I discovered this by accidentally changing the tarballs we produce to include each directory after the contents of that directory, which none of the tarball consuming functions handled correctly, each broken in a slightly different way. So we inadvertently also made this package more robust.