bazelbuild / rules_pkg

Bazel rules for creating packages of many types (zip, tar, deb, rpm, ...)
Apache License 2.0
217 stars 174 forks source link

pkg_files and pkg_tar produce different tarballs. #670

Closed AustinSchuh closed 1 year ago

AustinSchuh commented 1 year ago
pkg_files(
    name = "test-pkg-files-with-attributes",
    srcs = [ 
        "//tests:testdata/loremipsum.txt",
    ],  
    attributes = pkg_attributes(
        group = "1000",
        mode = "0440",
        user = "0",
    ),
    prefix = "/foo/bar",
)

pkg_tar(
    name = "test-pkg-tar-from-pkg-files-with-attributes",
    srcs = [ 
        ":test-pkg-files-with-attributes",
    ],  
)

pkg_tar(
    name = "test-pkg-tar-with-attributes",
    srcs = [ 
        "//tests:testdata/loremipsum.txt",
    ],  
    owner = "0.1000",
    package_dir = "/foo/bar",
)

I would expect both of these to create the same .tar

austin[169160] aschuh-3950x (main) ~/local/rules_pkg/tests/tar
$ tar tvf ../../bazel-bin/tests/tar/test_pkg_tar_with_attributes.tar; tar tvf ../../bazel-bin/tests/tar/test_pkg_tar_from_pkg_files_with_attributes.tar
drwxr-xr-x 0/1000            0 1999-12-31 16:00 foo/
drwxr-xr-x 0/1000            0 1999-12-31 16:00 foo/bar/
-r--r----- 0/1000          543 1999-12-31 16:00 foo/bar/loremipsum.txt
drwxr-xr-x 0/1000            0 1999-12-31 16:00 foo/
drwxr-xr-x 0/1000            0 1999-12-31 16:00 foo/bar/
-r--r----- 0/1000          543 1999-12-31 16:00 foo/bar/loremipsum.txt

But, the sha256 doesn't match.

austin[169202] aschuh-3950x (main) ~/local/rules_pkg/tests/tar
$ sha256sum ../../bazel-bin/tests/tar/test-pkg-tar-with-attributes.tar ../../bazel-bin/tests/tar/test-pkg-tar-from-pkg-files-with-attributes.tar
3567e6a8e5286da1052853fa4b754d57638625a1574aae48ad493152d3e2ef95  ../../bazel-bin/tests/tar/test-pkg-tar-with-attributes.tar
2f031ccaaf413a05f304ebeb3a29281e22de7028ca9b274fec2094f92e5cd20c  ../../bazel-bin/tests/tar/test-pkg-tar-from-pkg-files-with-attributes.tar

This isn't purely academic, the difference causes one of them to be installable for us with busybox tar, and the other to not have permissions which work.

austin[169162] aschuh-3950x (main) ~/local/rules_pkg/tests/tar
$ cat inspect_tar.py 
#!/usr/bin/python3

import tarfile
import sys
import os

a = tarfile.open(sys.argv[1])
m = a.getmember(sys.argv[2])
print('uid', m.uid)
print('gid', m.gid)
print('uname', repr(m.uname))
print('gname', repr(m.gname))
austin[169198] aschuh-3950x (main) ~/local/rules_pkg/tests/tar
$ ./inspect_tar.py ../../bazel-bin/tests/tar/test-pkg-tar-from-pkg-files-with-attributes.tar foo/bar/loremipsum.txt
uid 0
gid 0
uname '0'
gname '1000'
austin[169199] aschuh-3950x (main) ~/local/rules_pkg/tests/tar
$ ./inspect_tar.py ../../bazel-bin/tests/tar/test-pkg-tar-with-attributes.tar foo/bar/loremipsum.txt
uid 0
gid 1000
uname ''
gname ''

busybox tar is triggering off uid/gid, rather than uname/gname.

aiuto commented 1 year ago

Thanks for that report. I will confess that I don't know anything about busybox except that it is relevant in the big scheme of things. And... pkg_tar does some things with uid vs. uname, but it is just stuff, without a clear set of requirements.

Can you help define the requirements for how pkg_tar could produce things that are most widely consumable. I've been using tar forever, but basically only to untar with user perms so we ignore user information. I have not looked at #671 yet. I'll assume it is probably technically correct. What I would like to learn is what the right best practice for uid/uname is so that we can push that back into the documentation and/or the defaults for the rules.

AustinSchuh commented 1 year ago

@sarahn has much better opinions than me here :) I'll add the others on Monday.

aiuto commented 1 year ago

SGTM.....

nacl commented 1 year ago

pkg_files never supported numeric user/group ids, so it is not surprising that this is occurring. I would presume that Busybox is searching for the literal user/group 0/1000, which won't be present on any system. root/group-name-for-gid-1000 should presumably work, and theoretically be more portable.

I'm not immediately opposed to supporting directly passing uids/gids to pkg_files.

sarahn commented 1 year ago

root/group-name-for-gid-1000is hypothetically more portable, yes. However, with embedded systems and virtual machines, you might be extracting the tarball on a different system than is going to be booting it. Booting into another virtual system / chrooting to get the correct gid/group mapping can be a lot of work if that's the only reason you need to do it.

AustinSchuh commented 1 year ago

@nacl , I did some digging before going the uid/gid route, and it is in theory POSIX compliant to have a numeric username/group name, so auto detection is technically impossible. Some tools do auto-detection and mess this up, but most get it right. That suggests it may be a bad idea to use a numeric username, but it is technically compliant. It felt wrong for rules_pkg to have an opinion in that debate, hence the review. (I'm going to hold off on the review feedback until there's a bit more consensus here, but thanks for the prompt feedback!).

Stock tar does the following:

austin[169747] aschuh-3950x /tmp/foo
$ date > foo.txt
austin[169748] aschuh-3950x /tmp/foo
$ tar caf foo.tar foo.txt
austin[169749] aschuh-3950x /tmp/foo
$ tar tvf foo.tar
-rw-r--r-- austin/austin    32 2023-02-13 11:42 foo.txt
austin[169751] aschuh-3950x /tmp/foo                                                                                                    
$ python3                                                                                                                               
Python 3.9.2 (default, Feb 28 2021, 17:03:44)                                                                                            
[GCC 10.2.1 20210110] on linux                                                                                                           
Type "help", "copyright", "credits" or "license" for more information.                                                                   
>>> import tarfile                                                                                                                       
>>> t = tarfile.TarFile("foo.tar")            
>>> a = t.getmember("foo.txt")
>>> a.uid
1000
>>> a.gid
1000
>>> a.uname
'austin'
>>> a.gname
'austin'
>>>

It fills in both, and on extraction, tries the name first, then falls back to the uid/gid if the name doesn't exist (and extraction is done as root). https://serverfault.com/a/445504 lays out the logic pretty well.

I'm not 100% sure what the right behavior should be. I suspect most people won't care, and 0.0, root.root is the right answer for their needs.

For our use case which triggered this, we are trying to build a .tar of a root filesystem. Sarah has a good explanation above for why we need numeric IDs.

I did a quick survey of pkg_attributes users on github, and it looks like most users are only after permissions. The subset who set usernames are doing it for .deb packages, which should be uname/gname based so they can be installed on systems where the uid/gid aren't constant.

pkg_tar users are setting owner, but it is to either match the user in their container, or "not be root". Looks like most people don't really care.

I guess based on what I've been reading, we should recommend uname/gname for most cases, and uid/gid for "experts" doing more embedded/low level stuff. (And then document how gnu tar does it)

nacl commented 1 year ago

I'm thinking about the implementation, but my intuition is that given the esoteric nature of needing specific UIDs/GIDs, I would keep them out of the default manifest information but encode them as extra "optional" attributes. I haven't quite implemented the consumption of these yet, but we absolutely need this to share code with pkg_rpm.

@AustinSchuh:

I did some digging before going the uid/gid route, and it is in theory POSIX compliant to have a numeric username/group name, so auto detection is technically impossible.

Autodetection from what angle? pkg_attributes can effectively accept arbitrary information, as it rolls up everything it sees into JSON to be consumed by pkg_files and friends. Theoretically, autodetection could be done based on the difference between integer and string types, but given what I said above, this is probably not a good idea.

Setting both of them seems to be asking for all sorts of interesting edge cases.

Some tools do auto-detection and mess this up, but most get it right. That suggests it may be a bad idea to use a numeric username, but it is technically compliant.

Fair enough. I personally think that it is a rather bad idea myself

I'm not 100% sure what the right behavior should be. I suspect most people won't care, and 0.0, root.root is the right answer for their needs.

Neither am I, but most users indeed won't care. Unless you're extracting the files as a privileged user, user/group identity is ignored.

For our use case which triggered this, we are trying to build a .tar of a root filesystem. Sarah has a good explanation above for why we need numeric IDs.

For reference:

However, with embedded systems and virtual machines, you might be extracting the tarball on a different system than is going to be booting it. Booting into another virtual system / chrooting to get the correct gid/group mapping can be a lot of work if that's the only reason you need to do it.

Details of your use case aside, it seems like there are nontrivial recordkeeping pitfalls no matter how you do it. Regardless, the ask (UID/GID mappings) is valid.

I did a quick survey of pkg_attributes users on github, and it looks like most users are only after permissions. The subset who set usernames are doing it for .deb packages, which should be uname/gname based so they can be installed on systems where the uid/gid aren't constant.

Makes sense. We use pkg_rpm extensively and explicitly place individual user/group/permission settings all over the place.

pkg_tar users are setting owner, but it is to either match the user in their container, or "not be root". Looks like most people don't really care.

Indeed. I would guess that they are creating them for local user use.

I guess based on what I've been reading, we should recommend uname/gname for most cases

Agreed.

and uid/gid for "experts" doing more embedded/low level stuff. (And then document how gnu tar does it)

Seems like a good baseline, but at the same time there's the commonbsdtar and other varying versions of these tools.

aiuto commented 1 year ago

I'm not immediately opposed to supporting directly passing uids/gids to pkg_files.

Agreed.

Once you start looking across OSes, there is no consistent map beyond 0,0 == root,root, so I don't know what we could do that is generally useful.

Strawman changes to pkg_tar.

AustinSchuh commented 1 year ago

I'm thinking about the implementation, but my intuition is that given the esoteric nature of needing specific UIDs/GIDs, I would keep them out of the default manifest information but encode them as extra "optional" attributes. I haven't quite implemented the consumption of these yet, but we absolutely need this to share code with pkg_rpm.

Works for me. I did the minimal thing to get the point across, we can do better on round 2.

I did some digging before going the uid/gid route, and it is in theory POSIX compliant to have a numeric username/group name, so auto detection is technically impossible.

Autodetection from what angle? pkg_attributes can effectively accept arbitrary information, as it rolls up everything it sees into JSON to be consumed by pkg_files and friends. Theoretically, autodetection could be done based on the difference between integer and string types, but given what I said above, this is probably not a good idea.

Autodetection from the "is this a UID or a uname?" point of view.

Setting both of them seems to be asking for all sorts of interesting edge cases.

I mean, yes, it will, but we are setting a field in a header so the edge cases are from how the user uses it, not on the bazel side. Tar gives reasonable defaults and the ability to let the user do almost anything with those fields, so it seems like Bazel can/should allow for the same.

Seems like a good baseline, but at the same time there's the commonbsdtar and other varying versions of these tools.

Good point, it's worth capturing multiple implementations.

From Tony:

Strawman changes to pkg_tar.

  • If either of uname uid is unset, set the other to 0 or root as appropraite.
  • Same for gname, guid.
  • Optional.... if uname or gname is a number

    • silently use it as uid instead
    • use it as uid and warn you
    • fail unless uname is also set.

1 and 2 sound great. Since numeric user/groups are technically standards compliant, I'd be nervous about silently using it, or failing. Maybe something like:

if uname and/or gname is a number, and uid/gid is not set, warn and use it as uid? Or if uname/gname is a number, and uid/id is not set, just warn?

AustinSchuh commented 1 year ago

Gentle ping

aiuto commented 1 year ago

I've been OOO on vacation for the last 12 days. I'll look at this after I get through my backlog.

On Mon, Feb 27, 2023 at 2:59 PM Austin Schuh @.***> wrote:

Gentle ping

— Reply to this email directly, view it on GitHub https://github.com/bazelbuild/rules_pkg/issues/670#issuecomment-1446995744, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAXHHHGBGCEG4NVZNUAD2E3WZUBT3ANCNFSM6AAAAAAUYKF7J4 . You are receiving this because you commented.Message ID: @.***>

AustinSchuh commented 1 year ago

No worries, hope you had/have a good break!

On Mon, Feb 27, 2023 at 1:53 PM aiuto @.***> wrote:

I've been OOO on vacation for the last 12 days. I'll look at this after I get through my backlog.

On Mon, Feb 27, 2023 at 2:59 PM Austin Schuh @.***> wrote:

Gentle ping

— Reply to this email directly, view it on GitHub < https://github.com/bazelbuild/rules_pkg/issues/670#issuecomment-1446995744 , or unsubscribe < https://github.com/notifications/unsubscribe-auth/AAXHHHGBGCEG4NVZNUAD2E3WZUBT3ANCNFSM6AAAAAAUYKF7J4

. You are receiving this because you commented.Message ID: @.***>

— Reply to this email directly, view it on GitHub https://github.com/bazelbuild/rules_pkg/issues/670#issuecomment-1447159248, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACHZGZZIEKZ3EINAFZ77WTWZUO47ANCNFSM6AAAAAAUYKF7J4 . You are receiving this because you were mentioned.Message ID: @.***>