bazelbuild / rules_pkg

Bazel rules for creating packages of many types (zip, tar, deb, rpm, ...)
Apache License 2.0
212 stars 166 forks source link

question about file ordering in pkg_tar #828

Open lukedirtwalker opened 4 months ago

lukedirtwalker commented 4 months ago

I'm wondering about the file order in pkg_tar. When I create a normal tar file via the linux command line the order of files in the tar is like specified on the command line:

touch {a,b}.txt; tar -cvf test.tar {b,a}.txt;  tar -tvf test.tar
# output
-rw-rw-r-- luke/luke         0 2024-03-07 08:29 b.txt
-rw-rw-r-- luke/luke         0 2024-03-07 08:29 a.txt

However if I do something similar with rules_pkg this doesn't apply:

# BUILD.bazel
load("@rules_pkg//:pkg.bzl", "pkg_tar")
pkg_tar(name = "a")
pkg_tar(name = "b")
pkg_tar(
    name = "test",
    srcs = [":b", ":a"],
)

# building and testing:
bazel build //:test; tar -tvf bazel-bin/test.tar
-r-xr-xr-x 0/0           10240 2000-01-01 01:00 a.tar
-r-xr-xr-x 0/0           10240 2000-01-01 01:00 b.tar

Is there any way to make this ordered the same as the srcs are specified?

One workaround I found is to pack a and b into an extra tar and then specify via deps:

pkg_tar(name = "a")
pkg_tar(name = "b")
pkg_tar(name = "a_tar", srcs = ["a"])
pkg_tar(name = "b_tar", srcs = ["b"])
pkg_tar(
    name = "test2",
    deps = [":b_tar", ":a_tar"],
)

# building and testing:
bazel build //:test2; tar -tvf bazel-bin/test2.tar
-r-xr-xr-x 0/0           10240 2000-01-01 01:00 b.tar
-r-xr-xr-x 0/0           10240 2000-01-01 01:00 a.tar

however this seems a bit cumbersome.

I assume the order gets lost here when we assign to the map https://github.com/bazelbuild/rules_pkg/blob/61132feb47add139967476401baa49cf4b7c6b2a/pkg/private/pkg_files.bzl#L493 ?

is that the expected/desired behavior?

aiuto commented 4 months ago

The intended behavior is that files are sorted by final location path rather than ordered by srcs.

Your example of using deps looks like a bug. In that case it should have unpacked the tarballs rather than including the entire file intact. But, let's not get distracted by that. Let's assume it was fixed and the files come out as b, a. The reason they do that is not from a true requirement. It is a side effect of the easiest implementation. While writing the result tarball, we open up b.tar and read individual files in and place them. It's fast, but has some footguns if a.tar and b.tar have intermixed file paths. If I had infinite time, I would like to build a pkg_untar() rule that let you specify a tarball as input, provide filtering and path remapping, and have that as input to pkg_tar. That could provide high performance with more flexibility. But that is a different feature request than yours.

We could make an FR for treating srcs in order via another attribute. There might even be someone willing to build it. It should not be that hard, but it does require some strict definition about the corner cases. There are several potential ones around auto-creating directories for something early in srcs when an explicit create with an owner and perms happens later. We could start with it failing hard if that happens. You could work around it by pre-creating them with pkg_dirs earlier in the list. In the more complex case, we would have yet another option to have it ignore the problem.

lukedirtwalker commented 4 months ago

Your example of using deps looks like a bug. In that case it should have unpacked the tarballs rather than including the entire file intact.

Note that I do double taring (just because I was to lazy to use genrule or assume external files are around), I think it's therefore correct that the final result contains tars.

In anycase I would consider to invest some time to build that feature (putting srcs in order). Would you have some ideas on how to name the attribute? I could draft a PR.

aiuto commented 4 months ago

Note that I do double taring ... Oh. I see what you did. That's a confusing sample.

Would you have some ideas on how to name the attribute? Maybe emit_srcs_in_list_order?

FWIW, you are going to have to fight with the rest of the ecosystem on that. Many projects use buildifier, which will alpha sort label lists like srcs. My hunch is that eventually you will find yourself modifying buildifier to suppress the sort on pkg_tar.

lukedirtwalker commented 4 months ago

Ah good point about buildifier. We use that as well. Hmm I'll have to think more about this.

lukedirtwalker commented 3 months ago

So I wonder if we should add an srcs_ordered attribute:

pkg_tar(
    name = "test-tar-ordered",
    srcs = [
        "//tests:testdata/config",
        "//tests:testdata/hello.txt",
        "//tests:testdata/loremipsum.txt",
    ],
    srcs_ordered = {
        "1": "//tests:testdata/hello.txt",
        "2": "//tests:testdata/loremipsum.txt",
        "3": "//tests:testdata/config",
    },
)

The semantics would be that the keys determine the order of the values. And the values need to match the srcs attribute (this would be checked). Would that make sense? That would be compatible with buildifier.

I'm not 100% how to integrate this in the _MappingContext, but probably this would also need some order notation added and it would need to be handled correctly in the tar binary.

aiuto commented 3 months ago

Please no. Unless you want to write an extensive specification for how the two will merge and how they are sorted. E.g. what is the output order of "1", "11", "2"? Obviously unchanged, because then you can insert a new thing in the middle of an existing list. Right? Oh, someone expects it strict numeric, so they try "1", "1.5", "2" and then all the code assuming ints breaks. Sorting is very easy, yet very hard do do in a way that seems right to everyone.

lukedirtwalker commented 3 months ago

Fair enough. I see that's problematic. I wonder how we could approach this that is buildifier friendly and not weird in some other way.

lukedirtwalker commented 3 months ago

Ah actually adding a # do not sort comment on the arguments prevents buildifier from sorting the lists. So we can simply have srcs with the comment # do not sort and a property emit_srcs_in_list_order = True, like so:

pkg_tar(
    name = "test-tar-ordered",
    srcs = [
       # do not sort
        "//tests:testdata/hello.txt",
        "//tests:testdata/loremipsum.txt",
        "//tests:testdata/config",
    ],
    emit_srcs_in_list_order = True,
)

Sorry I wasn't aware that it was that easy to disable the buildifier sorting, that's why I proposed the weird dict approach 🙈

aiuto commented 3 months ago

Yup. # do not sort is a viable tool for people needing this. I was never a fan of buildifier sorting things by default. Lists of things belong in the order people specified them. Text editors can sort if you want it.