bazelbuild / rules_pkg

Bazel rules for creating packages of many types (zip, tar, deb, rpm, ...)
Apache License 2.0
226 stars 175 forks source link

Duplicate file paths not accepted by pkg_tar #849

Closed eejayes closed 5 months ago

eejayes commented 7 months ago

When tar archives which contain duplicate file paths are assigned to deps of pkg_tar, the function removes all but the first occurring instance, and warns with "Duplicate file in archive picking first occurrence". This contrasts with the common GNU tar utility, where "tar allows you to have infinite number of files with the same name" according to document for --append under The Five Advanced tar Operations, where unpacking of the archive will yield the last occurring instance of the file path.

My use case involves doing tar archive augmentation through a series of build steps, where one of these operations is to overwrite files of an existing archive. It is not an option to unpack the archives, one over the other, and then repack the result, because new entries for parent directories of the files in the incoming archives will be created which state their permissions. This is because the state of the directories on the extraction target must be preserved, rather than set to those which were present in the environment where the augmentation took place. So if an archive with duplicate files is presented to pkg_tar where the one most recently added is meant to overwrite, it is actually the original, or first file, which is retained, and thus when unpacking the overwriting behavior is not achieved.

I feel that pkg_tar's behavior in this case should be compatible with GNU tar, in order to support interoperability and leverage from common understanding that it has probably established.

aiuto commented 7 months ago

Compatibliy with gnutar is not a goal we have been aiming for. We've done a lot of work to explicitly detect duplicate files because they are generally a sign that you didn't lay out the inputs correctly. Can you provide a more concrete example of what you are trying to with laying down multiple tar balls? Also, can you explain how the allow_duplicates_with_different_content attribute does not address your use case.

eejayes commented 7 months ago

What I am trying to do is combine two tar files, where one of them is regarded as having overwrite priority for all path conflicts. Before going into an example, my understanding of allow_duplicates_with_different_content is that the duplication corresponds to that occurring across dependencies, where the use case I consider is duplication within a pre-existing (dependency) archive. Also given that it's true default means that this option is being exercised when the problem occurs, it does not seem to be a solution. Also given the document, it seems like this is a deprecated interface. However I have not analyzed it's behavior, and could be missing something.

I will try to illustrate the problem here. For some diversity in tar utilities, I also exemplify bsdtars behavior.

Example "conflicting" input archives

$ tar tvf archive_a.tar
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:30 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./jazz
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./rock
$ tar tvf archive_b.tar
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:34 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./rock
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./blues

Concatenate with gnutar

$ tar --concatenate --file=archive_b.tar archive_a.tar
$ tar tvf archive_b.tar
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:34 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./rock
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./blues
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:30 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./jazz
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./rock

Propagate concatenated archive through pkg_tar

BUILD

load("@rules_pkg//pkg:mappings.bzl", "pkg_files")
load("@rules_pkg//pkg:pkg.bzl", "pkg_tar")

pkg_files(
    name = "gnutar_concatenated_archive",
    srcs = [":archive_b.tar"],
)

pkg_tar(
    name = "duplicate_file_demo",
    deps = [":gnutar_concatenated_archive"],
)

Commands

$ bazel build //:duplicate_file_demo
$ tar tvf bazel-bin/duplicate_file_demo.tar
-rw-r--r-- 1283467350/1283478178 0 2024-04-12 08:34 ./rock
-rw-r--r-- 1283467350/1283478178 0 2024-04-12 08:34 ./blues
-rw-r--r-- 1283467350/1283478178 0 2024-04-12 08:30 ./jazz

Compare unpacked results of gnutar and pkg_tar

$ mkdir inspect_pkgtar_output inspect_gnutar_output
$ tar xvf archive_b.tar -C inspect_gnu_tar_output/
$ tar xvf duplicate_file_demo.tar -C inspect_pkg_tar_output/
$ ls -l inspect_gnu_tar_output/
total 8
drwxr-xr-x 2 s0000184 linuxusers 4096 Apr 12 08:30 ./
drwxr-xr-x 4 s0000184 linuxusers 4096 Apr 12 09:20 ../
-rw-r--r-- 1 s0000184 linuxusers    0 Apr 12 08:34 blues
-rw-r--r-- 1 s0000184 linuxusers    0 Apr 12 08:30 jazz
-rw-r--r-- 1 s0000184 linuxusers    0 Apr 12 08:30 rock
$ ls -l inspect_pkg_tar_output/
total 8
drwxr-xr-x 2 s0000184 linuxusers 4096 Apr 12 09:21 ./
drwxr-xr-x 4 s0000184 linuxusers 4096 Apr 12 09:20 ../
-rw-r--r-- 1 s0000184 linuxusers    0 Apr 12 08:34 blues
-rw-r--r-- 1 s0000184 linuxusers    0 Apr 12 08:30 jazz
-rw-r--r-- 1 s0000184 linuxusers    0 Apr 12 08:34 rock

When unpacking the archive created by tar --concatenate, for paths that are subject to conflict it is the last file, or most recently added file which prevails. This is what the overwrite use case requires. When unpacking the archive created by pkg_tar, the first file in the archive, or in other words the oldest prevails.

bsdtar concatenate and unpack behavior

While bsdtar is different from gnutar including it's CLI, the behavior is roughly the same. The same incoming archives are used in this example.

$ bsdtar c --file b_first.tar @archive_b.tar @archive_a.tar
$ tar tvf b_first.tar 
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:34 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./rock
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./blues
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:30 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./jazz
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./rock
$ bsdtar c --file a_first.tar @archive_a.tar @archive_b.tar
$ tar tvf a_first.tar 
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:30 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./jazz
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./rock
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:34 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./rock
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./blues

$ mkdir extract_afirst extract_bfirst

$ bsdtar -x --file=b_first.tar -C extract_bfirst
$ ls -l extract_bfirst
total 0
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:34 blues
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:30 jazz
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:30 rock

$ bsdtar -x --file=a_first.tar -C extract_afirst
$ ls -l extract_afirst/
total 0
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:34 blues
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:30 jazz
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:34 rock
aiuto commented 7 months ago

Thanks for those examples. How about this proposed API

Add allow_dups_from_deps as an attribute to pkg_tar.

With that behavior, all existing tests should pass without change. A new test could easily be added. It would be best if you extended the verify_archive test with a content attribute. It would be a string keyed dict of path to content. Sort of like the verify_links parameter. Then 'rock' could have different content in the order or not.

A possible future idea is an option to allow deps processing before srcs (implying allow_dups_from_deps). That way you could start with a baseline .tar from somewhere else and overlay your own srcs on top of it. WDYT?

FWIW.... this is the second time in a month where no-sort-list has come up in a context like this. buildozer needs fixing to shift that declaration to the rule author, not the end user. But this is beyond the scope here.

eejayes commented 6 months ago

Yes that sounds like a good proposal, as it supports this new behavior and doesn't risk breaking the existing API. I don't know if it's a sensible use case for duplicates to be assigned to srcs (i.e. in the same pkg_tar instance). It's more likely that the contents change over time due to modifications. There your suggestion about processing deps first comes into play. This is also good.

FWIW, this behavior originates from the tape drive technology and specification, e.g.: tape archive duplicates.

aiuto commented 6 months ago

SGTM. Except my proposal about testing is not sufficient. You need to be able to specify that a file appears N times, and possibly what the content for each is.

FWIW, this behavior originates from the tape drive technology and specification, e.g.: tape archive duplicates.

Well.... if we want to get pedantic, I actually used tape drives before tar - and the pain of dealing with that never leave you. In those times, you usually had no file name at all on tapes. It was just data. More advanced schemes used a label at the beginning, specifying the name and sequence number (so you knew if it was tape 3 of a very large data set). The duplicate file name within the same physical unit was pretty much tar specific because most other backup schemes would rarely append into the existing set of files. Instead, they would skip to end of the written tape, and append a new set of data. That resulted in something like file1 EOF file2 EOF ... fileN EOF EOF fileX EOF, fileY EOF ... EOF EOF EOF. tar improved tape utilization by changing any single set of file EOF groups to a single file,file,file EOF group. You could put multiple tar outputs on a tape, but trying to back up over that last EOF and append into the previous one was a bit risky.

I like to remind people that MapReduce/Flume/Hadoops/Spark ... are not a recent invention. People dealt with data bigger than memory using tapes drives to read/process/sort/merge back in the 1950s.

aiuto commented 5 months ago

Believed fixed by #850