Closed eejayes closed 5 months ago
Compatibliy with gnutar is not a goal we have been aiming for. We've done a lot of work to explicitly detect duplicate files because they are generally a sign that you didn't lay out the inputs correctly. Can you provide a more concrete example of what you are trying to with laying down multiple tar balls?
Also, can you explain how the allow_duplicates_with_different_content
attribute does not address your use case.
What I am trying to do is combine two tar files, where one of them is regarded as having overwrite priority for all path conflicts. Before going into an example, my understanding of allow_duplicates_with_different_content
is that the duplication corresponds to that occurring across dependencies, where the use case I consider is duplication within a pre-existing (dependency) archive. Also given that it's true
default means that this option is being exercised when the problem occurs, it does not seem to be a solution. Also given the document, it seems like this is a deprecated interface. However I have not analyzed it's behavior, and could be missing something.
I will try to illustrate the problem here. For some diversity in tar utilities, I also exemplify bsdtar
s behavior.
$ tar tvf archive_a.tar
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:30 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./jazz
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./rock
$ tar tvf archive_b.tar
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:34 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./rock
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./blues
$ tar --concatenate --file=archive_b.tar archive_a.tar
$ tar tvf archive_b.tar
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:34 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./rock
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./blues
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:30 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./jazz
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./rock
load("@rules_pkg//pkg:mappings.bzl", "pkg_files")
load("@rules_pkg//pkg:pkg.bzl", "pkg_tar")
pkg_files(
name = "gnutar_concatenated_archive",
srcs = [":archive_b.tar"],
)
pkg_tar(
name = "duplicate_file_demo",
deps = [":gnutar_concatenated_archive"],
)
$ bazel build //:duplicate_file_demo
$ tar tvf bazel-bin/duplicate_file_demo.tar
-rw-r--r-- 1283467350/1283478178 0 2024-04-12 08:34 ./rock
-rw-r--r-- 1283467350/1283478178 0 2024-04-12 08:34 ./blues
-rw-r--r-- 1283467350/1283478178 0 2024-04-12 08:30 ./jazz
$ mkdir inspect_pkgtar_output inspect_gnutar_output
$ tar xvf archive_b.tar -C inspect_gnu_tar_output/
$ tar xvf duplicate_file_demo.tar -C inspect_pkg_tar_output/
$ ls -l inspect_gnu_tar_output/
total 8
drwxr-xr-x 2 s0000184 linuxusers 4096 Apr 12 08:30 ./
drwxr-xr-x 4 s0000184 linuxusers 4096 Apr 12 09:20 ../
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:34 blues
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:30 jazz
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:30 rock
$ ls -l inspect_pkg_tar_output/
total 8
drwxr-xr-x 2 s0000184 linuxusers 4096 Apr 12 09:21 ./
drwxr-xr-x 4 s0000184 linuxusers 4096 Apr 12 09:20 ../
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:34 blues
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:30 jazz
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:34 rock
When unpacking the archive created by tar --concatenate
, for paths that are subject to conflict it is the last file, or most recently added file which prevails. This is what the overwrite use case requires. When unpacking the archive created by pkg_tar
, the first file in the archive, or in other words the oldest prevails.
While bsdtar
is different from gnutar
including it's CLI, the behavior is roughly the same. The same incoming archives are used in this example.
$ bsdtar c --file b_first.tar @archive_b.tar @archive_a.tar
$ tar tvf b_first.tar
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:34 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./rock
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./blues
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:30 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./jazz
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./rock
$ bsdtar c --file a_first.tar @archive_a.tar @archive_b.tar
$ tar tvf a_first.tar
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:30 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./jazz
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:30 ./rock
drwxr-xr-x s0000184/linuxusers 0 2024-04-12 08:34 ./
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./rock
-rw-r--r-- s0000184/linuxusers 0 2024-04-12 08:34 ./blues
$ mkdir extract_afirst extract_bfirst
$ bsdtar -x --file=b_first.tar -C extract_bfirst
$ ls -l extract_bfirst
total 0
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:34 blues
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:30 jazz
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:30 rock
$ bsdtar -x --file=a_first.tar -C extract_afirst
$ ls -l extract_afirst/
total 0
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:34 blues
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:30 jazz
-rw-r--r-- 1 s0000184 linuxusers 0 Apr 12 08:34 rock
Thanks for those examples. How about this proposed API
Add allow_dups_from_deps
as an attribute to pkg_tar.
deps
are processed in order, so users might need the buildifer no-sort-list option to prevent it from reordering things. With that behavior, all existing tests should pass without change. A new test could easily be added.
It would be best if you extended the verify_archive
test with a content
attribute. It would be a string keyed dict of path to content. Sort of like the verify_links
parameter. Then 'rock' could have different content in the order or not.
A possible future idea is an option to allow deps
processing before srcs
(implying allow_dups_from_deps
). That way you could start with a baseline .tar from somewhere else and overlay your own srcs
on top of it. WDYT?
FWIW.... this is the second time in a month where no-sort-list has come up in a context like this. buildozer needs fixing to shift that declaration to the rule author, not the end user. But this is beyond the scope here.
Yes that sounds like a good proposal, as it supports this new behavior and doesn't risk breaking the existing API. I don't know if it's a sensible use case for duplicates to be assigned to srcs
(i.e. in the same pkg_tar
instance). It's more likely that the contents change over time due to modifications. There your suggestion about processing deps
first comes into play. This is also good.
FWIW, this behavior originates from the tape drive technology and specification, e.g.: tape archive duplicates.
SGTM. Except my proposal about testing is not sufficient. You need to be able to specify that a file appears N times, and possibly what the content for each is.
FWIW, this behavior originates from the tape drive technology and specification, e.g.: tape archive duplicates.
Well.... if we want to get pedantic, I actually used tape drives before tar - and the pain of dealing with that never leave you. In those times, you usually had no file name at all on tapes. It was just data. More advanced schemes used a label at the beginning, specifying the name and sequence number (so you knew if it was tape 3 of a very large data set). The duplicate file name within the same physical unit was pretty much tar specific because most other backup schemes would rarely append into the existing set of files. Instead, they would skip to end of the written tape, and append a new set of data. That resulted in something like file1 EOF file2 EOF ... fileN EOF EOF fileX EOF, fileY EOF ... EOF EOF EOF. tar improved tape utilization by changing any single set of file EOF groups to a single file,file,file EOF group. You could put multiple tar outputs on a tape, but trying to back up over that last EOF and append into the previous one was a bit risky.
I like to remind people that MapReduce/Flume/Hadoops/Spark ... are not a recent invention. People dealt with data bigger than memory using tapes drives to read/process/sort/merge back in the 1950s.
Believed fixed by #850
When tar archives which contain duplicate file paths are assigned to deps of
pkg_tar
, the function removes all but the first occurring instance, and warns with "Duplicate file in archive picking first occurrence". This contrasts with the common GNU tar utility, where "tar allows you to have infinite number of files with the same name" according to document for--append
under The Five Advanced tar Operations, where unpacking of the archive will yield the last occurring instance of the file path.My use case involves doing tar archive augmentation through a series of build steps, where one of these operations is to overwrite files of an existing archive. It is not an option to unpack the archives, one over the other, and then repack the result, because new entries for parent directories of the files in the incoming archives will be created which state their permissions. This is because the state of the directories on the extraction target must be preserved, rather than set to those which were present in the environment where the augmentation took place. So if an archive with duplicate files is presented to
pkg_tar
where the one most recently added is meant to overwrite, it is actually the original, or first file, which is retained, and thus when unpacking the overwriting behavior is not achieved.I feel that pkg_tar's behavior in this case should be compatible with GNU tar, in order to support interoperability and leverage from common understanding that it has probably established.