canonical / chisel

extract: Proper parent directory modes #74

Closed woky closed 10 months ago

woky commented 1 year ago

The purpose of parent directory handling in deb/extract.go is to create parent directories of requested paths with the same attributes (only mode as of now) that appear in the package tarball. However, the current implementation is not correct when glob paths are requested.

In what follows parent directory refers to a directory path that is not explicitly requested for extraction, but that is the parent of other paths that are requested for extraction, and so it is assumed to be implicitly requested for extraction.

Currently, the shouldExtract() function decides whether a package path should be extracted. It iterates over the requested paths and checks each one against the package path. A requested glob path must match the package path. A requested non-glob path must either equal the package path or have at least one target path whose parent is the package path.

There are two problems with this implementation:

1) It only checks whether a package path is the parent of any target path of a requested non-glob path. It does not, and probably even cannot, check whether it is the parent of a requested glob path.

2) It iterates over the map of requested paths for every package path, even though requested non-glob paths can be matched with a direct map lookup. And in each iteration, it checks whether a requested path is a glob by searching for wildcards in it.
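
To make both problems concrete, here is a minimal sketch of the old matching as described above. It is not the removed code: the name shouldExtractSketch, the layout of the requested map (requested path to target paths), and the use of path.Match as the glob matcher are all assumptions made for illustration, and directory paths are assumed to end with "/".

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// shouldExtractSketch mimics the old matching as described in the text.
// Hypothetical: requested maps each requested path to its target paths.
func shouldExtractSketch(requested map[string][]string, pkgPath string) bool {
	// Problem 2: every requested path is visited for every package path.
	for reqPath, targets := range requested {
		// Problem 2: the wildcard search is repeated on every iteration.
		if strings.ContainsAny(reqPath, "*?") {
			// path.Match is a stand-in for the real glob matcher.
			if ok, err := path.Match(reqPath, pkgPath); err == nil && ok {
				return true
			}
			// Problem 1: parents of a glob path are never considered.
			continue
		}
		if reqPath == pkgPath {
			return true
		}
		for _, target := range targets {
			// pkgPath is a directory that contains a non-glob target.
			if strings.HasSuffix(pkgPath, "/") && strings.HasPrefix(target, pkgPath) {
				return true
			}
		}
	}
	return false
}

func main() {
	requested := map[string][]string{
		"/usr/bin/hello": {"/usr/bin/hello"},
		"/etc/ssl/*":     nil,
	}
	fmt.Println(shouldExtractSketch(requested, "/usr/bin/")) // true: parent of a non-glob path
	fmt.Println(shouldExtractSketch(requested, "/etc/"))     // false: parent of a glob path is missed
}
```

In this sketch "/etc/" is never matched, so its mode from the tarball would not be applied even though "/etc/ssl/*" is requested.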

This commit addresses mainly the first problem, but it touches the second one as well.

Track the modes of directories as we encounter them in the tarball. Then, when creating a target path, create its missing parent directories with the modes recorded from the tarball, or with 0755 if they have not been recorded yet. In the latter case, the directory mode is also tracked so that the directory can be recreated with the proper mode if we find it in the tarball later. This algorithm works for both glob and non-glob requested paths.
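
The tracking described above could look roughly like the following sketch. It is only an illustration under stated assumptions, not the code in deb/extract.go: tarEntry, extractSketch and parentDirs are hypothetical names, the target file system is replaced by an in-memory map so that the example stays deterministic, and "recreating" a directory is modelled as overwriting its mode.

```go
package main

import (
	"fmt"
	"io/fs"
	"strings"
)

// tarEntry is a hypothetical stand-in for a tarball header.
// Directory paths end with "/", and Mode holds permission bits only.
type tarEntry struct {
	Path string
	Mode fs.FileMode
}

// parentDirs returns all parent directories of p, shallowest first,
// excluding the root. "/usr/bin/hello" gives "/usr/" and "/usr/bin/".
func parentDirs(p string) []string {
	var parents []string
	clean := strings.TrimSuffix(p, "/")
	for i := 1; i < len(clean); i++ {
		if clean[i] == '/' {
			parents = append(parents, clean[:i+1])
		}
	}
	return parents
}

// extractSketch returns the modes of all paths that would be created.
func extractSketch(entries []tarEntry, requested func(string) bool) map[string]fs.FileMode {
	tarDirModes := map[string]fs.FileMode{} // directory modes seen in the tarball so far
	created := map[string]fs.FileMode{}     // stands in for the target file system
	fallback := map[string]bool{}           // parents created with the 0755 fallback

	for _, e := range entries {
		if strings.HasSuffix(e.Path, "/") {
			tarDirModes[e.Path] = e.Mode
			if fallback[e.Path] {
				// Created earlier with 0755; apply the proper mode now.
				created[e.Path] = e.Mode
				delete(fallback, e.Path)
			}
		}
		if !requested(e.Path) {
			continue
		}
		for _, parent := range parentDirs(e.Path) {
			if _, ok := created[parent]; ok {
				continue
			}
			mode, seen := tarDirModes[parent]
			if !seen {
				mode = 0o755
				fallback[parent] = true
			}
			created[parent] = mode
		}
		created[e.Path] = e.Mode
	}
	return created
}

func main() {
	entries := []tarEntry{
		{"/etc/", 0o750},
		{"/etc/hosts", 0o644},
		{"/opt/app/data.txt", 0o644}, // appears before its parent directory
		{"/opt/app/", 0o700},
	}
	want := map[string]bool{"/etc/hosts": true, "/opt/app/data.txt": true}
	created := extractSketch(entries, func(p string) bool { return want[p] })
	fmt.Printf("%04o\n", uint32(created["/etc/"]))     // 0750, recorded before use
	fmt.Printf("%04o\n", uint32(created["/opt/"]))     // 0755, never in the tarball
	fmt.Printf("%04o\n", uint32(created["/opt/app/"])) // 0700, fixed up later
}
```

Because the requested check is passed in as a function, the same loop serves glob and non-glob requests alike. A real implementation would call os.Mkdir and os.Chmod where this sketch writes to the created map.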

Since the matching that was previously done in the shouldExtract() function is not used anymore, the function was removed. As part of this change, the requested glob paths are recorded in a dedicated list before the extraction begins, and that list is scanned only when the lookup among requested non-glob paths fails.
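
A minimal sketch of this two-step lookup follows. The matcher type and its methods are hypothetical names, and path.Match from the Go standard library is used only as a stand-in glob matcher whose wildcard semantics may differ from the ones chisel uses.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matcher separates requested paths once, before extraction starts.
type matcher struct {
	exact map[string]bool // requested non-glob paths
	globs []string        // requested glob paths
}

func newMatcher(requested []string) *matcher {
	m := &matcher{exact: map[string]bool{}}
	for _, p := range requested {
		// The wildcard search happens once per requested path.
		if strings.ContainsAny(p, "*?") {
			m.globs = append(m.globs, p)
		} else {
			m.exact[p] = true
		}
	}
	return m
}

// match tries the map lookup first and scans the glob list only on a miss.
func (m *matcher) match(pkgPath string) bool {
	if m.exact[pkgPath] {
		return true
	}
	for _, g := range m.globs {
		if ok, err := path.Match(g, pkgPath); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	m := newMatcher([]string{"/usr/bin/hello", "/usr/lib/*.so"})
	fmt.Println(m.match("/usr/bin/hello"))     // true, found by map lookup
	fmt.Println(m.match("/usr/lib/libfoo.so")) // true, found by glob scan
	fmt.Println(m.match("/usr/bin/other"))     // false
}
```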

We still match requested non-glob and glob paths separately. Ideally, we would use some kind of pattern matching on trees, something like a radix tree which also supports wildcards, but this commit is humble.

One consequence of this change is that when a requested optional path doesn't exist in the tarball, its parent directories are not created (unless they are also the parents of other requested paths). This new behavior is arguably more natural than the old one, where we created parent directories for non-existent paths, which seems to have been just an artifact of the implementation. One test had to be changed to reflect it.

Since we do not allow optional paths to be defined in slices, this change is visible only to callers of the deb.Extract() function. In chisel, these callers are the extract tests and the slicer. The slicer depended on the old behavior to create parent directories for non-extracted content paths: it included their direct parent directories in the requested paths. The behavior of chisel is preserved by changing the slicer to include all parent directories of non-extracted content paths in the requested paths, not just the direct ones.
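
The slicer change can be pictured with the sketch below, which collects every parent directory of a content path instead of only the direct one. The function name parentsToRequest is hypothetical and the code is an illustration of the idea, not the slicer's implementation; directory paths are assumed to end with "/".

```go
package main

import (
	"fmt"
	"strings"
)

// parentsToRequest returns all parent directories of the given content
// paths, shallowest first and without duplicates. The root is excluded.
func parentsToRequest(contentPaths []string) []string {
	seen := map[string]bool{}
	var parents []string
	for _, p := range contentPaths {
		clean := strings.TrimSuffix(p, "/")
		for i := 1; i < len(clean); i++ {
			if clean[i] != '/' {
				continue
			}
			parent := clean[:i+1]
			if !seen[parent] {
				seen[parent] = true
				parents = append(parents, parent)
			}
		}
	}
	return parents
}

func main() {
	paths := []string{"/var/lib/app/state", "/var/log/app.log"}
	fmt.Println(parentsToRequest(paths))
	// [/var/ /var/lib/ /var/lib/app/ /var/log/]
}
```

With only direct parents, the first path would contribute just "/var/lib/app/", and "/var/" and "/var/lib/" would no longer get their modes from the tarball under the new extraction logic.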

woky commented 1 year ago

This PR depends on #75.

letFunny commented 10 months ago

Per our offline discussion, I think our best way forward is to close this PR and discuss with @rebornplusplus to see if it is a priority right now, and how to amend the code here to make it compatible with the latest changes. We will take that discussion offline and we can always re-open the PRs or create new ones as we see fit.