anchore / syft

CLI tool and library for generating a Software Bill of Materials from container images and filesystems
Apache License 2.0

Support file ownership when using file source #3345

Open adammcclenaghan opened 1 month ago

adammcclenaghan commented 1 month ago

What would you like to be added: Today, some of the catalogers support the concept of 'File Ownership', specifically the catalogers that implement the FileOwner interface.

For example, if I scan my DPKG directory using a directory source, the artifact metadata lists which files are owned by each package in my DPKG installation. Take curl as an example:

syft -o syft-json dir:/var/lib/dpkg | jq '.artifacts[] | select(.name == "curl") | .metadata.files'

[
  {
    "path": "/usr/bin/curl",
    "digest": {
      "algorithm": "md5",
      "value": "fb9a88e8023f2fb2a0f475d1c85d8dcb"
    },
    "isConfigFile": false
  },
  {
    "path": "/usr/share/doc/curl/copyright",
    "digest": {
      "algorithm": "md5",
      "value": "39782ccc3532fee98360f19e317c6707"
    },
    "isConfigFile": false
  },
  {
    "path": "/usr/share/man/man1/curl.1.gz",
    "digest": {
      "algorithm": "md5",
      "value": "1326b53b4e64bf16ed6558a94496a0e8"
    },
    "isConfigFile": false
  },
  {
    "path": "/usr/share/zsh/vendor-completions/_curl",
    "digest": {
      "algorithm": "md5",
      "value": "1fe4ab18bfb8fe595c42534a37ab27a3"
    },
    "isConfigFile": false
  }
]

However, when scanning with a file source, we see no file metadata associated with the DPKG installation:

syft -o syft-json file:/var/lib/dpkg/status | jq '.artifacts[] | select(.name == "curl")'

{
  "id": "768c7f6773e9852e",
  "name": "curl",
  "version": "7.81.0-1ubuntu1.18",
  "type": "deb",
  "foundBy": "dpkg-db-cataloger",
  "locations": [
    {
      "path": "/status",
      "accessPath": "/status",
      "annotations": {
        "evidence": "primary"
      }
    }
  ],
  "licenses": [],
  "language": "",
  "cpes": [
    {
      "cpe": "cpe:2.3:a:curl:curl:7.81.0-1ubuntu1.18:*:*:*:*:*:*:*",
      "source": "syft-generated"
    }
  ],
  "purl": "",
  "metadataType": "dpkg-db-entry",
  "metadata": {
    "package": "curl",
    "source": "",
    "version": "7.81.0-1ubuntu1.18",
    "sourceVersion": "",
    "architecture": "amd64",
    "maintainer": "Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>",
    "installedSize": 444,
    "depends": [
      "libc6 (>= 2.34)",
      "libcurl4 (= 7.81.0-1ubuntu1.18)",
      "zlib1g (>= 1:1.1.4)"
    ],
    "files": []
  }
}

This makes sense, since using a file source causes the file resolver to index only the target file and its containing directory. So when the DPKG cataloger tries to resolve the 'Infos' directory after parsing the DPKG DB, the index contains no entries for it, and the cataloger fails to resolve the file ownership metadata.
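To make the failure mode concrete, here is a toy model in Python. It is not syft's resolver: `indexed_paths`, `md5sums_path`, and the `FILESYSTEM` list are hypothetical names invented for illustration, and the `info/<package>.md5sums` layout next to the status file is an assumption based on the usual DPKG directory layout.

```python
from pathlib import PurePosixPath

# Hypothetical, trimmed-down view of a DPKG installation on disk.
FILESYSTEM = [
    "/var/lib/dpkg/status",
    "/var/lib/dpkg/info/curl.md5sums",
    "/var/lib/dpkg/info/curl.list",
]


def indexed_paths(source_kind, target, filesystem):
    """Return the paths a toy resolver can see for a given source."""
    target = PurePosixPath(target)
    if source_kind == "dir":
        # Directory source: everything below the target directory is indexed.
        return {p for p in filesystem if target in PurePosixPath(p).parents}
    # File source: only the target file and its containing directory.
    return {str(target), str(target.parent)}


def md5sums_path(status_path, package):
    """Assumed location of a package's md5sums file, relative to the status file."""
    return str(PurePosixPath(status_path).parent / "info" / f"{package}.md5sums")


def can_resolve_ownership(source_kind, target, package):
    """True when the ownership file for `package` is present in the index."""
    wanted = md5sums_path("/var/lib/dpkg/status", package)
    return wanted in indexed_paths(source_kind, target, FILESYSTEM)


print(can_resolve_ownership("dir", "/var/lib/dpkg", "curl"))
print(can_resolve_ownership("file", "/var/lib/dpkg/status", "curl"))
```

With the directory source the lookup succeeds; with the file source the `info` entries were never indexed, so the lookup comes back empty even though the files exist on disk.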

However, as a user, I do not know that I have missing metadata here unless I go and read the cataloger implementation and understand that it requires more than the scanned file to correctly populate its results.
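As a stopgap, a user can at least detect the symptom from the SBOM itself. The Python sketch below relies only on the syft-json fields visible in the output above (`artifacts`, `name`, `metadataType`, `metadata.files`); the function name `debs_without_owned_files` and the two trimmed samples are mine. An empty `files` list is only a hint, since I am assuming that a complete scan normally yields at least one owned file per package.

```python
import json


def debs_without_owned_files(sbom):
    """Return names of DPKG artifacts whose metadata lists no owned files."""
    names = []
    for artifact in sbom.get("artifacts", []):
        if artifact.get("metadataType") != "dpkg-db-entry":
            continue
        metadata = artifact.get("metadata") or {}
        if not metadata.get("files"):
            names.append(artifact["name"])
    return sorted(names)


# Trimmed from the file-source output in this issue.
FILE_SCAN = json.loads("""
{"artifacts": [{"name": "curl", "metadataType": "dpkg-db-entry",
                "metadata": {"package": "curl", "files": []}}]}
""")

# Trimmed from the directory-source output in this issue.
DIR_SCAN = json.loads("""
{"artifacts": [{"name": "curl", "metadataType": "dpkg-db-entry",
                "metadata": {"package": "curl",
                             "files": [{"path": "/usr/bin/curl"}]}}]}
""")

print(debs_without_owned_files(FILE_SCAN))
print(debs_without_owned_files(DIR_SCAN))
```

In practice the same function could be fed `json.load(sys.stdin)` with the `syft -o syft-json ...` output piped in.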

I would like to start a discussion here regarding how feasible it would be to make catalogers 'aware' of the fact that they require more than one file to successfully perform all of their work.

In the case of DPKG for example, if it knows that we're scanning using a file source, it could then perform a 'second pass' and attempt to index the Infos or status.d directories used to determine file ownership so that the resolver passed to findDpkgInfoFiles can find owned files despite using a file source.
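A rough Python sketch of what such a 'second pass' could compute, purely to anchor the discussion. Nothing here exists in syft: `second_pass_dirs` and the idea of a cataloger declaring companion directory names are hypothetical, and `info` is my assumption for the on-disk name of the directory referred to above as 'Infos'.

```python
from pathlib import PurePosixPath


def second_pass_dirs(target, companion_names, exists):
    """Return sibling directories of `target` that a second pass could index.

    `companion_names` is a hypothetical per-cataloger declaration of the
    directories it needs next to the scanned file. `exists` is a callable
    that reports whether a path is present, injected so the sketch does not
    touch the real filesystem.
    """
    parent = PurePosixPath(target).parent
    candidates = [str(parent / name) for name in companion_names]
    return [path for path in candidates if exists(path)]


# Hypothetical on-disk state: only the info directory is present.
present = {"/var/lib/dpkg/info"}

extra = second_pass_dirs(
    "/var/lib/dpkg/status",
    ["info", "status.d"],
    present.__contains__,
)
print(extra)
```

The resolver would then extend its index with the returned directories before the cataloger looks up owned files. Whether that indexing belongs in the source, the resolver, or the cataloger is the open design question.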

Why is this needed: When I scan with a file source, I'd like the catalogers to provide me with complete results even when a suitable cataloger requires more than one file to perform its work.

wagoodman commented 3 weeks ago

This is a specific (useful) example of an existing higher-level issue https://github.com/anchore/syft/issues/3213