anchore / syft

CLI tool and library for generating a Software Bill of Materials from container images and filesystems
Apache License 2.0

Support scanning filesystems without building an index #3145

Open ariel-miculas opened 3 weeks ago

ariel-miculas commented 3 weeks ago

What would you like to be added: My use case is to scan my host filesystem and get the Linux distro information along with the host packages and their versions. Unfortunately, indexing the entire filesystem takes too much time (~7 minutes to call directorySource.FileResolver from directory_source.go). I would like to skip the indexing step and directly read only the files required by the OS cataloger. I noticed there is an unindexed_directory.go file in the fileresolver package, but that's an internal package, so its functions cannot be used from outside. Is there a plan to expose the functionality from unindexed_directory.go as a public API? Or is there another way to speed up the scanning process or avoid indexing the entire filesystem beforehand?

Why is this needed: This would bring a significant performance improvement to filesystem scanning, especially when the filesystem is very large and building the entire index takes too much time.


wagoodman commented 3 weeks ago

It turns out that the unindexed approach is much slower on large filesystems than indexing the filesystem once. This is because using globs like **/package.json (and similar) means we need to have total knowledge of the paths on the system, or pay the penalty of making several syscalls to find this out (...and we pay that cost for each glob searched for, which for syft is a lot!).

That being said, there are probably ways forward here. Some high-level thoughts:
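
To make the cost argument concrete, here is a minimal sketch (not syft's code) that compares walking a tree once per pattern with walking it once and answering every pattern from memory. It uses an in-memory filesystem from the Go standard library; `demoFS`, `walkAll`, `searchUnindexed` and `searchIndexed` are hypothetical names, and each pattern is a basename standing in for a `**/<name>` glob.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// demoFS returns a tiny in-memory tree used only for illustration.
func demoFS() fstest.MapFS {
	return fstest.MapFS{
		"app/package.json":                {Data: []byte("{}")},
		"app/node_modules/x/package.json": {Data: []byte("{}")},
		"srv/go.mod":                      {Data: []byte("module srv")},
		"etc/os-release":                  {Data: []byte("ID=demo")},
	}
}

// walkAll visits every entry under the root and returns how many it saw.
func walkAll(fsys fs.FS, visit func(p string, d fs.DirEntry)) int {
	n := 0
	_ = fs.WalkDir(fsys, ".", func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return nil // skip unreadable entries
		}
		n++
		visit(p, d)
		return nil
	})
	return n
}

// matches reports whether the last element of p matches the basename pattern.
func matches(pattern, p string) bool {
	ok, _ := path.Match(pattern, path.Base(p))
	return ok
}

// searchUnindexed walks the whole tree once for every pattern.
func searchUnindexed(fsys fs.FS, patterns []string) (map[string][]string, int) {
	out, visited := map[string][]string{}, 0
	for _, pat := range patterns {
		visited += walkAll(fsys, func(p string, d fs.DirEntry) {
			if !d.IsDir() && matches(pat, p) {
				out[pat] = append(out[pat], p)
			}
		})
	}
	return out, visited
}

// searchIndexed walks the tree once, then matches every pattern in memory.
func searchIndexed(fsys fs.FS, patterns []string) (map[string][]string, int) {
	var files []string
	visited := walkAll(fsys, func(p string, d fs.DirEntry) {
		if !d.IsDir() {
			files = append(files, p)
		}
	})
	out := map[string][]string{}
	for _, pat := range patterns {
		for _, f := range files {
			if matches(pat, f) {
				out[pat] = append(out[pat], f)
			}
		}
	}
	return out, visited
}

func main() {
	patterns := []string{"package.json", "go.mod"}
	_, perGlob := searchUnindexed(demoFS(), patterns)
	_, once := searchIndexed(demoFS(), patterns)
	fmt.Println("entries visited, one walk per glob:", perGlob)
	fmt.Println("entries visited, single index walk:", once)
}
```

The per-glob variant visits the tree once for each pattern, so its cost grows with the number of globs, while the indexed variant pays for a single walk regardless of how many patterns follow.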

  1. Build partial indexes. The main idea is to be able to reference things outside of what is being cataloged, but not pay the cost of indexing the entire search space. This seems like it would be helpful with #2574. The downside is that there are a lot of unknowns about how this would work.
  2. Extend the index on-the-fly as-needed. This sounds good at first glance, but actually isn't really an option since the first ** glob that is used would require that the full search space is indexed.
  3. Speed up indexing with concurrency. @kzantow is working on that! It's not ready, so we'll keep exploring options here.
  4. [most promising] Allow the user to opt into non-** globs only and use the unindexed resolver. We'd want to be loud in terms of logging/UI when a ** glob is used while in an "unindexed" mode. This has a few variants, such as:
     a. for select catalogers (primarily OS ones), have hard-coded absolute paths to common DB locations (inflexible but simple)
     b. expose glob-level configuration for all catalogers, allowing the user to extend or override the globs that are searched for (flexible but potentially dangerous)
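
As a rough illustration of option 4, the sketch below separates globs that can be answered by direct lookups from those containing `**`, and warns about the latter. This is only an assumption about how such a mode could behave, not an existing syft feature; `needsFullIndex` and `splitGlobs` are hypothetical names, and the example paths other than `**/package.json` are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// needsFullIndex reports whether a glob uses the recursive "**" wildcard,
// which cannot be answered without knowing every path under the root.
func needsFullIndex(glob string) bool {
	return strings.Contains(glob, "**")
}

// splitGlobs separates globs that can be resolved by direct lookups
// from those that would force a full index.
func splitGlobs(globs []string) (direct, recursive []string) {
	for _, g := range globs {
		if needsFullIndex(g) {
			recursive = append(recursive, g)
		} else {
			direct = append(direct, g)
		}
	}
	return direct, recursive
}

func main() {
	// Illustrative globs only; real cataloger globs may differ.
	globs := []string{"/etc/os-release", "/var/lib/dpkg/status", "**/package.json"}
	direct, recursive := splitGlobs(globs)
	for _, g := range recursive {
		fmt.Printf("WARN unindexed mode: skipping recursive glob %q\n", g)
	}
	fmt.Println("resolving directly:", direct)
}
```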

Assuming that we'll land 3 in the long term and 4 in the near term, I think 4 has the most promise, and it really comes down to: how can we provide that functionality while keeping it simple, flexible and safe?

ariel-miculas commented 3 weeks ago

Thanks for your input! 4a seems the most promising start, since the OS catalogers already have parts of their paths to common DB locations hard-coded. It would also nicely fit my use case.

If I want the 4a feature implemented, what's the best course of action? Should I implement it myself and send a PR, or is someone else interested in working on it?

Btw, what's the use case for searching in every path with ** globs? Is it to find packages in containers whose root file systems are extracted somewhere on the disk?

popey commented 3 weeks ago

We discussed this issue on our recent team live stream. One question that came up was whether you were aware of the dir option that syft has, for example:

syft scan dir:/var/lib/dpkg/
 ✔ Indexed file system                                                                                                                           /var/lib/dpkg
 ✔ Cataloged contents                                                                         7b17bf219a79afea6fe9e2246f855b09a89f50e773a06e231d2eb91ecad88359
   ├── ✔ Packages                        [5,259 packages]
   └── ✔ Executables                     [0 executables]
[0000]  WARN no explicit name and version provided for directory source, deriving artifact ID from the given path (which is not ideal)
NAME                                                         VERSION                                      TYPE
7zip                                                         23.01+dfsg-11                                deb
aardvark-dns                                                 1.4.0-5                                      deb
accountsservice                                              23.13.9-2ubuntu6                             deb
acl                                                          2.3.2-1build1                                deb
acpi                                                         1.7-1.3build1                                deb
acpid                                                        1:2.0.34-1ubuntu2                            deb
adb                                                          1:34.0.4-1build3                             deb
adduser                                                      3.137ubuntu1                                 deb
adwaita-icon-theme                                           46.0-1                                       deb
afnix                                                        3.8.0-1                                      deb
aglfn                                                        1.7+git20191031.4036a9c-2                    deb
aha                                                          0.5.1-3build1                                deb
aisleriot                                                    1:3.22.31-1build2                            deb
algol68g                                                     3.1.2-1                                      deb

etc.

If you know the distro (and, better, the package manager and its index location), perhaps simply using syft scan dir:/path/to/package/index would be much faster, as it won't require a full filesystem index.

This isn't to say we're not looking at the other options from the earlier post; I just wanted to check.

ariel-miculas commented 3 weeks ago

Well, I would have hoped that syft would do the distro identification for me, and also the package manager identification and the related databases. Otherwise, I end up duplicating the logic that syft already has, which is not ideal. One thing that I did try was scanning /etc and hoping to get the distro information from syft, but that doesn't work:

$ syft scan -o syft-json /etc | jq '.distro'
 ✔ Indexed file system                                                                                                                                                                                                                /etc
 ✔ Cataloged contents                                                                                                                                                     2824684de3d1a19390ca88cf826e77c6f750657e552edb83d466666c37521a08
   ├── ✔ Packages                        [0 packages]
   └── ✔ Executables                     [0 executables]
[0000]  WARN no explicit name and version provided for directory source, deriving artifact ID from the given path (which is not ideal)
[0000]  WARN unable to access path="/etc/cups/ssl": open /etc/cups/ssl: permission denied
[0000]  WARN unable to access path="/etc/libvirt/secrets": open /etc/libvirt/secrets: permission denied
[0000]  WARN unable to access path="/etc/lvm/archive": open /etc/lvm/archive: permission denied
[0000]  WARN unable to access path="/etc/lvm/backup": open /etc/lvm/backup: permission denied
[0000]  WARN unable to access path="/etc/lvm/cache": open /etc/lvm/cache: permission denied
[0000]  WARN unable to access path="/etc/opt/duo/duo-desktop/https": open /etc/opt/duo/duo-desktop/https: permission denied
[0000]  WARN unable to access path="/etc/opt/duo/duo-desktop/localdata": open /etc/opt/duo/duo-desktop/localdata: permission denied
[0000]  WARN unable to access path="/etc/polkit-1/localauthority": open /etc/polkit-1/localauthority: permission denied
[0000]  WARN unable to access path="/etc/ssl/private": open /etc/ssl/private: permission denied
[0000]  WARN unable to access path="/etc/sudoers.d": open /etc/sudoers.d: permission denied
[0000]  WARN cataloger failed cataloger=linux-kernel-cataloger error=unable to get magic type for file: EOF location=/kernel-img.conf
{}

Maybe the issue is that /etc/os-release is a symlink:

$ ls -la /etc/os-release
lrwxrwxrwx 1 root root 21 feb 14  2024 /etc/os-release -> ../usr/lib/os-release

So I thought I'd just scan /usr/lib, but that fails to identify the distro as well:

$ syft scan -o syft-json /usr/lib | jq '.distro'
 ✔ Indexed file system                                                                                                                                                                                                            /usr/lib
 ✔ Cataloged contents                                                                                                                                                     5289cd0f221ee1de420da77318a67f1263ad9bd146c64a9b41544e06ed6678f8
   ├── ✔ Packages                        [15,785 packages]
   └── ✔ Executables                     [8,521 executables]
[0000]  WARN no explicit name and version provided for directory source, deriving artifact ID from the given path (which is not ideal)
[0003]  WARN cataloger failed cataloger=dotnet-deps-cataloger error=unable to determine root package from deps.json file: /dotnet/sdk/8.0.108/FSharp/fsc.deps.json location=/dotnet/sdk/8.0.108/FSharp/fsc.deps.json
[0003]  WARN cataloger failed cataloger=dotnet-deps-cataloger error=unable to determine root package from deps.json file: /dotnet/sdk/8.0.108/FSharp/fsi.deps.json location=/dotnet/sdk/8.0.108/FSharp/fsi.deps.json
[0003]  WARN cataloger failed cataloger=dotnet-deps-cataloger error=unable to determine root package from deps.json file: /dotnet/sdk/8.0.108/MSBuild.deps.json location=/dotnet/sdk/8.0.108/MSBuild.deps.json
[0003]  WARN cataloger failed cataloger=dotnet-deps-cataloger error=unable to determine root package from deps.json file: /dotnet/sdk/8.0.108/NuGet.CommandLine.XPlat.deps.json location=/dotnet/sdk/8.0.108/NuGet.CommandLine.XPlat.deps.js
[0003]  WARN cataloger failed cataloger=dotnet-deps-cataloger error=unable to determine root package from deps.json file: /dotnet/sdk/8.0.108/dotnet.deps.json location=/dotnet/sdk/8.0.108/dotnet.deps.json
[0003]  WARN cataloger failed cataloger=dotnet-deps-cataloger error=unable to determine root package from deps.json file: /dotnet/shared/Microsoft.AspNetCore.App/8.0.8/Microsoft.AspNetCore.App.deps.json location=/dotnet/shared/Microsoft
[0003]  WARN cataloger failed cataloger=dotnet-deps-cataloger error=unable to determine root package from deps.json file: /dotnet/shared/Microsoft.NETCore.App/8.0.8/Microsoft.NETCore.App.deps.json location=/dotnet/shared/Microsoft.NETCo
[0015]  WARN unable to process executable "/firmware/ath11k/IPQ5018/hw1.0/m3_fw.b00" error=unable to determine executable kind: unable to read first sector of file: EOF
[0015]  WARN unable to process executable "/firmware/ath11k/IPQ5018/hw1.0/m3_fw.mdt" error=unable to determine executable kind: unable to read first sector of file: EOF
[0015]  WARN unable to process executable "/firmware/ath11k/IPQ6018/hw1.0/m3_fw.b00" error=unable to determine executable kind: unable to read first sector of file: EOF
[0015]  WARN unable to process executable "/firmware/ath11k/IPQ6018/hw1.0/q6_fw.b00" error=unable to determine executable kind: unable to read first sector of file: EOF
[0015]  WARN unable to process executable "/firmware/ath11k/IPQ8074/hw2.0/m3_fw.b00" error=unable to determine executable kind: unable to read first sector of file: EOF
[0015]  WARN unable to process executable "/firmware/ath11k/IPQ8074/hw2.0/m3_fw.mdt" error=unable to determine executable kind: unable to read first sector of file: EOF
[0015]  WARN unable to process executable "/firmware/ath11k/IPQ8074/hw2.0/q6_fw.b00" error=unable to determine executable kind: unable to read first sector of file: EOF
[0015]  WARN unable to process executable "/firmware/ath11k/WCN6750/hw1.0/wpss.b00" error=unable to determine executable kind: unable to read first sector of file: EOF
[0015]  WARN unable to process executable "/firmware/qcom/a530_zap.b00" error=unable to determine executable kind: unable to read first sector of file: EOF
[0015]  WARN unable to process executable "/firmware/qcom/venus-1.8/venus.b00" error=unable to determine executable kind: unable to read first sector of file: EOF
[0015]  WARN unable to process executable "/firmware/qcom/venus-4.2/venus.b00" error=unable to determine executable kind: unable to read first sector of file: EOF
[0015]  WARN unable to process executable "/firmware/qcom/venus-5.2/venus.b00" error=unable to determine executable kind: unable to read first sector of file: EOF
[0015]  WARN unable to process executable "/firmware/qcom/venus-5.4/venus.b00" error=unable to determine executable kind: unable to read first sector of file: EOF
{}

So it seems that I need to scan the entire filesystem if I want to get information about the distro.

Ideally I would do a syft scan with some flags that would identify the distro and give me the list of host packages and their versions, without indexing the entire filesystem or looking at unnecessary files.

kzantow commented 2 weeks ago

To expand on option 1 a bit: the suggestion is instead of indexing the entire filesystem, we could potentially only index/catalog a specific subset of paths as described in this comment.

Why would we need this? Because today, Syft catalogers look in specific locations for certain things like Linux OS distro info, but if a subdirectory is scanned, that subdirectory is considered the "root" of the file scan. So, let's say we scan /var and there is a /var/lib directory: /lib would be considered a top-level directory, and searching for /var/lib would not find anything. This ends up being fairly similar to the suggestion to adjust each individual cataloger pattern, I think.
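
The re-rooting effect described above can be sketched in a few lines of Go. This is a standalone illustration under the assumption that paths are simply made relative to the scan root; `rerooted` is a hypothetical helper, not a syft function.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// rerooted returns the path as it would appear to a scan rooted at scanRoot.
// The boolean is false when abs lies outside scanRoot.
func rerooted(scanRoot, abs string) (string, bool) {
	rel, err := filepath.Rel(scanRoot, abs)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", false
	}
	if rel == "." {
		return "/", true
	}
	return "/" + filepath.ToSlash(rel), true
}

func main() {
	// Scanning /var: the dpkg database shows up under /lib, not /var/lib.
	p, ok := rerooted("/var", "/var/lib/dpkg/status")
	fmt.Println(p, ok)

	// A file outside the scan root is not visible at all.
	p, ok = rerooted("/var", "/etc/os-release")
	fmt.Println(p, ok)
}
```

A cataloger that looks for a fixed path such as /var/lib/dpkg/status would therefore miss the re-rooted /lib/dpkg/status unless the original prefix is preserved somehow.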

ariel-miculas commented 2 weeks ago

Yes, I've also noticed this problem. I thought the --base-path command line option of the syft scan command would solve this and act as "the root of the file scan" the way you've described, but it does something else. I would love it if I could do:

syft scan -o syft-json /etc --file-scan-root=/

and get the distro identification because the files in /etc/ are prefixed with /etc/, so syft would find /etc/os-release (assuming /etc/os-release is a regular file and not a symlink).
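
If the symlink is indeed part of the problem, the check involved could look like the sketch below: resolve the link target relative to the link's directory and test whether the result still lies under the scan root. This is a simplified illustration with hypothetical helper names (`resolveLink`, `within`), not how syft resolves links.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// resolveLink returns the cleaned absolute path a symlink points to,
// given the link's own path and its raw target (as returned by readlink).
func resolveLink(linkPath, target string) string {
	if path.IsAbs(target) {
		return path.Clean(target)
	}
	return path.Join(path.Dir(linkPath), target)
}

// within reports whether p is the root itself or lies beneath it.
func within(root, p string) bool {
	root = path.Clean(root)
	if root == "/" {
		return true
	}
	return p == root || strings.HasPrefix(p, root+"/")
}

func main() {
	resolved := resolveLink("/etc/os-release", "../usr/lib/os-release")
	fmt.Println("resolved target:", resolved)
	fmt.Println("inside /etc scan root:", within("/etc", resolved))
	fmt.Println("inside / scan root:", within("/", resolved))
}
```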

ariel-miculas commented 6 days ago

How do you feel about exposing a NewFromUnindexedDirectory function?

func NewFromUnindexedDirectory(dir string) file.WritableResolver {
    return NewFromUnindexedDirectoryFS(afero.NewOsFs(), dir, "")
}

Then I could do:

fileResolver := syft.NewFromUnindexedDirectory(sourcePath)
release := linux.IdentifyRelease(fileResolver)
log.Info().Msgf("found release %v", release)

to identify the Linux release without building the index of the entire filesystem.