anchore / syft

CLI tool and library for generating a Software Bill of Materials from container images and filesystems
Apache License 2.0

duplicate file in tar archive causes read to fail #1400

Open deitch opened 1 year ago

deitch commented 1 year ago

Please provide a set of steps on how to reproduce the issue

  1. Create a simple directory with just a few files. e.g.
    $ mkdir /tmp/syft
    $ cp $(which syft) /tmp/syft
    $ echo foo > /tmp/syft/abc
  2. Create a tar file with those contents:
    $ tar -C /tmp/syft -cvf /tmp/syft.tar .
  3. Use syft to scan the dir, it all works:
    $ syft dir:/tmp/syft # or just `syft /tmp/syft`
  4. Use syft to scan the tar file, it all works:
    $ syft file:/tmp/syft.tar
  5. Add a duplicate file to the tar file
    $ tar -C /tmp/syft -rvf /tmp/syft.tar ./abc
  6. Scan the tar file, it fails
    
    $ syft file:/tmp/syft.tar
    ✔ Indexed /tmp/syft.tar
    ✔ Cataloged packages      [0 packages]
    [0000] WARN file could not be unarchived: reading file in tar archive: file already exists: /tmp/syft-archive-contents-1809756098/abc
    No packages discovered


**What happened**:

syft refuses to scan a tar file that contains duplicate entries. Because of tar's sequential structure (it _is_ a tape archive 😁), duplicate entries are legitimate. Further, when extracting a tar file, tar generally writes later entries over earlier ones, unless explicitly told to fail on duplicates. syft, however, fails outright.
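
To make the duplicate-entry case concrete, here is a minimal Python sketch (not syft's code, which is written in Go) that builds an in-memory archive with two entries of the same name and then lists the names that repeat. The helpers `duplicate_members` and `build_demo_archive` are made up for illustration; only the standard-library `tarfile` module is used.

```python
import io
import tarfile
from collections import Counter


def duplicate_members(fileobj):
    """Return the entry names that occur more than once, in first-seen order."""
    with tarfile.open(fileobj=fileobj, mode="r:") as archive:
        counts = Counter(member.name for member in archive)
    return [name for name, count in counts.items() if count > 1]


def build_demo_archive():
    """Build an in-memory tar holding two entries that are both named 'abc'."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for payload in (b"foo\n", b"bar\n"):
            info = tarfile.TarInfo(name="abc")
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer
```

Calling `duplicate_members(build_demo_archive())` reports `abc` as the only repeated name, which is the same situation step 5 of the reproduction creates with `tar -r`.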

**What you expected to happen**:

I expected it to process the tar file successfully. At the very least, there should be options to fail-on-duplicate or continue-on-duplicate (the default tar behaviour).

**Anything else we need to know?**:

No.

**Environment**:
- Output of `syft version`:

    Application: syft
    Version: 0.59.0
    JsonSchemaVersion: 4.1.0
    BuildDate: 2022-10-17T16:13:44Z
    GitCommit: 41bc6bb410352845f22766e27dd48ba93aa825a4
    GitDescription: v0.59.0
    Platform: linux/amd64
    GoVersion: go1.18.7
    Compiler: gc


(although I tried it with v0.63.0 via docker as well)

- OS (e.g: `cat /etc/os-release` or similar):

    NAME="Ubuntu"
    VERSION="20.04.5 LTS (Focal Fossa)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 20.04.5 LTS"
    VERSION_ID="20.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=focal
    UBUNTU_CODENAME=focal



(although I ran it inside the official syft docker images as well)

kzantow commented 1 year ago

Interesting problem @deitch! The one issue I see is that Syft operates on .tar files by extracting them to a temporary directory. If this happens with duplicate files, at least one of the files will be lost and Syft won't be able to properly scan the entire archive. Your suggestion of an option for something like --continue-on-duplicate seems appropriate, or what would you think of something like --duplicate-tar-files first|last|<number> to be able to specify behavior?

deitch commented 1 year ago

> If this happens with duplicate files, at least one of the files will be lost and Syft won't be able to properly scan the entire archive.

That is true. And since both the first and second copies of abc are in the archive, if syft only gets the latter, it would not be a complete scan.

I think that is better than the current "will not scan at all" behaviour.

Perhaps the best would be if it could just read the entire tar archive as a stream, rather than extracting it, but I understand that there may be issues with that, like following links? I don't really know.
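
As a rough illustration of the "read it as a stream" idea, the Python sketch below walks an archive front to back without writing anything to disk. It is not how syft works today (per the comment above, syft extracts to a temporary directory), and `stream_entries` and `build_demo_archive` are hypothetical names. It also shows why links are the awkward part: a link entry carries only a target name, so resolving it would need extra bookkeeping that this sketch does not attempt.

```python
import io
import tarfile


def stream_entries(fileobj):
    """Yield (name, kind, value) per entry, in archive order, without extracting."""
    # Mode "r|" treats the input as a forward-only stream: no seeking, no temp dir.
    with tarfile.open(fileobj=fileobj, mode="r|") as archive:
        for member in archive:
            if member.isfile():
                # In stream mode the payload must be read before advancing.
                yield member.name, "file", archive.extractfile(member).read()
            elif member.issym() or member.islnk():
                # Links carry only a target name; nothing is resolved here.
                yield member.name, "link", member.linkname


def build_demo_archive():
    """In-memory tar: 'abc' twice with different content, plus a symlink to it."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for payload in (b"foo\n", b"bar\n"):
            info = tarfile.TarInfo(name="abc")
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
        link = tarfile.TarInfo(name="link")
        link.type = tarfile.SYMTYPE
        link.linkname = "abc"
        archive.addfile(link)
    buffer.seek(0)
    return buffer
```

Because every entry is visited in order, both copies of `abc` are seen, so a stream-based scan would not lose the earlier one the way extraction does.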

> Your suggestion of an option for something like --continue-on-duplicate seems appropriate, or what would you think of something like --duplicate-tar-files first|last|<number> to be able to specify behavior?

My instinct would be to give a WARN on duplicates but keep going, thus scanning the latest, with an option to fail-on-duplicates.

> what would you think of something like --duplicate-tar-files first|last|<number> to be able to specify behavior

I am not sure it fits the expected behaviour. tar behaviour is "last entry wins", which fits fine with syft taking the last one. Recognizing that it might be an incomplete sbom, because it misses others that were masked, a user might want to say, "hold on, if my sbom is incomplete, error out." I have a hard time seeing why someone might want to specifically pick one of the others. That would digress from normal tar behaviour (take last or optionally error out) without solving the "incomplete sbom" issue.

I think the user would want either to take last consistent with tar (default); or error out because of incomplete sbom (option). Anything else doesn't fit with either.
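
The two behaviours described here (warn and let the last entry win by default, or error out on request) can be sketched in Python as follows. This is only an illustration of the policy, not syft's implementation: `resolve_entries`, `DuplicateEntryError`, and the `fail_on_duplicate` parameter are assumed names, and payloads are held in memory purely to keep the example short.

```python
import io
import logging
import tarfile

log = logging.getLogger("tar-duplicates")


class DuplicateEntryError(Exception):
    """Raised when fail_on_duplicate is set and an entry name repeats."""


def resolve_entries(fileobj, fail_on_duplicate=False):
    """Map each regular-file name to the payload of its last occurrence."""
    resolved = {}
    with tarfile.open(fileobj=fileobj, mode="r:") as archive:
        for member in archive:
            if not member.isfile():
                continue
            if member.name in resolved:
                if fail_on_duplicate:
                    raise DuplicateEntryError(member.name)
                # Default: warn, then let the later copy replace the earlier one.
                log.warning("duplicate entry %s: later copy wins", member.name)
            resolved[member.name] = archive.extractfile(member).read()
    return resolved


def build_demo_archive():
    """Build an in-memory tar holding 'abc' twice with different content."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for payload in (b"foo\n", b"bar\n"):
            info = tarfile.TarInfo(name="abc")
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer
```

With the default, `resolve_entries(build_demo_archive())` logs one warning and keeps only the second copy of `abc`; with `fail_on_duplicate=True` it raises instead, which covers the "if my sbom is incomplete, error out" case.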

kzantow commented 1 year ago

@deitch good point -- if a user extracts the tar, and the behavior is always that the last entry wins, I'd agree a warning here is probably sufficient as the first duplicate entry would essentially get overwritten. I'll bring this up with the team today and see if we can get some consensus -- if so, this sounds like a pretty simple change.

deitch commented 1 year ago

Is there anything I can do to help?

deitch commented 1 year ago

It looks like this is mostly fixed but not entirely.

If the file being replaced is a symlink, then it tries to follow the symlink and replace its target, rather than the link itself. This is an issue in archiver, not in syft per se, so I will open an issue there and link it here.
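
The general pitfall can be shown outside of archiver with a small Python sketch. It does not reproduce archiver's code; it only illustrates that opening an existing symlink for writing rewrites the link's target, whereas removing the entry first replaces the link itself. The function names are made up for this example, and it assumes a platform where `os.symlink` is permitted.

```python
import os
import tempfile


def write_following_link(path, payload):
    """Naive overwrite: open() follows a symlink and rewrites its target."""
    with open(path, "wb") as handle:
        handle.write(payload)


def write_replacing_entry(path, payload):
    """Remove whatever is at path without following links, then create a file."""
    if os.path.lexists(path):
        os.unlink(path)
    with open(path, "wb") as handle:
        handle.write(payload)


def demo(writer):
    """Overwrite 'link' (a symlink to 'target') and report what changed.

    Returns (link_is_still_a_symlink, content_of_target).
    """
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "target")
        link = os.path.join(root, "link")
        with open(target, "wb") as handle:
            handle.write(b"original\n")
        os.symlink("target", link)
        writer(link, b"replacement\n")
        with open(target, "rb") as handle:
            return os.path.islink(link), handle.read()
```

`demo(write_following_link)` leaves the symlink in place and changes the content of `target`, while `demo(write_replacing_entry)` turns `link` into a regular file and leaves `target` untouched, which is the "last entry wins" result one would expect from an extractor.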

deitch commented 1 year ago

See https://github.com/mholt/archiver/issues/380

deitch commented 1 year ago

See the linked issue. syft still uses archiver/v3, which is no longer supported. v4 doesn't have this issue, but it requires more work on the consumer's part.

kzantow commented 1 year ago

@deitch reopened this to update to archiver/v4 👍

deitch commented 4 months ago

Hi @kzantow following up on this one. Any success?