Open deitch opened 1 year ago
Interesting problem @deitch! The one issue I see is that Syft operates on .tar files by extracting them to a temporary directory. If this happens with duplicate files, at least one of the files will be lost and Syft won't be able to properly scan the entire archive. Your suggestion of an option for something like --continue-on-duplicate seems appropriate, or what would you think of something like --duplicate-tar-files first|last|<number> to be able to specify behavior?
If this happens with duplicate files, at least one of the files will be lost and Syft won't be able to properly scan the entire archive.
That is true. And since both the first and second copies of abc are in the archive, if syft only gets the latter, it would not be a complete scan.
I think that is better than the current "will not scan at all" behaviour.
Perhaps the best would be if it could just read the entire tar archive as a stream, rather than extracting it, but I understand that there may be issues with that, like following links? I don't really know.
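To make the streaming idea concrete, here is a minimal sketch in Python rather than Go, using only the standard tarfile module. It is not how syft reads archives; the helper names build_archive_with_duplicate and scan_stream are hypothetical. It builds an in-memory archive with two entries named abc, then walks the archive in stream mode so both copies are seen without anything being written to disk.

```python
import io
import tarfile
from collections import Counter


def build_archive_with_duplicate(name="abc"):
    """Return the bytes of a tar archive holding two entries with the same name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for payload in (b"first copy\n", b"second copy\n"):
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def scan_stream(data):
    """Visit every regular file entry in order, reading it from the stream."""
    seen = []
    # "r|" opens the archive as a forward-only stream of blocks.
    with tarfile.open(fileobj=io.BytesIO(data), mode="r|") as tar:
        for member in tar:
            if member.isfile():
                content = tar.extractfile(member).read()
                seen.append((member.name, content))
    return seen


entries = scan_stream(build_archive_with_duplicate())
counts = Counter(name for name, _ in entries)
duplicates = [name for name, n in counts.items() if n > 1]
```

Because nothing is extracted, there is no "file already exists" collision, and a scanner could look at both copies of abc. This sketch only reads regular files and does not resolve links, which is the open question raised above.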
Your suggestion of an option for something like --continue-on-duplicate seems appropriate, or what would you think of something like --duplicate-tar-files first|last|<number> to be able to specify behavior?
My instinct would be to give a WARN on duplicates but keep going, thus scanning the latest, with an option to fail-on-duplicates.
what would you think of something like --duplicate-tar-files first|last|<number> to be able to specify behavior
I am not sure it fits the expected behaviour. tar behaviour is "last entry wins", which fits fine with syft taking the last one. Recognizing that it might be an incomplete sbom, because it misses others that were masked, a user might want to say, "hold on, if my sbom is incomplete, error out." I have a hard time seeing why someone might want to specifically pick one of the others. That would digress from normal tar behaviour (take last or optionally error out) without solving the "incomplete sbom" issue.
I think the user would want either to take last consistent with tar (default); or error out because of incomplete sbom (option). Anything else doesn't fit with either.
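The two behaviours described here can be sketched roughly as follows. This is a Python illustration of the policy only, not syft's implementation; select_entries, DuplicateEntryError, and the fail_on_duplicates parameter are hypothetical names, and no such syft flag is implied.

```python
import io
import logging
import tarfile

log = logging.getLogger("tar-duplicates")


class DuplicateEntryError(Exception):
    """Raised when fail_on_duplicates is set and a path appears more than once."""


def select_entries(data, fail_on_duplicates=False):
    """Map each path to the content of its last entry (tar's last-entry-wins)."""
    selected = {}
    with tarfile.open(fileobj=io.BytesIO(data), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            if member.name in selected:
                if fail_on_duplicates:
                    raise DuplicateEntryError(member.name)
                log.warning("duplicate entry %s: keeping the later copy", member.name)
            # A later entry overwrites the earlier one under the same path.
            selected[member.name] = tar.extractfile(member).read()
    return selected


# Demo archive with two entries named "abc".
demo = io.BytesIO()
with tarfile.open(fileobj=demo, mode="w") as tar:
    for payload in (b"first copy\n", b"second copy\n"):
        info = tarfile.TarInfo(name="abc")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
demo_bytes = demo.getvalue()

default_result = select_entries(demo_bytes)
```

With the default, the scan continues with a warning and sees only the later copy; with fail_on_duplicates=True the same archive raises DuplicateEntryError, which covers the "error out because of incomplete sbom" case.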
@deitch good point -- if a user extracts the tar, and the behavior is always that the last entry wins, I'd agree a warning here is probably sufficient as the first duplicate entry would essentially get overwritten. I'll bring this up with the team today and see if we can get some consensus -- if so, this sounds like a pretty simple change.
Is there anything I can do to help?
It looks like this is mostly fixed but not entirely.
If the file being replaced is a symlink, then it tries to follow the symlink and replace its target, rather than the link itself. This is an issue in archiver, not in syft per se, so I will open an issue there and link it here.
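For illustration, the difference between replacing a link and replacing its target looks roughly like this. It is a Python sketch of the general technique, not archiver's code, and write_replacing_link is a hypothetical helper. Opening a symlink path for writing follows the link and overwrites the target, so the link has to be removed first.

```python
import os
import tempfile


def write_replacing_link(path, content):
    """Write content at path, replacing an existing symlink instead of its target."""
    if os.path.islink(path):
        # Remove the link itself; open() below would otherwise follow it.
        os.unlink(path)
    with open(path, "wb") as fh:
        fh.write(content)


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "target")
link = os.path.join(workdir, "abc")

with open(target, "wb") as fh:
    fh.write(b"original target\n")
os.symlink(target, link)

# A later archive entry named "abc" arrives as a regular file.
write_replacing_link(link, b"later entry\n")
```

After the call, abc is a regular file holding the later entry and the old link target is left untouched.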
See the linked issue. syft still uses archiver/v3, which is no longer supported. v4 doesn't have this issue, but it requires more work on the consumer's part.
@deitch reopened this to update to archiver/v4
👍
Hi @kzantow following up on this one. Any success?
Please provide a set of steps on how to reproduce the issue
[0000] WARN file could not be unarchived: reading file in tar archive: file already exists: /tmp/syft-archive-contents-1809756098/abc
No packages discovered
Application: syft
Version: 0.59.0
JsonSchemaVersion: 4.1.0
BuildDate: 2022-10-17T16:13:44Z
GitCommit: 41bc6bb410352845f22766e27dd48ba93aa825a4
GitDescription: v0.59.0
Platform: linux/amd64
GoVersion: go1.18.7
Compiler: gc
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal