chainguard-dev / malcontent

#supply #chain #attack #detection
Apache License 2.0

Make diff behave like diff(1); report consistent behaviors #628

Closed: egibs closed this 3 days ago

egibs commented 3 days ago

With the clarification about diff(1) behavior in #599, I wanted to get something written up to address the current implementation gap.

This PR overhauls diff and tries to mimic what diff(1) does:

When diffing directories, the source report is first compared against the destination report to identify files present in both paths, followed by files present only in the source path. Afterward, the opposite comparison is done to identify files that exist only in the destination path.
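A minimal sketch of that two-pass comparison, for illustration only. FileReport, DirDiff, and diffDirs are hypothetical names rather than the types used in this PR, and the sketch assumes each report is keyed by the file's path relative to the scanned root.

```go
package main

import (
	"fmt"
	"sort"
)

// FileReport is a hypothetical stand-in for a per-file scan result.
type FileReport struct {
	Path string // path relative to the scanned root (assumed key)
}

// DirDiff groups files the way a directory comparison reports them.
type DirDiff struct {
	Modified []string // present in both source and destination
	Removed  []string // present only in the source
	Added    []string // present only in the destination
}

// diffDirs walks the source reports first (matches, then source-only
// files), then walks the destination reports for destination-only files.
func diffDirs(src, dest map[string]FileReport) DirDiff {
	var d DirDiff
	for rel := range src {
		if _, ok := dest[rel]; ok {
			d.Modified = append(d.Modified, rel)
		} else {
			d.Removed = append(d.Removed, rel)
		}
	}
	for rel := range dest {
		if _, ok := src[rel]; !ok {
			d.Added = append(d.Added, rel)
		}
	}
	// Go map iteration order is randomized, so sort for stable output.
	sort.Strings(d.Modified)
	sort.Strings(d.Removed)
	sort.Strings(d.Added)
	return d
}

func main() {
	src := map[string]FileReport{
		"ls":                     {Path: "ls"},
		"_load_setup_py_data.py": {Path: "_load_setup_py_data.py"},
	}
	dest := map[string]FileReport{
		"ls":       {Path: "ls"},
		"psLib.py": {Path: "psLib.py"},
	}
	d := diffDirs(src, dest)
	fmt.Println("modified:", d.Modified)
	fmt.Println("removed:", d.Removed)
	fmt.Println("added:", d.Added)
}
```

Sorting the result is a choice made for this sketch so that repeated runs print the same order; it is not a claim about how the PR orders its output.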

The processSrc, processDest, and fileDestination functions were confusing, and I think handleFile does everything we need for "modified" files. In every other case, we add reports directly to the Added/Removed map.

I also started tracking consistent behaviors across modified files (originally called existing, I think?), meaning behaviors that show up on both sides of the diff, and updated the renderers to account for them. Depending on the format, consistent behaviors will show with no + or -. In the terminal, consistent behaviors will show up as cyan; the updated diff test data also contains these behaviors.
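A rough sketch of how behaviors for one modified file could be split into added, removed, and consistent groups. BehaviorDiff and diffBehaviors are hypothetical names, and behaviors are assumed to be comparable by their ID string; the prefixes printed in main mirror the terminal examples below.

```go
package main

import (
	"fmt"
	"sort"
)

// BehaviorDiff holds behavior IDs grouped by how they changed.
type BehaviorDiff struct {
	Added      []string // only in the destination file
	Removed    []string // only in the source file
	Consistent []string // in both files
}

// diffBehaviors compares two lists of behavior IDs using set membership.
func diffBehaviors(src, dest []string) BehaviorDiff {
	inSrc := make(map[string]bool, len(src))
	for _, id := range src {
		inSrc[id] = true
	}
	inDest := make(map[string]bool, len(dest))
	for _, id := range dest {
		inDest[id] = true
	}

	var d BehaviorDiff
	for id := range inSrc {
		if inDest[id] {
			d.Consistent = append(d.Consistent, id)
		} else {
			d.Removed = append(d.Removed, id)
		}
	}
	for id := range inDest {
		if !inSrc[id] {
			d.Added = append(d.Added, id)
		}
	}
	sort.Strings(d.Added)
	sort.Strings(d.Removed)
	sort.Strings(d.Consistent)
	return d
}

func main() {
	src := []string{"shell/TERM", "directory/traverse", "link_read"}
	dest := []string{"compression/lzma", "shell/TERM", "link_read"}

	d := diffBehaviors(src, dest)
	for _, id := range d.Added {
		fmt.Println("+++", id)
	}
	for _, id := range d.Removed {
		fmt.Println("---", id)
	}
	for _, id := range d.Consistent {
		fmt.Println("~~~", id)
	}
}
```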

Examples:

Two directories:

$ go run cmd/mal/mal.go diff ./out/chainguard-dev/malcontent-samples/python/clean/conda-build/ ./out/chainguard-dev/malcontent-samples/python/clean/fonttools/
├─ 🟡 Deleted: out/chainguard-dev/malcontent-samples/python/clean/conda-build/_load_setup_py_data.py [MEDIUM]
│     ≡ execution [MEDIUM]
│       🟡 remote_commands/code_eval — evaluate code dynamically using exec(): exec(code,, import
│     ≡ filesystem [LOW]
│       🔵 file/open — opens files: open(
│     ≡ impact [LOW]
│       🔵 remote_access/py_setuptools — Python library installer that evaluates arbitrary code: exec(code
│     ≡ networking [MEDIUM]
│       🟡 download — download files: not downloaded yet
│       🔵 url/embedded — contains embedded HTTPS URLs: https://numpy.org/doc/stable/reference/distutils_status_migration.html
│     ≡ operating-system [LOW]
│       🔵 fd/read — reads from a file handle: compile(f.read()
│
├─ 🔵 Added: out/chainguard-dev/malcontent-samples/python/clean/fonttools/psLib.py []
│

Two relative directories:

$ go run cmd/mal/mal.go diff ../malcontent-samples/python/clean/hatch/ ../malcontent-samples/python/clean/idna/
├─ 🟡 Deleted: ../malcontent-samples/python/clean/hatch/migrate.py [MEDIUM]
│     ≡ discovery [MEDIUM]
│       🟡 system/environment — Dump values from the environment: os.environ.items()
│     ≡ execution [MEDIUM]
│       🟡 program — execute external program: subprocess.run([sys.executable, setup_py], env
│       🟡 remote_commands/code_eval — evaluate code dynamically using eval(): eval(value)
│     ≡ false-positives [LOW]
│       🔵 py_hatch — migrate py: '_HATCHLING_PORT_ADD_', literal_eval(value)
│     ≡ filesystem [LOW]
│       🔵 directory/list — lists contents of a directory: .listdir(
│       🔵 file/open — opens files: open(
│       🔵 symlink_resolve — resolves symbolic links: realpath
│     ≡ networking [MEDIUM]
│       🟡 download — download files: Download, download_url
│     ≡ operating-system [LOW]
│       🔵 fd/read — reads from a file handle: f.read()
│       🔵 fd/write — writes to a file handle: f.write(output)
│     ≡ process [MEDIUM]
│       🟡 executable_path — gets executable associated to this process: sys.executable
│
├─ 🟡 Added: ../malcontent-samples/python/clean/idna/setup.py [MEDIUM]
│     ≡ execution [MEDIUM]
│       🟡 remote_commands/code_eval — evaluate code dynamically using exec(): exec(open('idna, import
│     ≡ filesystem [LOW]
│       🔵 file/open — opens files: open(
│     ≡ networking [LOW]
│       🔵 url/embedded — contains embedded HTTPS URLs: https://github.com/kjd/idna
│     ≡ operating-system [LOW]
│       🔵 fd/read — reads from a file handle: ).read()
│

Two unrelated files:

$ go run cmd/mal/mal.go diff ../malcontent-samples/macOS/clean/ls ../malcontent-samples/linux/clean/ls.x86_64 
├─ 🟡 Changed: ../malcontent-samples/linux/clean/ls.x86_64 [LOW → MEDIUM]
│     ▲ data [NONE → LOW]
+++     🔵 compression/lzma — works with lzma files
│     ▲ discovery [NONE → LOW]
+++     🔵 system/hostname — get computer host name: gethostname
│     ≡ execution [LOW]
~~~     🔵 shell/TERM — Look up or override terminal settings
│     ≡ filesystem [LOW]
---     🔵 directory/traverse — traverse filesystem hierarchy
~~~     🔵 link_read — read value of a symbolic link
│     ▲ networking [NONE → LOW]
+++     🔵 url/embedded — contains embedded HTTPS URLs:
+++           https://gnu.org/licenses/gpl.html, https://translationproject.org/team/, https://wiki.xiph.org/MIME_Types_and_File_Extensions, https://www.gnu.org/software/coreutils/
│     ▲ process [NONE → MEDIUM]
+++     🟡 name_set — get or set the current process name: __progname
│

Two unrelated files in the same parent:

$ go run cmd/mal/mal.go diff ./out/chainguard-dev/malcontent-samples/linux/clean/ls.x86_64 ./out/chainguard-dev/malcontent-samples/macOS/clean/ls
├─ 🔵 Changed: out/chainguard-dev/malcontent-samples/macOS/clean/ls [MEDIUM → LOW]
│     X data [LOW → NONE]
---     🔵 compression/lzma — works with lzma files
│     X discovery [LOW → NONE]
---     🔵 system/hostname — get computer host name
│     ≡ execution [LOW]
~~~     🔵 shell/TERM — Look up or override terminal settings
│     ≡ filesystem [LOW]
+++     🔵 directory/traverse — traverse filesystem hierarchy: _fts_children, _fts_close, _fts_open, _fts_read, _fts_set
~~~     🔵 link_read — read value of a symbolic link
│     X networking [LOW → NONE]
---     🔵 url/embedded — contains embedded HTTPS URLs
│     X process [MEDIUM → NONE]
---     🟡 name_set — get or set the current process name
│

Moving further down the directory structure:

$ $HOME/go/1.23.2/bin/mal diff linux/clean/ls.x86_64 macOS/clean/ls
├─ 🔵 Changed: macOS/clean/ls [MEDIUM → LOW]
│     X data [LOW → NONE]
---     🔵 compression/lzma — works with lzma files
│     X discovery [LOW → NONE]
---     🔵 system/hostname — get computer host name
│     ≡ execution [LOW]
~~~     🔵 shell/TERM — Look up or override terminal settings
│     ≡ filesystem [LOW]
+++     🔵 directory/traverse — traverse filesystem hierarchy: _fts_children, _fts_close, _fts_open, _fts_read, _fts_set
~~~     🔵 link_read — read value of a symbolic link
│     X networking [LOW → NONE]
---     🔵 url/embedded — contains embedded HTTPS URLs
│     X process [MEDIUM → NONE]
---     🟡 name_set — get or set the current process name
│

Two directories that share a file of the same name:

$ go run cmd/mal/mal.go diff /tmp/old/ /tmp/new/
├─ 🟡 Deleted: /private/tmp/old/_load_setup_py_data.py [MEDIUM]
│     ≡ execution [MEDIUM]
│       🟡 remote_commands/code_eval — evaluate code dynamically using exec(): exec(code,, import
│     ≡ filesystem [LOW]
│       🔵 file/open — opens files: open(
│     ≡ impact [LOW]
│       🔵 remote_access/py_setuptools — Python library installer that evaluates arbitrary code: exec(code
│     ≡ networking [MEDIUM]
│       🟡 download — download files: not downloaded yet
│       🔵 url/embedded — contains embedded HTTPS URLs: https://numpy.org/doc/stable/reference/distutils_status_migration.html
│     ≡ operating-system [LOW]
│       🔵 fd/read — reads from a file handle: compile(f.read()
│
├─ 🔵 Changed: /private/tmp/new/ls [MEDIUM → LOW]
│     X data [LOW → NONE]
---     🔵 compression/lzma — works with lzma files
│     X discovery [LOW → NONE]
---     🔵 system/hostname — get computer host name
│     ≡ execution [LOW]
     🔵 shell/TERM — Look up or override terminal settings
│     ≡ filesystem [LOW]
+++     🔵 directory/traverse — traverse filesystem hierarchy: _fts_children, _fts_close, _fts_open, _fts_read, _fts_set
     🔵 link_read — read value of a symbolic link
│     X networking [LOW → NONE]
---     🔵 url/embedded — contains embedded HTTPS URLs
│     X process [MEDIUM → NONE]
---     🟡 name_set — get or set the current process name
│

Consistent archive diffs:

$ for i in (seq 1 10); go run cmd/mal/mal.go diff /tmp/py3.13-debugpy-bin-1.8.6-r1.apk /tmp/py3.13-debugpy-bin-1.8.7-r0.apk; end
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
egibs commented 3 days ago

Diffing archives is exhibiting inconsistent behavior so I need to fix that.

Edit: updated in 6180a11 (#628). Without this change, the files in each archive report were compared as if they were single files rather than an extracted directory of files. Because processing is concurrent, each run of the diff would then show a different single file.
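To illustrate the idea of treating an archive as a directory of files, here is a small sketch that pairs archive members by their path inside the archive. It is not the code from 6180a11: memberPath and commonMembers are hypothetical helpers, and splitting on the " ∴ " separator from the output above is an assumption made only for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// archiveSep is the separator shown in the output above between an archive
// and a member path. Using it as a parsing key is an assumption.
const archiveSep = " ∴ "

// memberPath returns the path inside the archive, or the input unchanged
// when it contains no archive separator.
func memberPath(p string) string {
	if _, inner, ok := strings.Cut(p, archiveSep); ok {
		return inner
	}
	return p
}

// commonMembers maps each member path found in both archives to its
// full [old, new] display paths, so members are compared pairwise.
func commonMembers(oldPaths, newPaths []string) map[string][2]string {
	oldByMember := make(map[string]string, len(oldPaths))
	for _, p := range oldPaths {
		oldByMember[memberPath(p)] = p
	}
	pairs := make(map[string][2]string)
	for _, p := range newPaths {
		m := memberPath(p)
		if o, ok := oldByMember[m]; ok {
			pairs[m] = [2]string{o, p}
		}
	}
	return pairs
}

func main() {
	oldPaths := []string{"/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /usr/bin/debugpy"}
	newPaths := []string{"/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy"}
	for member, pair := range commonMembers(oldPaths, newPaths) {
		fmt.Printf("%s: %s -> %s\n", member, pair[0], pair[1])
	}
}
```

Keying on the member path keeps the pairing independent of the order in which concurrent workers finish, which is the property the repeated runs above are checking.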

tstromberg commented 3 days ago

This is huge - thank you!

egibs commented 3 days ago

Will merge in a bit. Working on one last bug.