chainguard-dev / malcontent

#supply #chain #attack #detection
Apache License 2.0
446 stars 31 forks source link

diff between two relative path files still shows up as deleted+added #599

Closed tstromberg closed 3 days ago

tstromberg commented 1 week ago

I think we still have our logic a bit off if you are passing two filenames as arguments:

go run ./cmd/mal diff ../malcontent-samples/linux/clean/ls.x86_64 ../malcontent-samples/macOS/clean/ls
├─ 🟡 Deleted: ls.x86_64 [MEDIUM]
│     ≡ data [LOW]
│       🟢 compression/lzma — works with lzma files
│     ≡ discovery [LOW]
│       🟢 system/hostname_get — get computer host name: gethostname
│     ≡ execution [LOW]
│       🟢 shell/TERM — Look up or override terminal settings: TERM
│     ≡ filesystem [LOW]
│       🟢 link_read — read value of a symbolic link: readlink
│     ≡ networking [LOW]
│       🟢 url/embedded — contains embedded HTTPS URLs:
│            https://gnu.org/licenses/gpl.html, https://translationproject.org/team/, https://wiki.xiph.org/MIME_Types_and_Fil…
│     ≡ process [MEDIUM]
│       🟡 name_set — get or set the current process name: __progname
│
├─ 🟢 Added: ls [LOW]
│     ≡ execution [LOW]
│       🟢 shell/TERM — Look up or override terminal settings: TERM
│     ≡ filesystem [LOW]
│       🟢 directory/traverse — traverse filesystem hierarchy: _fts_children, _fts_close, _fts_open, _fts_read, _fts_set
│       🟢 link_read — read value of a symbolic link: readlink

similar to the diff(1) command, this should show up as a single changed file.

egibs commented 5 days ago

I think this comes down to the added/removed logic. The first file will be stored in the map with the key ls and we'll use that to do the lookup for ls.x86_64 and it won't match so it will be treated as an add/remove rather than a modify.

Using the diff example above:

relPath: ls
fromMap: map[ls.x86_64:0xc0001a4140]

Here's where we store the relative path in the map: https://github.com/chainguard-dev/malcontent/blob/29fb875cbc017a6955a5001f8acb4d48141110f7/pkg/action/diff.go#L60

The lookup in processSrc: https://github.com/chainguard-dev/malcontent/blob/624befeed38fe9981ac3fbd8d9b0bb2bcaa13cae/pkg/action/diff.go#L128-L132

The lookup in processDest: https://github.com/chainguard-dev/malcontent/blob/29fb875cbc017a6955a5001f8acb4d48141110f7/pkg/action/diff.go#L126-L130

egibs commented 5 days ago

If I store the value of rel with any extension trimmed, this works as described above:

$ go run cmd/mal/mal.go diff ../malcontent-samples/linux/clean/ls.x86_64 ../malcontent-samples/macOS/clean/ls
├─ 🔵 Changed: ../malcontent-samples/macOS/clean/ls [MEDIUM → LOW]
│     X data [LOW → NONE]
---     🔵 compression/lzma — works with lzma files
│     X discovery [LOW → NONE]
---     🔵 system/hostname — get computer host name
│     ≡ execution [LOW]
│       🔵 shell/TERM — Look up or override terminal settings: TERM
│     ≡ filesystem [LOW]
│       🔵 link_read — read value of a symbolic link: readlink
+++     🔵 directory/traverse — traverse filesystem hierarchy: _fts_children, _fts_close, _fts_open, _fts_read, _fts_set
│     X networking [LOW → NONE]
---     🔵 url/embedded — contains embedded HTTPS URLs
│     X process [MEDIUM → NONE]
---     🟡 name_set — get or set the current process name
│

Do we want to go this route? Two files of different types with the same name (sans extension) would show odd behavior drift but maybe we want that? Conversely, two files with different names may be the same underlying file and the diff would show as deleted/added when it would be more appropriate to show as changed.

tstromberg commented 3 days ago

FWIW, like diff(1), the file name shouldn't matter at all if two filenames are being passed into it. Filename logic only happens when directories are passed in.

tstromberg commented 3 days ago

I mentioned relative paths in the initial bug report, but this happens for absolute paths too:

go run ./cmd/mal/ diff ../malcontent-samples/linux/clean/ls.x86_64 /bin/ls

egibs commented 3 days ago

FWIW, like diff(1), the file name shouldn't matter at all if two filenames are being passed into it. Filename logic only happens when directories are passed in.

Ah okay, this clarifies things. I'll hack on that tomorrow.

egibs commented 3 days ago

Addressed via #628.