Skip binary files on filesystem scan

baruchiro commented 9 months ago

Steps to reproduce:

1. Build 2ms with go build -o 2ms main.go
2. Run a filesystem scan with ./2ms filesystem --path . --log-level debug
3. One of the scanned files is the ./2ms executable itself.
4. After ~4 minutes (it is a long time!) you will receive a lot of results from the binary.

There are two problems here:

1. The scan takes a very long time.
2. There are a lot of false positives, because the binary content contains byte sequences that look like secrets.

nargov commented 8 months ago

Hi,

I was thinking of tackling this one using this library. While the http package has a MIME type sniffing function, this library also exposes a hierarchy of MIME types, so it can tell binary from text out of the box.

What do you think?

baruchiro commented 8 months ago

I was thinking of tackling this one using this library. While the http package has a MIME type sniffing function, this library also exposes a hierarchy of MIME types, so it can tell binary from text out of the box.

@nargov from their documentation:

Only use libraries like mimetype as a last resort. Content type detection using magic numbers is slow, inaccurate, and non-standard

I don't want to harm our performance; at a minimum, this library makes us read each file twice.

I'm looking for a way to avoid scanning binaries, without a large performance cost on one hand and without doing magic behind the user's back on the other. For example, the last time we saw this problem, we added the max-target-megabytes flag to skip large files. Here, the only thing I can think of is to measure how long the scan of each file takes and log a warning about a potential performance issue.
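
To make the timing idea concrete, here is a minimal sketch, not 2ms code. The names timeScan and slowFileThreshold and the 2-second cutoff are hypothetical, and the scan callback stands in for whatever actually processes a file.

```go
package main

import (
	"fmt"
	"log"
	"time"
)

// slowFileThreshold is a hypothetical cutoff; a real value would need tuning.
const slowFileThreshold = 2 * time.Second

// timeScan runs scan for one file, measures how long it took, and logs a
// warning when the duration exceeds threshold. It never skips the file;
// it only reports, so nothing happens behind the user's back.
func timeScan(path string, threshold time.Duration, scan func(path string) error) (time.Duration, bool, error) {
	start := time.Now()
	err := scan(path)
	elapsed := time.Since(start)

	slow := elapsed > threshold
	if slow {
		log.Printf("warning: scanning %s took %s (threshold %s); it may be a binary file",
			path, elapsed.Round(time.Millisecond), threshold)
	}
	return elapsed, slow, err
}

func main() {
	// Simulate a scan that is slower than a deliberately tiny threshold.
	_, slow, err := timeScan("./2ms", 10*time.Millisecond, func(string) error {
		time.Sleep(30 * time.Millisecond)
		return nil
	})
	fmt.Println("slow:", slow, "err:", err)

	// A fast scan measured against the default threshold is not flagged.
	_, slow, err = timeScan("main.go", slowFileThreshold, func(string) error { return nil })
	fmt.Println("slow:", slow, "err:", err)
}
```

The warning only fires after the time has already been spent, so this helps the user find problematic files but does not shorten the first run.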

What do you think?

By the way, I'm sorry for the late response, I was sick. I appreciate your help!

nargov commented 8 months ago

As an alternative, I see https://pkg.go.dev/net/http#DetectContentType reads at most 512 bytes to detect the MIME type. Think it's good enough?
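
A rough sketch of what that check could look like. The helper name looksBinary is hypothetical, and treating every non-"text/" type as binary is an assumption made only for illustration; the only library call relied on is http.DetectContentType.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// looksBinary reports whether the sniffed content type is something other
// than text. Equating "not text/*" with "binary" is an assumption of this
// sketch, not a rule taken from 2ms.
func looksBinary(data []byte) bool {
	// DetectContentType considers at most the first 512 bytes of data and
	// falls back to "application/octet-stream" when nothing matches.
	contentType := http.DetectContentType(data)
	return !strings.HasPrefix(contentType, "text/")
}

func main() {
	// The first bytes of an ELF executable, such as a Go binary built on Linux.
	elfHeader := []byte{0x7f, 'E', 'L', 'F', 0x02, 0x01, 0x01, 0x00}
	fmt.Println(http.DetectContentType(elfHeader), looksBinary(elfHeader))

	source := []byte("package main\n\nfunc main() {}\n")
	fmt.Println(http.DetectContentType(source), looksBinary(source))
}
```

One caveat with this rule: some textual formats are sniffed as application/* types (PostScript, for example), so a plain prefix check could skip files that are worth scanning.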

baruchiro commented 8 months ago

OK, I think we can create a POC for that. Here is what I'm thinking:

We should avoid reading the file twice! We need to reuse the []byte.
We need to decide which MIME types are ignored.
We need to be sure the MIME type identification does not lead to unexpected results (for example, skipping files that should have been scanned).
Do we want to allow controlling which MIME types will be skipped?
We need to test how it affects the performance.
Can we check if and how KICS handled this situation?
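
As a starting point for the POC, the first few items might be sketched like this. Everything here is hypothetical (the names shouldSkip, scanFile, and skippedTypes, and the default set of skipped types); it is not how 2ms is structured, only an illustration of reading the file once and sniffing the same []byte that is later scanned.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
	"os"
)

// skippedTypes is a hypothetical default; it could later be driven by a flag
// if we decide to let users control which MIME types are skipped.
var skippedTypes = map[string]bool{
	"application/octet-stream": true,
	"application/zip":          true,
	"application/x-gzip":       true,
}

// shouldSkip sniffs content that is already in memory and reports the media
// type (without parameters such as charset) and whether it is in the skip set.
func shouldSkip(content []byte, skipped map[string]bool) (string, bool) {
	mediaType, _, err := mime.ParseMediaType(http.DetectContentType(content))
	if err != nil {
		return "", false // when in doubt, scan the file rather than skip it
	}
	return mediaType, skipped[mediaType]
}

// scanFile reads the file once and hands the same slice to both the sniffing
// step and the scan callback, so no file is read twice.
func scanFile(path string, skipped map[string]bool, scan func([]byte) error) (bool, error) {
	content, err := os.ReadFile(path)
	if err != nil {
		return false, err
	}
	if mediaType, skip := shouldSkip(content, skipped); skip {
		fmt.Printf("skipping %s (detected %s)\n", path, mediaType)
		return true, nil
	}
	return false, scan(content)
}

func main() {
	// Usage: go run . <file>...
	for _, path := range os.Args[1:] {
		skipped, err := scanFile(path, skippedTypes, func(b []byte) error {
			fmt.Printf("scanning %d bytes\n", len(b))
			return nil
		})
		fmt.Println(path, "skipped:", skipped, "err:", err)
	}
}
```

Logging every skipped file keeps the behavior visible, which should make it easier to notice unexpected skips while we test the performance impact.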

You don't have to answer all the questions before you start developing.

baruchiro commented 8 months ago

Another option would be to ignore lines that are too long. On the one hand, such lines may come from a binary file; on the other hand, they may come from a minified JS file.

Checkmarx / 2ms

Skip binary files on filesystem scan #201
