Open pombredanne opened 7 years ago
The Intel hex files in the Linux kernel are easy to detect as they follow a fixed format:
https://en.wikipedia.org/wiki/Intel_HEX
This is easy to verify, as for instance in #426.
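As a rough illustration of the "fixed format" point, here is a minimal sketch of an Intel HEX check. Each record is a `:` followed by hex digits encoding a byte count, a 16-bit address, a record type, the data bytes and a checksum; all those bytes must sum to 0 modulo 256. The function names and the `min_ratio` threshold are hypothetical and not part of the toolkit. The threshold is only an assumption to tolerate a few non-record lines such as license text.

```python
import binascii


def is_intel_hex_record(line):
    """Return True if `line` is a well-formed Intel HEX record with a valid checksum."""
    line = line.strip()
    if not line.startswith(':'):
        return False
    try:
        raw = binascii.unhexlify(line[1:])
    except ValueError:
        # binascii.Error is a ValueError subclass: odd length or non-hex chars
        return False
    # Fixed overhead: byte count (1) + address (2) + record type (1) + checksum (1)
    if len(raw) < 5:
        return False
    # The first byte declares how many data bytes the record carries
    if raw[0] != len(raw) - 5:
        return False
    # All bytes, checksum included, must sum to zero modulo 256
    return sum(raw) % 256 == 0


def looks_like_intel_hex(lines, min_ratio=0.9):
    """Return True if at least `min_ratio` of the non-empty lines are valid records.

    `min_ratio` is an assumed threshold, not a value from the toolkit.
    """
    non_empty = [l for l in lines if l.strip()]
    if not non_empty:
        return False
    records = sum(1 for l in non_empty if is_intel_hex_record(l))
    return records / len(non_empty) >= min_ratio
```

A file flagged this way could still be scanned for license text; the check only tells you what kind of file you are looking at.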
@armijnhemel Thanks. Note that hex files are still something that needs scanning, especially in the kernel. See https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/firmware/bnx2/bnx2-rv2p-06-6.0.15.fw.ihex#n360
Most hex firmwares there have the most byzantine and bizarre licensing. Several do not look like bona fide FLOSS licenses https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/firmware/emi26/bitstream.HEX#n4372 and are surely NOT available in a form suitable for further modifications... as an aside these binary-like, no-sources-available plugs look like a major security risk to me.
True, but at least identifying the hex files as hex files would allow you to better zoom in.
Here is the TODO:
In terms of implementation I suggest:
is_binary_data
Hi, I am trying to work on this issue. Are there any files in the repo which help the toolkit skip scanning certain files? Can anyone guide me to those files?
I have mixed feelings on data files: I want to scan them as they may contain licenses or clues AND I do not want to scan them if they are highly likely not to contain any clues.
For the copyright detection that seeks possible dates and date ranges (such as Copyright (c) 2000-2013 XXXX), long lists of numbers are a performance hog for now. There are other cases where scanning data files is not great, especially when these are big files. Or, for instance, the firmwares in the kernel are mostly hex blobs. There I still want to scan them even though they may be huge and blob-like: they have the most byzantine licenses of all and are worth reporting.
So we need a better way to skip scanning certain files that are pure data. And in these cases just issue a warning that they were not scanned fully and likely only scan the first and last hundred lines.
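A minimal sketch of the "first and last hundred lines" idea, assuming the file is read as an iterable of lines. The helper name `head_and_tail_lines` and its return shape are hypothetical. The `skipped` count is there so a caller can issue the "not scanned fully" warning.

```python
from collections import deque
from itertools import islice


def head_and_tail_lines(lines, count=100):
    """Return (head, tail, skipped) for an iterable of lines.

    `head` holds the first `count` lines, `tail` the last `count` of the
    remaining lines, and `skipped` the number of lines in between that
    were dropped. The input is consumed once and never held in full.
    """
    iterator = iter(lines)
    head = list(islice(iterator, count))
    tail = deque(maxlen=count)
    skipped = 0
    for line in iterator:
        if len(tail) == count:
            # The oldest buffered line is about to be pushed out unscanned
            skipped += 1
        tail.append(line)
    return head, list(tail), skipped
```

If `skipped` is greater than zero, the scan result could carry a warning that the file was only partially scanned.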
There are a couple of patterns, like binary-only data files, which are easy to ignore. Or text files with lines made of digits, punctuation and X (to catch hex and number lists). I wish I could use a quick entropy computation, but these lines may have a high entropy too. But in some cases they have a fixed-width format that is always the same and may be a clue to rely on: e.g. a formatted text where the format of each line itself is the same.
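To make the two clues concrete, here is a rough sketch: one check for lines made only of digits, hex letters, `x`/`X` and punctuation, and one measure of how many lines share the same width. All names, the character set and the thresholds are assumptions for illustration, not existing toolkit code. Including `a-f` in the character set is my own choice to catch hex dumps; the required digit is there to avoid matching short words made only of those letters.

```python
import re
from collections import Counter

# Assumed character set: digits, hex letters, x/X, whitespace, common punctuation
DATA_LINE = re.compile(r'^[0-9a-fA-FxX\s.,;:{}()\[\]+\-]+$')


def is_data_line(line):
    """Return True if a line looks like a list of numbers or hex values."""
    stripped = line.strip()
    if not stripped or not DATA_LINE.match(stripped):
        return False
    # Require at least one digit so words like "Added" do not qualify
    return any(c.isdigit() for c in stripped)


def data_line_ratio(lines):
    """Return the fraction of non-empty lines that look like pure data."""
    non_empty = [l for l in lines if l.strip()]
    if not non_empty:
        return 0.0
    return sum(1 for l in non_empty if is_data_line(l)) / len(non_empty)


def dominant_width_ratio(lines):
    """Return the fraction of non-empty lines sharing the most common width."""
    widths = Counter(len(l.rstrip('\r\n')) for l in lines if l.strip())
    if not widths:
        return 0.0
    return widths.most_common(1)[0][1] / sum(widths.values())
```

How to combine the two ratios, and which thresholds to pick, is left open here. A high value on both would be a strong hint that the file is pure data.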
One example of data files (e.g. *.data) is in http://http.debian.net/debian/pool/main/a/ask/ask_1.0.1.orig.tar.gz Other .hex files are found in the Linux kernel.