aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/

Do not scan certain data-only data file #602

Open pombredanne opened 7 years ago

pombredanne commented 7 years ago

I have mixed feelings on data files: I want to scan them as they may contain licenses or clues AND I do not want to scan them if they are highly likely not to contain any clues.

For the copyright detection, which seeks possible dates and date ranges (such as Copyright (c) 2000-2013 XXXX), long lists of numbers are a performance hog for now. There are other cases where scanning data files is not great, especially when these are big files. On the other hand, the firmwares in the kernel are mostly hex blobs, and there I still want to scan them even if they are huge: they have the most byzantine licenses of all and are worth reporting.

So we need a better way to skip scanning certain files that are pure data. And in these cases just issue a warning that they were not scanned fully and likely only scan the first and last hundred lines.
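
A minimal sketch of the "first and last hundred lines" idea, in plain Python. The function name `head_and_tail` and the `limit` parameter are made up for illustration; this is not existing ScanCode code. It reads the input once and reports how many lines were skipped, so a warning can be issued.

```python
from collections import deque
from itertools import islice


def head_and_tail(lines, limit=100):
    """Return (kept_lines, skipped_count) for an iterable of lines.

    Keeps the first `limit` and the last `limit` lines and counts the
    lines in between that were dropped. Hypothetical helper.
    """
    iterator = iter(lines)
    # Take the first `limit` lines.
    head = list(islice(iterator, limit))
    # Keep only the last `limit` of the remaining lines.
    tail = deque(maxlen=limit)
    skipped = 0
    for line in iterator:
        if len(tail) == limit:
            # The oldest line in the window is about to be dropped.
            skipped += 1
        tail.append(line)
    return head + list(tail), skipped
```

A caller could scan only `kept_lines` and emit a "not scanned fully" warning whenever `skipped_count` is greater than zero.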

There are a couple of patterns, like binary-only data files, which are easy to ignore. Another is text files with lines made only of digits, punctuation and X (to catch hex and number lists). I wish I could use a quick entropy computation, but these lines may have a high entropy too. In some cases, though, they have a fixed-width format that is always the same and may be a clue to rely on: e.g. a formatted text where the format of each line itself is the same.
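
The two clues above (lines made only of digits, punctuation and hex letters, and a constant line width) could be checked roughly as follows. This is only a sketch: the character class, the 0.95 threshold and both function names are assumptions picked for illustration, not anything ScanCode ships.

```python
import re
from collections import Counter

# Assumed character class: digits, hex letters, X/x, whitespace and a
# little punctuation. Tune it for real data.
DATA_LINE = re.compile(r"[0-9A-Fa-fXx\s.,:;+\-]*")


def looks_like_data(lines, threshold=0.95):
    """True if at least `threshold` of the non-empty lines are data-like."""
    non_empty = [line.rstrip("\r\n") for line in lines if line.strip()]
    if not non_empty:
        return False
    matching = sum(1 for line in non_empty if DATA_LINE.fullmatch(line))
    return matching / len(non_empty) >= threshold


def dominant_width_ratio(lines):
    """Fraction of non-empty lines that share the most common line length."""
    widths = Counter(len(line.rstrip("\r\n")) for line in lines if line.strip())
    if not widths:
        return 0.0
    return widths.most_common(1)[0][1] / sum(widths.values())
```

A ratio close to 1.0 from `dominant_width_ratio` hints at a fixed-width format. Neither check alone proves that a file has no license text, so they are best treated as hints rather than as a reason to skip a file outright.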

One example of data files (e.g. *.data) is in http://http.debian.net/debian/pool/main/a/ask/ask_1.0.1.orig.tar.gz and other .hex files are found in the Linux kernel.

armijnhemel commented 7 years ago

The Intel hex files in the Linux kernel are easy to detect as they follow a fixed format:

https://en.wikipedia.org/wiki/Intel_HEX

This is easy to verify; see for instance #426.
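
As a rough illustration of how such a check might look, here is a sketch that validates records against the standard Intel HEX layout described on the Wikipedia page: a leading colon, a byte count, a 16-bit address, a record type, the data bytes, and a checksum that makes all bytes sum to zero modulo 256. The function names are made up, and variants of the format that use a different record layout are not handled.

```python
def is_intel_hex_record(line):
    """True if `line` is a well-formed standard Intel HEX record."""
    line = line.strip()
    if not line.startswith(":"):
        return False
    try:
        raw = bytes.fromhex(line[1:])
    except ValueError:
        # Not an even number of hex digits.
        return False
    # Minimum record: count (1) + address (2) + type (1) + checksum (1).
    if len(raw) < 5:
        return False
    # The first byte declares how many data bytes the record carries.
    if len(raw) != raw[0] + 5:
        return False
    # The checksum makes the sum of all record bytes 0 modulo 256.
    return sum(raw) % 256 == 0


def intel_hex_ratio(lines):
    """Fraction of non-empty lines that are valid Intel HEX records."""
    non_empty = [line for line in lines if line.strip()]
    if not non_empty:
        return 0.0
    valid = sum(1 for line in non_empty if is_intel_hex_record(line))
    return valid / len(non_empty)
```

Returning a ratio rather than a strict yes/no leaves room for files that are mostly hex records but also carry a few lines of other text.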

pombredanne commented 7 years ago

@armijnhemel Thanks. Note that hex files are still something that needs scanning, especially in the kernel. See https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/firmware/bnx2/bnx2-rv2p-06-6.0.15.fw.ihex#n360

Most hex firmwares there have the most byzantine and bizarre licensing. Several do not look like bona fide FLOSS licenses https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/firmware/emi26/bitstream.HEX#n4372 and are surely NOT available in a form suitable for further modifications... as an aside these binary-like, no-sources-available plugs look like a major security risk to me.

armijnhemel commented 7 years ago

True, but at least identifying the hex files as hex files would allow you to better zoom in.

pombredanne commented 3 years ago

Here is the TODO:

  1. ensure that we detect common binary data files in hex form
  2. same for certificates #620

In terms of implementation I suggest:

  1. This should be a contenttype flag such as is_binary_data
  2. later we can have ways to skip scanning these files

ashwanthdurairaj commented 3 years ago

Hi, I am trying to work on this issue. Are there any files in the repo that help the toolkit ignore certain files when scanning? Can anyone guide me to those files?