aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, and dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/

Proposal: high level file classification #426

Open pombredanne opened 7 years ago

pombredanne commented 7 years ago

To support #377 and other scan-based deduction and related refinements, an important step is to "classify" the files in the codebase being scanned. This would mean defining a few high level buckets and heuristics to classify a file in a bucket.

With such classification, smarter results could be provided: for instance, the license of documentation files or build scripts does not have the same impact as the license of the main code (and such files may often not be part of a build or of the redistributed software as used in a system or app).

I am opening this up for discussion to define the classifications. I think there should be as few classifications as possible. They could be part of a hierarchy, but flat is probably better and simpler.

Here is a first shot at what these classes could be:

Note that a file may end up in more than one class... I am not sure this would be a good thing.

Besides this classification, determining whether a file is deployed or not deployed as part of a production build, and built vs. not built, is another topic altogether that would not be covered explicitly here.
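To make the idea concrete, here is a minimal sketch of what such bucket-based classification heuristics could look like. The bucket names and file-name patterns below are hypothetical examples for discussion, not an actual ScanCode implementation:

```python
import fnmatch
import os

# Hypothetical buckets and name patterns -- illustrative only,
# not ScanCode's actual classification.
CLASSES = {
    "legal": ["license*", "licence*", "copying*", "notice*"],
    "doc": ["readme*", "changelog*", "*.md", "*.rst", "*.txt"],
    "build": ["makefile*", "*.mk", "setup.py", "*.gradle", "cmakelists.txt"],
    "test": ["test_*", "*_test.*"],
}

def classify(path):
    """Return the set of classes a file path falls into.

    A file may match more than one class, as noted above
    (e.g. LICENSE.txt matches both "legal" and "doc").
    """
    name = os.path.basename(path).lower()
    classes = set()
    for cls, patterns in CLASSES.items():
        if any(fnmatch.fnmatch(name, p) for p in patterns):
            classes.add(cls)
    return classes or {"other"}
```

This also shows the multi-class question in practice: a file like LICENSE.txt would land in two buckets unless the heuristics enforce a precedence order.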

steven-esser commented 7 years ago

I think pointing out or emphasizing metadata files like LICENSE or COPYING in scan results would be a great addition. When I am doing analysis of 3rd party stuff, these are the first things I look at, and if a project takes the time to include them, they are almost always correct.

If these files appeared somewhere near the top of scan results wherever they are being viewed (the HTML app or AboutCode Manager), that would really be helpful during analysis.

steven-esser commented 7 years ago

@pombredanne Would this make more sense as another fileinfo scan field, or as an additional thing added on after the fact, like the scan_errors field is for each file?

pombredanne commented 7 years ago

@majurg Sorry for the late reply! A fileinfo field makes the most sense.

pombredanne commented 6 years ago

From @mjherzog in #873, which is moved here instead:

We currently have several "file type" fields returned from a scan:

For this topic, I will ignore Type since this just covers File vs Directory and focus on files only. We need some simpler way to identify the file type in one field to facilitate filtering in AboutCode Manager and other tools. MIME Type and File Type each have pros and cons.

In many cases MIME Type seems more useful because it summarizes the type a bit more: e.g. "text/x-shellscript" is probably more useful than the corresponding File Types like "Bourne-Again shell script, ASCII text executable" and "POSIX shell script, ASCII text executable", because I primarily want to find all of the script files (which often do not have an extension).

It may be the case that we could get the best result with a new Summary File Type field where the possible values are: Binary, Archive, Text, Media, Source or Script, but I am not sure whether a scan will resolve to only one of these values (presumably we have multiple fields today because of some overlap).

The primary use case is that I want to easily filter for Binary and Source code files, which are the primary targets for analysis. The secondary use case is to easily filter for groups like Script or Media files. This will also be important for filtering DeltaCode results to set up alerts/warnings for code files, but ignore or lower the priority of changes to Script or Media files.

I reviewed some scans and noticed many shell script files show up as Text rather than Script so the current identification of Script: true/false is not going to help much.
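A Summary File Type could be derived by collapsing the existing type fields with a precedence order. The sketch below is one possible reduction using the buckets proposed above (Binary, Archive, Text, Media, Source, Script); the function name, parameters, and precedence are assumptions, not an implemented ScanCode field:

```python
def summary_file_type(mime_type, file_type, is_binary, is_archive):
    """Collapse several per-file type fields into one coarse value.

    Hypothetical sketch: buckets and precedence follow the proposal
    above, not any actual ScanCode output.
    """
    mime = (mime_type or "").lower()
    ftype = (file_type or "").lower()
    if is_archive:
        return "Archive"
    if mime.startswith(("image/", "audio/", "video/")):
        return "Media"
    # Catch shell scripts even when they would otherwise fall into Text
    if "script" in mime or "script" in ftype:
        return "Script"
    if mime.startswith("text/x-") or "source" in ftype:
        return "Source"
    if is_binary:
        return "Binary"
    if mime.startswith("text/"):
        return "Text"
    return "Other"
```

Checking for "script" in either field before falling through to Text would address the problem noted above, where shell script files show up as plain Text.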

pombredanne commented 6 years ago

Something to consider is ClearlyDefined facets. It would be best to align classifications with these.

See https://github.com/clearlydefined/website/blob/2639d4ed878d199a2eb381fb3448d1b74875cd1f/src/components/FacetSelect.js#L10 and https://github.com/clearlydefined/clearlydefined/blob/8f58a9a216cf7c129fe2cf6abe1cc6f960535e0b/docs/clearly.md#facets

Also the notion of "scope" for dependencies is closely related. See https://github.com/heremaps/oss-review-toolkit/blob/master/model/src/main/kotlin/Scope.kt#L27
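As a discussion aid, the alignment could be as simple as a mapping table. The class names on the left are the hypothetical buckets from this thread; the facet names are the ones used in the ClearlyDefined links above (core, data, dev, docs, examples, tests):

```python
# Rough, illustrative alignment between proposed file classes and
# ClearlyDefined facets. The left-hand class names are hypothetical.
CLASS_TO_FACET = {
    "source": "core",
    "legal": "core",
    "build": "dev",
    "doc": "docs",
    "test": "tests",
    "example": "examples",
}

def facet_for(file_class):
    # ClearlyDefined treats files not in any other facet as "core",
    # so default to that.
    return CLASS_TO_FACET.get(file_class, "core")
```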

mjherzog commented 4 years ago

Some comments:

mjherzog commented 4 years ago

#1754 Prototype new summary/primary Content Type

viragumathe5 commented 4 years ago

@pombredanne I really want to comment on this, and to achieve it I think:

  1. Recently I was looking through the documentation of various projects. If a directory is for documentation, it tends to contain mostly Markdown, HTML, or YML files; if a directory holds plugins, it tends to contain mostly script files plus some tests (which also include script files). So we could build datasets of file-type counts per directory and use them to decide each directory's type.

  2. Another way is to enumerate all the formats that specific files can have and map them one by one per directory. For example, to find an archive directory, I would write a script that knows all the formats an archive file can have, then map and check for them. But it would still be difficult to state the directory's type with certainty.

So the first way is probably easier to implement and sounds practical.