aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, and dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/

Proposal: high level file classification #426

Open pombredanne opened 7 years ago

pombredanne commented 7 years ago

To support #377 and other scan-based deduction and related refinements, an important step is to "classify" the files in the codebase being scanned. This would mean defining a few high level buckets and heuristics to classify a file in a bucket.

With such classification, smarter results could be provided: for instance, the license of documentation files or build scripts does not have the same impact as the license of the main code (and such files may often not be part of a build or of the redistributed software as used in a system or app).

I am opening this up for discussion to define the classifications. I think there should be as few classifications as possible. They could be part of a hierarchy, but flat is probably better and simpler.

Here is a first shot at what these classes could be:

Note that a file may end up in more than one class... I am not sure this would be a good thing.

Besides this classification, determining whether a file is deployed or not deployed as part of a production build, and built vs. not built, is another topic altogether that would not be covered explicitly here.
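To make the idea concrete, here is a minimal sketch of what such bucket-based classification heuristics could look like. The bucket names and file-name patterns below are hypothetical examples for discussion, not an actual ScanCode implementation:

```python
import fnmatch
import os

# Hypothetical buckets and name patterns -- illustrative only,
# not ScanCode's actual classification.
CLASSES = {
    "legal": ["license*", "licence*", "copying*", "notice*"],
    "doc": ["readme*", "changelog*", "*.md", "*.rst", "*.txt"],
    "build": ["makefile*", "*.mk", "setup.py", "*.gradle", "cmakelists.txt"],
    "test": ["test_*", "*_test.*"],
}

def classify(path):
    """Return the set of classes a file path falls into.

    A file may match more than one class, as noted above
    (e.g. LICENSE.txt matches both "legal" and "doc").
    """
    name = os.path.basename(path).lower()
    classes = set()
    for cls, patterns in CLASSES.items():
        if any(fnmatch.fnmatch(name, p) for p in patterns):
            classes.add(cls)
    return classes or {"other"}
```

This also shows the multi-class question in practice: a file like LICENSE.txt would land in two buckets unless the heuristics enforce a precedence order.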

steven-esser commented 7 years ago

I think pointing out or emphasizing metadata files like LICENSE or COPYING in scan results would be a great addition. When I am doing analysis of 3rd party stuff, these are the first things I look at, and if a project takes the time to include them, they are almost always correct.

If these files appeared somewhere near the top of scan results wherever they are being viewed (the HTML app or AboutCode Manager), that would really be helpful during analysis.

steven-esser commented 7 years ago

@pombredanne Would this make more sense as another fileinfo scan field, or as an additional thing added on after the fact, like the scan_errors field is for each file?

pombredanne commented 7 years ago

@majurg Sorry for the late reply! A fileinfo field makes the most sense.

pombredanne commented 6 years ago

From @mjherzog in #873, which is moved here instead:

We currently have several "file type" fields returned from a scan:

For this topic, I will ignore Type since this just covers File vs Directory and focus on files only. We need some simpler way to identify the file type in one field to facilitate filtering in AboutCode Manager and other tools. MIME Type and File Type each have pros and cons.

In many cases MIME Type seems more useful because it summarizes the type a bit more: e.g. "text/x-shellscript" is probably more useful than the corresponding File Types like "Bourne-Again shell script, ASCII text executable" and "POSIX shell script, ASCII text executable", because I primarily want to find all of the script files (which often do not have an extension).

It may be the case that we could get the best result with a new Summary File Type field where the possible values are: Binary, Archive, Text, Media, Source or Script, but I am not sure whether a scan will resolve to only one of these values (presumably we have multiple fields today because of some overlap).

The primary use case is that I want to easily filter for Binary and Source code files, which are the primary targets for analysis. The secondary use case is to easily filter for groups like Script or Media files. This will also be important for filtering DeltaCode results to set up alerts/warnings for code files, but ignore or lower the priority of changes to Script or Media files.

I reviewed some scans and noticed many shell script files show up as Text rather than Script so the current identification of Script: true/false is not going to help much.
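A Summary File Type could be derived by collapsing the existing type fields with a precedence order. The sketch below is one possible reduction using the buckets proposed above (Binary, Archive, Text, Media, Source, Script); the function name, parameters, and precedence are assumptions, not an implemented ScanCode field:

```python
def summary_file_type(mime_type, file_type, is_binary, is_archive):
    """Collapse several per-file type fields into one coarse value.

    Hypothetical sketch: buckets and precedence follow the proposal
    above, not any actual ScanCode output.
    """
    mime = (mime_type or "").lower()
    ftype = (file_type or "").lower()
    if is_archive:
        return "Archive"
    if mime.startswith(("image/", "audio/", "video/")):
        return "Media"
    # Catch shell scripts even when they would otherwise fall into Text
    if "script" in mime or "script" in ftype:
        return "Script"
    if mime.startswith("text/x-") or "source" in ftype:
        return "Source"
    if is_binary:
        return "Binary"
    if mime.startswith("text/"):
        return "Text"
    return "Other"
```

Checking for "script" in either field before falling through to Text would address the problem noted above, where shell script files show up as plain Text.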

pombredanne commented 6 years ago

Something to consider is ClearlyDefined facets. It would be best to align classifications with these.

See https://github.com/clearlydefined/website/blob/2639d4ed878d199a2eb381fb3448d1b74875cd1f/src/components/FacetSelect.js#L10 and https://github.com/clearlydefined/clearlydefined/blob/8f58a9a216cf7c129fe2cf6abe1cc6f960535e0b/docs/clearly.md#facets

Also the notion of "scope" for dependencies is closely related. See https://github.com/heremaps/oss-review-toolkit/blob/master/model/src/main/kotlin/Scope.kt#L27
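As a discussion aid, the alignment could be as simple as a mapping table. The class names on the left are the hypothetical buckets from this thread; the facet names are the ones used in the ClearlyDefined links above (core, data, dev, docs, examples, tests):

```python
# Rough, illustrative alignment between proposed file classes and
# ClearlyDefined facets. The left-hand class names are hypothetical.
CLASS_TO_FACET = {
    "source": "core",
    "legal": "core",
    "build": "dev",
    "doc": "docs",
    "test": "tests",
    "example": "examples",
}

def facet_for(file_class):
    # ClearlyDefined treats files not in any other facet as "core",
    # so default to that.
    return CLASS_TO_FACET.get(file_class, "core")
```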

mjherzog commented 4 years ago

Some comments:

mjherzog commented 4 years ago

#1754 Prototype new summary/primary Content Type

viragumathe5 commented 4 years ago

@pombredanne I really want to comment on this, and to achieve it I think:

  1. Recently I was looking through the documentation of various projects. If a directory is for documentation, it tends to contain mostly Markdown, HTML, or YML files; if a directory holds plugins, it tends to contain mostly script files plus some tests (which also include script files). So we could build datasets of file-type counts per directory and use them to decide each directory's type.

  2. Another way is to enumerate all the formats that specific files can have and map them one by one per directory. For example, to find an archive directory, I would write a script that knows all the formats an archive file can have, then map and check for them. But it would still be difficult to state the directory's type with certainty.

So the first way is probably easier to implement and sounds practical.