Treat each layer in a Docker image as a package of its own.

pombredanne commented 1 year ago

It would be useful to treat each layer in a Docker image as a package of its own. Why? Layers can be fetched individually, and even if a single layer is not of much value alone, it can technically be used alone. When stored as a package (say in the PurlDB), it becomes something that can be reused (e.g., reuse the scan, analysis, etc.). Of course, if we start treating each layer as a "package", the approach to combining the results of multiple overlaid layers would change, as we would possibly have two ways of scanning a layer (and therefore two different scan contents):

1. Scan a layer solo, in which case we may get many details, such as the details of all the system packages if any system package was installed in that layer. In effect, the package databases contain everything: not only the packages currently installed, but also all the packages installed in previous layers.
2. Scan a layer in context, i.e., after scanning the previous layer, subtracting from that layer the packages installed in previous layers. (This is the current behaviour.)
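
As a rough illustration of the difference between the two approaches, here is a minimal sketch. It assumes each layer's package database has already been read into a set of (name, version) pairs; the function names and sample packages are hypothetical and are not ScanCode.io APIs.

```python
def solo_scan(layer_db):
    """Solo scan: report everything the layer's package database lists."""
    return set(layer_db)


def in_context_scan(layer_db, parent_db):
    """In-context scan: report only entries not already listed by the parent layer."""
    return set(layer_db) - set(parent_db)


# Hypothetical (name, version) entries read from each layer's package database.
base_layer = {("libc6", "2.36"), ("bash", "5.2")}
# The upper layer's database repeats the base entries and adds one package.
app_layer = base_layer | {("curl", "7.88")}

solo = solo_scan(app_layer)                          # all three packages
in_context = in_context_scan(app_layer, base_layer)  # only curl
```

The solo result repeats the base layer's packages, which is the metadata duplication that the in-context scan removes.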

I am not sure a layer can ever be reused in isolation from its parent layer, or rather not always, as this would lead to aberrations. So there is some research to do before committing to one approach or the other.

These would be some of the actual specific issues to work out:

[ ] FetchCode: Fetch image/layer metadata from container registries
[ ] PurlDB: Identify and index base layers for common images in container registries for lookup and matching
[ ] ScanCode.io: Match container layers to PurlDB
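
For the first item, a layer is addressed by its content digest in the image manifest. The sketch below assumes an OCI image manifest has already been fetched as JSON and only shows how the layer digests could be listed. The `layer_digests` helper and the sample digests are hypothetical placeholders, not FetchCode code.

```python
import json


def layer_digests(manifest_json):
    """Return the layer digests listed in an OCI image manifest, base layer first."""
    manifest = json.loads(manifest_json)
    return [layer["digest"] for layer in manifest.get("layers", [])]


# Hypothetical manifest with placeholder digests, trimmed to the relevant fields.
sample_manifest = json.dumps({
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "layers": [
        {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
         "digest": "sha256:" + "a" * 64, "size": 1024},
        {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
         "digest": "sha256:" + "b" * 64, "size": 2048},
    ],
})

digests = layer_digests(sample_manifest)
```

Each digest could then serve as the lookup key when matching a layer against an index of known base layers.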

silverhook commented 1 year ago

I agree with the layer as package approach.

My first thought is that the results of approach 2) would typically be more useful, but I agree that it may lead to aberrations. I haven’t made up my mind about that yet (and perhaps my technical skills are not at the point to do so either).

pombredanne commented 1 year ago

For reference see also

pombredanne commented 1 year ago

@silverhook you wrote:

I agree with the layer as package approach.

My first thought is that the results of approach 2) would typically be more useful, but I agree that it may lead to aberrations. I haven’t made up my mind about that yet (and perhaps my technical skills are not at the point to do so either).

The thing is that every layer may contain package databases or metadata for every layer below, but a layer contains the actual installed package bits if and only if the package was installed in that specific layer. So the metadata duplication is an artifact of the layering, but it can also be subtle, as it expresses itself in multiple ways: a package can be added, removed, or updated (a remove/add in practice).
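
To make the added/removed/updated cases concrete, here is a minimal sketch. It assumes the package database of a layer and of its parent have each been read into a {name: version} mapping; `diff_layer_packages` and the sample data are hypothetical.

```python
def diff_layer_packages(parent, current):
    """Classify changes between a parent layer's and a layer's {name: version} snapshots."""
    added = {n: v for n, v in current.items() if n not in parent}
    removed = {n: v for n, v in parent.items() if n not in current}
    # An update shows up as the same name with a different version.
    updated = {
        n: (parent[n], v)
        for n, v in current.items()
        if n in parent and parent[n] != v
    }
    return added, removed, updated


# Hypothetical database snapshots for a parent layer and the layer above it.
parent_db = {"bash": "5.2", "openssl": "3.0.8", "wget": "1.21"}
current_db = {"bash": "5.2", "openssl": "3.0.9", "curl": "7.88"}

added, removed, updated = diff_layer_packages(parent_db, current_db)
```

Here `curl` is added, `wget` is removed, and `openssl` is updated, while `bash` is only duplicated metadata and appears in none of the three results.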

pombredanne commented 3 months ago

mjherzog commented 3 months ago

I am worried about:

confusing the content of a layer with a more traditional definition of a package.
distinguishing between layers that have something like package content and those that do not. There are often more layers with only permission or directory creation commands, or similar, without any installed software. Put more simply: some layers may look like packages, but many or most may not.

I think that we should focus more on using package-like analysis where applicable for container analysis, but avoid creating a new package type for this.

aboutcode-org / scancode.io

Treat each layer in a Docker image as a package of its own. #661