aboutcode-org / scancode.io

ScanCode.io is a server to script and automate software composition analysis with ScanPipe pipelines. This project is sponsored by the NLnet project https://nlnet.nl/project/vulnerabilitydatabase/, Google Summer of Code, nexB and other generous sponsors!
https://scancodeio.readthedocs.io
Apache License 2.0

Treat each layer in a Docker image as a package of its own. #661

Open pombredanne opened 1 year ago

pombredanne commented 1 year ago

It would be useful to treat each layer in a Docker image as a package of its own. Why? A layer can be fetched individually, and even if a single layer is not of much value alone, it can technically be used alone. When stored as a package (say in the purldb), it becomes something that can be reused (e.g., reuse the scan, analysis, etc.). Of course, if we start treating each layer as a "package", the approach to combining the results of multiple overlaid layers would change, as we would possibly have two ways of scanning a layer (and therefore two different scan contents):

  1. Scan a layer solo, in which case we may get many details, such as the details of all the system packages if any system package was installed in that layer. In effect, the package databases contain everything: not only the packages installed in this layer, but also all the packages installed in previous layers.
  2. Scan a layer in context, e.g., after scanning the previous layer and subtracting from that layer the packages installed in previous layers. (This is the current behaviour.)
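
To make the difference between the two approaches concrete, here is a minimal sketch of the "in context" subtraction. It assumes each solo scan yields the full set of package identifiers found in that layer's package database. The function name `packages_in_context` and the sample purls are hypothetical and are not part of ScanCode.io.

```python
from typing import List, Set


def packages_in_context(solo_results: List[Set[str]]) -> List[Set[str]]:
    """
    Given the package sets found by scanning each layer solo (ordered from
    the base layer up), return for each layer only the packages that were
    not already reported by a lower layer.
    """
    seen: Set[str] = set()
    in_context: List[Set[str]] = []
    for found in solo_results:
        # Keep only what this layer adds on top of the layers below it.
        in_context.append(found - seen)
        seen |= found
    return in_context


# Hypothetical solo scan results: the second layer's package database
# still lists everything installed by the base layer.
solo = [
    {"pkg:deb/debian/libc6@2.36", "pkg:deb/debian/bash@5.2"},
    {
        "pkg:deb/debian/libc6@2.36",
        "pkg:deb/debian/bash@5.2",
        "pkg:deb/debian/curl@7.88",
    },
]

in_context = packages_in_context(solo)
```

In this sketch the base layer keeps both of its packages, while the second layer is reduced to the single package it introduced. Note that a plain set difference like this does not capture removals or updates.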

I am not sure a layer can ever be reused in isolation from its parent layer, or rather not always, as this could lead to aberrations. So there is some research to do before committing to one approach or the other.

These would be some of the actual specific issues to work out:

silverhook commented 1 year ago

I agree with the layer as package approach.

My first thought is that approach 2) would typically give more useful results, but I agree that it may lead to aberrations. I haven't made up my mind about that yet (and perhaps my technical skills are not at the point to do so either).

pombredanne commented 1 year ago

For reference see also

pombredanne commented 1 year ago

@silverhook you wrote:

I agree with the layer as package approach.

My first thought is that approach 2) would typically give more useful results, but I agree that it may lead to aberrations. I haven't made up my mind about that yet (and perhaps my technical skills are not at the point to do so either).

The thing is that every layer may contain package databases or metadata for every layer below, but a layer contains the actual installed package bits if and only if the package was installed in this specific layer. So the metadata duplication is an artifact of the layering, but it can also be subtle, as it expresses itself in multiple ways: a package can be added, removed or updated (a remove/add in practice).
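
As a rough illustration of those three cases, the sketch below compares the package database of one layer with that of the layer directly below it. It assumes each database can be reduced to a simple name-to-version mapping. The names `LayerDiff` and `diff_package_db` and the sample data are hypothetical and not part of ScanCode.io.

```python
from typing import Dict, NamedTuple, Tuple


class LayerDiff(NamedTuple):
    added: Dict[str, str]                # name -> version, new in this layer
    removed: Dict[str, str]              # name -> version, gone in this layer
    updated: Dict[str, Tuple[str, str]]  # name -> (old version, new version)


def diff_package_db(previous: Dict[str, str], current: Dict[str, str]) -> LayerDiff:
    """Classify the package changes between a parent layer and its child."""
    added = {n: v for n, v in current.items() if n not in previous}
    removed = {n: v for n, v in previous.items() if n not in current}
    # An update is a package present in both layers with a different version.
    updated = {
        n: (previous[n], v)
        for n, v in current.items()
        if n in previous and previous[n] != v
    }
    return LayerDiff(added, removed, updated)


# Hypothetical package databases for two consecutive layers.
previous = {"bash": "5.2", "curl": "7.88", "wget": "1.21"}
current = {"bash": "5.2", "curl": "8.0", "vim": "9.0"}

diff = diff_package_db(previous, current)
```

Here `bash` is unchanged and is only duplicated metadata, while `vim`, `wget` and `curl` land in the added, removed and updated groups respectively.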

pombredanne commented 3 months ago

See also https://github.com/nexB/scancode.io/issues/1189

mjherzog commented 3 months ago

I am worried about: