
Implement Alpine APKBUILD parser in packagedcode #2541

Open aalexanderr opened 3 years ago

aalexanderr commented 3 years ago

Short Description

Add a parser for Alpine's APKBUILD (apk package recipe) files that would live in src/packagedcode/alpine_build.py and return a Package object.

Possible Labels

copyright scan, email and url scan, license scan

Select Category

Describe the Update

Alpine packages lack some of the information necessary to generate a compliance report (e.g. copyright, full license text, source code & patches). That information is available only in the aports repository (each package references a commit sha in the aports repo, specifically in the APKBUILD files). This code would later be used in scancode.io to create a pipeline that would get the recipes for packages -> parse them and fetch source code, patches, etc. -> scan them & add the missing information gathered from the package recipe & its code.

How This Feature will help you/your organization

At ONAP we're trying to switch our images to Alpine, as it is a GPLv3-free base image (the ONAP Technical Steering Committee decided to avoid GPLv3 as much as possible). This will be a building block towards having complete information about Alpine packages in scancode.io, to be able to generate compliance documentation.

Possible Solution/Implementation Details

One issue found so far is bash parameter substitution being used in the recipes, which needs to be handled.
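For illustration only, here is a minimal, hypothetical APKBUILD-style fragment (not taken from aports) wrapped in a Python snippet; a naive key=value read leaves the source URL with unresolved variables, which is exactly the parameter substitution problem:

```python
# Hypothetical APKBUILD-style fragment for illustration; real recipes in
# aports are more involved (subpackages, functions, checksums, ...).
APKBUILD_EXAMPLE = """\
pkgname=hello
pkgver=1.2.3
pkgrel=0
license=GPL-3.0-or-later
source="https://example.org/$pkgname-$pkgver.tar.gz"
"""

# A naive "split on =" reading of the file leaves $pkgname/$pkgver
# unresolved in the source value.
for line in APKBUILD_EXAMPLE.splitlines():
    key, _, value = line.partition("=")
    print(key, "->", value.strip('"'))
# source -> https://example.org/$pkgname-$pkgver.tar.gz  (still unresolved)
```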

Example/Links if Any

https://wiki.alpinelinux.org/wiki/APKBUILD_Reference https://wiki.alpinelinux.org/wiki/APKBUILD_examples:Multiple_Subpackages

A bit related to #2061.

Can you help with this Feature

@quepop

quepop commented 3 years ago

As @aalexanderr probably mentioned, I'm working on the feature right now. I've already implemented fetching and parsing but I'm not sure how to split my code so it would fit properly. I think it should look something like this:

  1. When scancode.io analyzes a new Alpine Docker image, it requests a package object list from packagedcode/alpine.
  2. packagedcode/alpine extracts the installed packages and their info from the Alpine installed database that lives inside that Docker image.
  3. build_package() or get_installed_packages() runs some function(s) from packagedcode/alpine_build to extract the missing data before returning the package object(s).
  4. packagedcode/alpine_build downloads the needed resources (the aports repo) using fetchcode and parses the package-specific APKBUILD to extract source code download URLs and possibly more missing data.

Should packagedcode/alpine_build only provide source URLs for a package (so the rest would be handled in scancode.io) or should it also handle copyright extraction from the source code? The latter would be consistent with how, for example, packagedcode/debian handles copyrights: scancode.io receives a package object list that already has copyright info.
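As a rough sketch only, assuming hypothetical names (this is not the actual packagedcode API), the two options could be expressed as one data holder where the copyright field is only filled in the second case:

```python
# Hypothetical interface sketch; the names and return shape here are
# assumptions for discussion, not the actual packagedcode API.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ApkbuildData:
    name: str = ""
    version: str = ""
    # Option 1: only provide source URLs; scancode.io handles the rest.
    source_urls: List[str] = field(default_factory=list)
    # Option 2: also carry copyrights extracted from the downloaded sources,
    # mirroring how packagedcode/debian returns copyright info.
    copyright: Optional[str] = None


def parse_apkbuild(location: str) -> ApkbuildData:
    """Parse an APKBUILD file and return the data needed downstream."""
    raise NotImplementedError("sketch only")
```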

aalexanderr commented 3 years ago

@quepop I understand it as follows: the APKBUILD parser should currently live in scancode-toolkit's packagedcode/alpine_build.py, as that is the most logical place to have it right now without creating a new package. IMHO handling the aports repo (as in downloading it and checking it out at specific commits) should be done in scancode.io (using fetchcode), since from what I understand scancode-toolkit does not download any supporting material; it just analyzes what is given to it. Later down the line both alpine_build.py and the aports repo handling could be split out into an alpine-inspector package (a bit similar to https://github.com/nexB/debian-inspector).

quepop commented 3 years ago

I think we should use a cache dir to be able to reuse scan results (their id would be a combination of a package name and its version), so executing a pipeline on a new Alpine Docker image (project) could save some time (if, of course, said image contains a package name/version combination that existed in previous projects).
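A minimal sketch of that caching idea, assuming a hypothetical cache directory and JSON result files (neither is an agreed design in this thread):

```python
# Hypothetical cache sketch: the directory layout, key format and JSON
# serialization are assumptions for illustration only.
import json
from pathlib import Path
from typing import Optional

CACHE_DIR = Path.home() / ".cache" / "alpine-apkbuild-scans"


def cache_key(name: str, version: str) -> str:
    # A scan result is identified by the package name plus its version.
    return f"{name}-{version}"


def get_cached_scan(name: str, version: str) -> Optional[dict]:
    """Return a previously saved scan result, or None on a cache miss."""
    path = CACHE_DIR / f"{cache_key(name, version)}.json"
    if path.exists():
        return json.loads(path.read_text())
    return None


def save_scan(name: str, version: str, result: dict) -> None:
    """Persist a scan result so later projects can reuse it."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    (CACHE_DIR / f"{cache_key(name, version)}.json").write_text(json.dumps(result))
```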

pombredanne commented 3 years ago

@quepop Thank you++ I think in terms of code organization, things that are specific to Alpine should be in an alpine module. Things that would be generic (such as downloading each detected package's sources and scanning them for licenses) may be best in scancode.io for now?

This needs a bit of thinking, though do not let that slow you down! Here is a quick idea as a base:

@tdruez ^ FYI.

pombredanne commented 3 years ago

As discussed in https://github.com/nexB/purldb/issues/307, I am not super comfortable with running arbitrary shell scripts during a scan. I reckon that APKBUILDs may not be completely arbitrary and random, but once plugged in as a package manifest parser we could stumble on ill-formed or ill-intentioned, malicious APKBUILD files... Therefore static parsing and evaluation would be much better; even though there could be a few kinks to handle left and right at scale, this feels like a much safer approach.

For this I started this PR https://github.com/nexB/scancode-toolkit/pull/2598 that can parse and evaluate top-level variables in an APKBUILD. It does not deal with subpackages defined in functions for now... but evaluating an APKBUILD in a shell would not either, and the build would need to be launched to get the full details anyway.
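As a minimal sketch of that static-evaluation idea (an illustration only, not the code from that PR), resolving simple $var/${var} references across top-level assignments might look like this:

```python
# Minimal sketch of statically evaluating top-level APKBUILD variables;
# an illustration of the idea, not the implementation in the PR above.
import re

VAR_REF = re.compile(r"\$\{?(\w+)\}?")


def evaluate_top_level(text: str) -> dict:
    """Collect top-level name=value assignments, resolving $var references."""
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if not name.isidentifier():
            continue  # skip function bodies and other non-assignments
        value = value.strip().strip('"').strip("'")
        # Replace $var and ${var} with previously seen values, if any.
        value = VAR_REF.sub(lambda m: variables.get(m.group(1), m.group(0)), value)
        variables[name] = value
    return variables


if __name__ == "__main__":
    demo = 'pkgname=hello\npkgver=1.2.3\nsource="https://example.org/$pkgname-$pkgver.tar.gz"'
    print(evaluate_top_level(demo)["source"])
    # prints: https://example.org/hello-1.2.3.tar.gz
```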