Introduce notion of Package set

JonoYang commented 1 year ago

We have the case where we have different instances of the same package. For example, for a Maven package, we can have the source, test, or doc JAR for that package. These JARs are part of the same package but would be considered different packages in the PackageDB, as they have different download URLs. These JARs would have similar purl values (same type, namespace, name, and version) but different subpath and qualifiers. We want to be able to relate this group of Packages together so that we can combine all the findings from the different varieties of a Package and return the data to the user.

After some discussion with @pombredanne, we will add two new fields to the Package model:

package_set (working name)
- This field contains a UUID or other unique identifier for a group of packages. This ID would be unique to a particular purl (type, namespace, name, version) combination. The ID is generated when a Package with a purl that is not yet in the PackageDB is created. When another package with the same type, namespace, name, and version is created, it would reuse the same ID.
package_content
- This field contains a tag that describes what is in this package, either source, binary, doc, tests, etc.

There would be an API endpoint (or equivalent), where we query for a package, then we would combine the metadata based on some set precedence (curated package data, source data, then binary, etc) and return the combined package data to the user.

@pombredanne @DennisClark

I'd like to hear your thoughts on grouping like packages via ID

DennisClark commented 1 year ago

@JonoYang Package Set is a great idea. I think it would help to get some concrete examples of the proposed new field package_content since I am having trouble visualizing the detailed values in such a field.

JonoYang commented 1 year ago

@DennisClark

There are multiple JARs of log4j-core v2.0 at https://repo1.maven.org/maven2/org/apache/logging/log4j/log4j-core/2.0/

We would create an entry in the PackageDB for each JAR. The tag in package_content describes the category that the files within the JAR fall under. log4j-core-2.0-javadoc.jar would have a package_content value of doc, log4j-core-2.0-sources.jar would be source, log4j-core-2.0-tests.jar would be test, and log4j-core-2.0.jar would be binary.
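
As a rough illustration of the tagging above, here is a minimal sketch that derives a package_content tag from a Maven JAR filename. The helper name guess_package_content and the classifier mapping are assumptions made for this example (they only mirror the four log4j-core JARs listed here) and are not the purldb implementation.

```python
# Hypothetical helper: the mapping mirrors the four log4j-core examples in
# this comment; it is not the purldb implementation.
CLASSIFIER_TO_CONTENT = {
    "javadoc": "doc",
    "sources": "source",
    "tests": "test",
}


def guess_package_content(filename, artifact_id, version):
    """Return a package_content tag for a Maven JAR filename, or None."""
    prefix = f"{artifact_id}-{version}"
    if not (filename.startswith(prefix) and filename.endswith(".jar")):
        return None
    # What remains between "<artifact>-<version>" and ".jar" is the classifier.
    classifier = filename[len(prefix):-len(".jar")].lstrip("-")
    if not classifier:
        return "binary"  # no classifier: assumed to be the main JAR
    # Unknown classifiers return None rather than a guessed tag.
    return CLASSIFIER_TO_CONTENT.get(classifier)
```

Classifiers that are not in the mapping return None, so an unrecognized JAR is left untagged instead of being mislabeled.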

DennisClark commented 1 year ago

@JonoYang thanks -- my original problem was thinking that package_content referred to the whole set, but instead we are planning to choose a specific label for each jar -- that makes a lot of sense!

DennisClark commented 1 year ago

@JonoYang and of course, I also wondered if the package_set concept might be applied to technologies other than Java, where a group of packages might be considered part of a set.

pombredanne commented 1 year ago

@JonoYang and of course, I also wondered if the package_set concept might be applied to technologies other than Java, where a group of packages might be considered part of a set.

@DennisClark I think so. For instance, these could be sets:

all the different wheels of an lxml release and the source tarballs: https://pypi.org/project/lxml/#files
all the binary packages of Python 3.7 in Debian as well as the "orig" source and the "debian" patches https://packages.debian.org/buster/libpython3.7
a source rpm and all its binaries
a npm tarball and its GitHub source repo

pombredanne commented 1 year ago

@JonoYang you wrote:

There would be an API endpoint (or equivalent), where we query for a package, then we would combine the metadata based on some set precedence (curated package data, source data, then binary, etc) and return the combined package data to the user.

I guess some examples of how we could combine metadata could be either generic or package type specific or even package name or PURL-specific.

a generic way license may flow through may be: Git repo -> source archive -> binary build
a type-specific way may be: maven JAR -> maven binary JAR

AyanSinhaMahapatra commented 1 year ago

Btw, adding another example for this: https://repo1.maven.org/maven2/commons-daemon/commons-daemon/1.3.3/ We have a lot of variations of Java source JARs, native JARs, and other types of packages here.

JonoYang commented 1 year ago

The package_content field is an IntegerField with the following choices as values:

CURATION = 1
- This is a special package whose data maps to a particular package in a package set. Curations contain package data that has been curated by a user or from some other source.
PATCH = 2
SOURCE_REPO = 3
SOURCE_ARCHIVE = 4
BINARY = 5
TEST = 6
DOC = 7

For now, this field will be populated for new Packages created using the maven or npm on-demand Package mining code. Existing Packages will have a null value in this field. In the future, we will have "improvers", which are tasks that update Package data, and we'll have an improver that sets package_content. (Related: another possible improver would be one that groups maven Packages with the same SHA1s into the same package_set.)

When we are in get_enhanced_package, we have a package and the other packages from the same package set. We order the other packages based on package_content and purl fields. When we encounter a package that is of the same package_content as us, we skip it. We then update the following fields, if the package we are looking at has no values in them already:

UPDATEABLE_FIELDS = [
    'primary_language',
    'copyright',

    'declared_license_expression',
    'declared_license_expression_spdx',
    'license_detections',
    'other_license_expression',
    'other_license_expression_spdx',
    'other_license_detections',
    # TODO: update extracted license statement and other fields together
    # all license fields are based off of `extracted_license_statement` and should be treated as a unit
    # hold off for now
    'extracted_license_statement',

    'notice_text',
    'api_data_url',
    'bug_tracking_url',
    'code_view_url',
    'vcs_url',
    'source_packages',
    'repository_homepage_url',
    'dependencies',
    'parties',
]
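
The merge procedure described above can be sketched as follows. This is only an illustration under stated assumptions: packages are plain dicts rather than model instances, FIELDS is a shortened stand-in for UPDATEABLE_FIELDS, the function name enhance_package is hypothetical, and skipping siblings with a null package_content is a choice made for this example. The real get_enhanced_package may differ.

```python
# Sketch only: FIELDS is a shortened stand-in for UPDATEABLE_FIELDS.
FIELDS = ["primary_language", "copyright", "declared_license_expression", "vcs_url"]


def enhance_package(package, siblings, fields=FIELDS):
    """Fill empty fields of `package` from the other packages in its set."""
    enhanced = dict(package)  # work on a copy, leave the input untouched
    # Lower package_content values come first (CURATION = 1 ... DOC = 7);
    # the purl is a tie-breaker so the order is deterministic.
    # Assumption: siblings without a package_content are ignored.
    ordered = sorted(
        (s for s in siblings if s.get("package_content") is not None),
        key=lambda s: (s["package_content"], s.get("purl", "")),
    )
    for sibling in ordered:
        if sibling["package_content"] == package.get("package_content"):
            continue  # same kind of package as us: skip it
        for field in fields:
            # Only fill a field that has no value yet.
            if not enhanced.get(field) and sibling.get(field):
                enhanced[field] = sibling[field]
    return enhanced
```

Because siblings are visited in package_content order, a value from a curation wins over one from a source archive, and values already present on the queried package are never overwritten.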

DennisClark commented 1 year ago

@JonoYang glad to see the progress report here! My only concern is how the consumer of scan results will know that, for example, package_content=3 means that the package is a SOURCE_REPO. Tools can be made smart enough to figure that out of course, but it means that someone looking at the raw JSON output might be unable to interpret the "3".

JonoYang commented 1 year ago

@DennisClark

Good catch! I'll update the Package serializers to return the package_content value's label, instead of the value.
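
A minimal sketch of that value-to-label conversion, independent of any serializer framework. The enum values are copied from the choices listed earlier in this thread; the class name PackageContentType, the helper package_content_label, and the lowercase label format are assumptions made for this example.

```python
from enum import IntEnum


class PackageContentType(IntEnum):
    # Values copied from the choices listed earlier in this thread.
    CURATION = 1
    PATCH = 2
    SOURCE_REPO = 3
    SOURCE_ARCHIVE = 4
    BINARY = 5
    TEST = 6
    DOC = 7


def package_content_label(value):
    """Return a readable label for a stored package_content value."""
    if value is None:
        return None  # existing Packages may have no package_content yet
    # Raises ValueError for an integer that is not one of the choices.
    return PackageContentType(value).name.lower()
```

With this, a stored value of 3 would be reported as "source_repo" in the JSON output instead of a bare integer.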

JonoYang commented 1 year ago

We're running into an issue where multiple packages share the same source_repo package. This conflicts with the DB constraint that purl values and download_url must be unique together.

A solution may be to allow packages to be in multiple package sets. This would involve adding the field package_sets to the Package model.

Right now, we set the package_set value at package insert time, where we check to see if an existing package with the same purl values as the package we're about to make exists. If it does, we use that package's package_set value for our new package. If it doesn't, we create a new package_set value using uuid4().
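
The insert-time logic described above can be sketched like this. The dict passed as index is a hypothetical in-memory stand-in for the PackageDB lookup, and the function name assign_package_set is an assumption made for this example; only the use of uuid4() comes from the description.

```python
import uuid


def assign_package_set(index, purl_type, namespace, name, version):
    """Return the package_set for these purl values, creating one if needed.

    `index` is a hypothetical stand-in for the PackageDB query: a dict that
    maps (type, namespace, name, version) to a package_set UUID.
    """
    key = (purl_type, namespace, name, version)
    if key not in index:
        # No existing package with these purl values: start a new set.
        index[key] = uuid.uuid4()
    return index[key]
```

Two packages that differ only in qualifiers or subpath (for example, the sources and javadoc JARs of one release) end up with the same package_set, while a different version gets a new one.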

The source_repo packages we want to create have a different purl than the packages they are for, so it is not always a straightforward process to automatically associate a source_repo package with its binary or source_archive package. We would have to do this by hand, knowing which Packages to associate our new source_repo package with.

aboutcode-org / purldb

Introduce notion of Package set #95