aboutcode-org / purldb

Tools to create and expose a database of purls (Package URLs). This project is sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase/ and nexB for https://www.aboutcode.org/ Chat is at https://gitter.im/aboutcode-org/discuss
https://purldb.readthedocs.io/

Introduce notion of Package set #95

Open JonoYang opened 1 year ago

JonoYang commented 1 year ago

We have the case where there are different instances of the same package. For example, a maven Package can have a source, test, or doc JAR. These JARs are part of the same package but would be considered different packages in the PackageDB, as they have different download URLs. These JARs would have similar purl values (same type, namespace, name, and version) but different subpath and qualifiers. We want to be able to relate this group of Packages together so that we can combine all the findings from the different varieties of a Package and return the data to the user.

After some discussion with @pombredanne, we will add two new fields to the Package model: package_set and package_content.

There would be an API endpoint (or equivalent) where we query for a package. We would then combine the metadata based on some set precedence (curated package data, then source data, then binary, etc.) and return the combined package data to the user.

@pombredanne @DennisClark

I'd like to hear your thoughts on grouping like packages via ID

DennisClark commented 1 year ago

@JonoYang Package Set is a great idea. I think it would help to get some concrete examples of the proposed new field package_content, since I am having trouble visualizing the detailed values in such a field.

JonoYang commented 1 year ago

@DennisClark

There are multiple JARs of log4j-core v2.0 at https://repo1.maven.org/maven2/org/apache/logging/log4j/log4j-core/2.0/

We would create an entry in the PackageDB for each JAR. The tag in package_content is the category that the files within the JAR fall under. log4j-core-2.0-javadoc.jar would have a package_content value of doc, log4j-core-2.0-sources.jar would be source, log4j-core-2.0-tests.jar would be test, and log4j-core-2.0.jar would be binary.
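As a rough illustration of the labelling described above, here is a minimal Python sketch that derives a package_content label from a Maven JAR filename. The function name `package_content_for_jar` and the classifier table are hypothetical and only cover the four log4j-core examples from this comment; the real mining code may work differently.

```python
# Hypothetical table: Maven classifier suffix -> package_content label.
# Only the classifiers mentioned in this thread are listed.
CLASSIFIER_TO_CONTENT = {
    "sources": "source",
    "javadoc": "doc",
    "tests": "test",
}


def package_content_for_jar(filename, artifact_id, version):
    """Return a package_content label for a Maven JAR filename.

    Returns None when the classifier is not one we recognise.
    """
    prefix = f"{artifact_id}-{version}"
    if not filename.startswith(prefix) or not filename.endswith(".jar"):
        raise ValueError(f"{filename!r} is not a JAR of {prefix}")

    # What sits between "<artifact>-<version>" and ".jar" is the classifier.
    classifier = filename[len(prefix):-len(".jar")].lstrip("-")
    if not classifier:
        # No classifier: the main artifact, treated as the binary JAR.
        return "binary"
    return CLASSIFIER_TO_CONTENT.get(classifier)


for name in (
    "log4j-core-2.0-javadoc.jar",
    "log4j-core-2.0-sources.jar",
    "log4j-core-2.0-tests.jar",
    "log4j-core-2.0.jar",
):
    print(name, "->", package_content_for_jar(name, "log4j-core", "2.0"))
```

Each JAR keeps its own PackageDB entry; the label only records which kind of content that entry holds.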

DennisClark commented 1 year ago

@JonoYang thanks -- my original problem was thinking that package_content referred to the whole set, but instead we are planning to choose a specific label for each jar -- that makes a lot of sense!

DennisClark commented 1 year ago

@JonoYang and of course, I also wondered if the package_set concept might be applied to technologies other than Java, where a group of packages might be considered part of a set.

pombredanne commented 1 year ago

@JonoYang and of course, I also wondered if the package_set concept might be applied to technologies other than Java, where a group of packages might be considered part of a set.

@DennisClark I think so. For instance, these could be sets:

pombredanne commented 1 year ago

@JonoYang you wrote:

There would be an API endpoint (or equivalent) where we query for a package. We would then combine the metadata based on some set precedence (curated package data, then source data, then binary, etc.) and return the combined package data to the user.

I guess some examples of how we could combine metadata could be either generic or package type specific or even package name or PURL-specific.

AyanSinhaMahapatra commented 1 year ago

Btw, adding another example for this: https://repo1.maven.org/maven2/commons-daemon/commons-daemon/1.3.3/ We have a lot of variations of Java source JARs, native JARs, and other types of packages here.

JonoYang commented 1 year ago

The package_content field is an IntegerField with the following choices as values:

For now, this field will be populated for new Packages created using the maven or npm on-demand Package mining code. Existing Packages will have a null value in this field. In the future, we will have "improvers", which are tasks that update Package data, and one of these improvers will set package_content. (Related: another possible improver would be one that groups maven Packages with the same SHA1 into the same package_set.)

When we are in get_enhanced_package, we have a package and the other packages from the same package set. We order the other packages by their package_content and purl fields. When we encounter a package that has the same package_content as ours, we skip it. Otherwise, for each of the following fields that has no value on our package, we take the value from the other package:

UPDATEABLE_FIELDS = [
    'primary_language',
    'copyright',

    'declared_license_expression',
    'declared_license_expression_spdx',
    'license_detections',
    'other_license_expression',
    'other_license_expression_spdx',
    'other_license_detections',
    # TODO: update extracted license statement and other fields together
    # all license fields are based off of `extracted_license_statement` and should be treated as a unit
    # hold off for now
    'extracted_license_statement',

    'notice_text',
    'api_data_url',
    'bug_tracking_url',
    'code_view_url',
    'vcs_url',
    'source_packages',
    'repository_homepage_url',
    'dependencies',
    'parties',
]
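
The merge described above can be sketched in plain Python. This is a simplified illustration, not the actual get_enhanced_package code: packages are plain dicts, the field list is shortened, the function name `enhance_package_data` and the sample values are made up, and it assumes that a lower package_content integer means higher precedence.

```python
# Shortened stand-in for the UPDATEABLE_FIELDS list above.
UPDATEABLE_FIELDS = ["primary_language", "copyright", "vcs_url"]


def enhance_package_data(package, package_set):
    """Return a copy of `package` with empty fields filled from its set."""
    enhanced = dict(package)  # do not mutate the stored package

    # Order siblings by package_content, then purl. Packages with a null
    # package_content sort last (assumption for this sketch).
    siblings = sorted(
        (p for p in package_set if p is not package),
        key=lambda p: (
            p["package_content"] is None,
            p["package_content"] or 0,
            p["purl"],
        ),
    )

    for other in siblings:
        # Skip packages of the same kind as ours.
        if other["package_content"] == package["package_content"]:
            continue
        for field in UPDATEABLE_FIELDS:
            # Only fill fields that have no value yet.
            if not enhanced.get(field) and other.get(field):
                enhanced[field] = other[field]
    return enhanced


# Made-up sample data for illustration only.
binary_jar = {
    "purl": "pkg:maven/org.apache.logging.log4j/log4j-core@2.0",
    "package_content": 5,
    "primary_language": "Java",
    "copyright": "",
}
sources_jar = {
    "purl": "pkg:maven/org.apache.logging.log4j/log4j-core@2.0?classifier=sources",
    "package_content": 4,
    "copyright": "Copyright (c) Example Holder",
}

print(enhance_package_data(binary_jar, [binary_jar, sources_jar]))
```

Because siblings are visited in precedence order and a field is filled only once, the first sibling that has a value for a field wins.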

DennisClark commented 1 year ago

@JonoYang glad to see the progress report here! My only concern is how the consumer of scan results will know that, for example, package_content=3 means the package is a SOURCE_REPO. Tools can be made smart enough to figure that out, of course, but it means that someone looking at the raw JSON output might be unable to interpret the "3".

JonoYang commented 1 year ago

@DennisClark

Good catch! I'll update the Package serializers to return the package_content value's label, instead of the value.
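A minimal sketch of what returning the label could look like, independent of any serializer framework. Only the pairing of 3 with SOURCE_REPO comes from this thread; the other enum members and their numbers, the class name, and the function name `package_content_label` are placeholders for illustration.

```python
from enum import IntEnum


class PackageContentType(IntEnum):
    # Only SOURCE_REPO = 3 is stated in this thread.
    # The remaining members and values are placeholders.
    SOURCE_REPO = 3
    SOURCE_ARCHIVE = 4
    BINARY = 5


def package_content_label(value):
    """Translate a stored package_content integer into a readable label."""
    if value is None:
        # Existing Packages may have no package_content set.
        return None
    return PackageContentType(value).name.lower()


print(package_content_label(3))
print(package_content_label(None))
```

With this, someone reading the raw JSON sees a name rather than a bare integer.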

JonoYang commented 1 year ago

We're running into an issue when multiple packages share the same source_repo package: this conflicts with the database constraint that a Package's purl values and download_url must be unique together.

A solution may be to allow packages to be in multiple package sets. This would involve adding the field package_sets to the Package model.

Right now, we set the package_set value at package insert time, where we check to see if an existing package with the same purl values as the package we're about to make exists. If it does, we use that package's package_set value for our new package. If it doesn't, we create a new package_set value using uuid4().
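The insert-time assignment described above can be sketched with an in-memory store. This is an illustration only: the class `InMemoryPackageStore` and the helper `purl_core` are hypothetical, and the sketch assumes that "same purl values" means the same type, namespace, name, and version, that is, the purl with its qualifiers and subpath removed.

```python
import uuid


def purl_core(purl):
    """Strip the subpath (after '#') and qualifiers (after '?') from a purl."""
    return purl.split("#", 1)[0].split("?", 1)[0]


class InMemoryPackageStore:
    """Stand-in for the Package table, just enough to show the logic."""

    def __init__(self):
        self.packages = []
        self._set_by_core = {}

    def insert(self, purl, download_url):
        core = purl_core(purl)
        package_set = self._set_by_core.get(core)
        if package_set is None:
            # No existing package with these purl values: start a new set.
            package_set = uuid.uuid4()
            self._set_by_core[core] = package_set
        package = {
            "purl": purl,
            "download_url": download_url,
            "package_set": package_set,
        }
        self.packages.append(package)
        return package


store = InMemoryPackageStore()
base = "https://repo1.maven.org/maven2/org/apache/logging/log4j/log4j-core/2.0/"
binary = store.insert(
    "pkg:maven/org.apache.logging.log4j/log4j-core@2.0",
    base + "log4j-core-2.0.jar",
)
sources = store.insert(
    "pkg:maven/org.apache.logging.log4j/log4j-core@2.0?classifier=sources",
    base + "log4j-core-2.0-sources.jar",
)
print(binary["package_set"] == sources["package_set"])
```

A source_repo package has different purl values than the packages it belongs to, so a lookup like this one would not place it in their set.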

The source_repo packages we want to create have a different purl than the packages they are for, so automatically associating a source_repo package with its binary or source_archive package is not always straightforward. We would have to do this by hand, knowing which Packages to associate our new source_repo package with.