Open JonoYang opened 1 year ago
@JonoYang Package Set is a great idea. I think it would help to get some concrete examples of the proposed new field package_content, since I am having trouble visualizing the detailed values in such a field.
@DennisClark
There are multiple JARs of log4j-core v2.0 at https://repo1.maven.org/maven2/org/apache/logging/log4j/log4j-core/2.0/ and we would create an entry in the PackageDB for each JAR. The tag in package_content would indicate the category that the files within the JAR fall under: log4j-core-2.0-javadoc.jar would have a package_content value of doc, log4j-core-2.0-sources.jar would be source, log4j-core-2.0-tests.jar would be test, and log4j-core-2.0.jar would be binary.
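The labelling described above can be sketched roughly as follows. This is a minimal illustration, not the PackageDB implementation: the function name `guess_package_content`, the mapping table, and the filename-parsing approach are all assumptions; only the four labels (doc, source, test, binary) come from the comment above.

```python
# Hypothetical mapping from a Maven classifier to a package_content label.
# The labels come from the discussion; everything else is illustrative.
CLASSIFIER_TO_CONTENT = {
    "javadoc": "doc",
    "sources": "source",
    "tests": "test",
}


def guess_package_content(filename, name, version):
    """Return a package_content label for a Maven JAR filename, or None."""
    prefix = f"{name}-{version}"
    if not filename.startswith(prefix) or not filename.endswith(".jar"):
        return None
    # Whatever sits between "<name>-<version>" and ".jar" is the classifier.
    classifier = filename[len(prefix):-len(".jar")].lstrip("-")
    if not classifier:
        # No classifier: treat the main JAR as the binary.
        return "binary"
    # Unknown classifiers are left unlabelled rather than guessed.
    return CLASSIFIER_TO_CONTENT.get(classifier)


for jar in (
    "log4j-core-2.0-javadoc.jar",
    "log4j-core-2.0-sources.jar",
    "log4j-core-2.0-tests.jar",
    "log4j-core-2.0.jar",
):
    print(jar, guess_package_content(jar, "log4j-core", "2.0"))
```

Returning None for an unrecognized classifier keeps the sketch from mislabelling JARs such as native or platform-specific variants.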
@JonoYang thanks -- my original problem was thinking that package_content referred to the whole set, but instead we are planning to choose a specific label for each JAR -- that makes a lot of sense!
@JonoYang and of course, I also wondered if the package_set concept might be applied to technologies other than Java, where a group of packages might be considered part of a set.
@DennisClark I think so. For instance, these could be sets:
@JonoYang you wrote:
There would be an API endpoint (or equivalent), where we query for a package, then we would combine the metadata based on some set precedence (curated package data, source data, then binary, etc) and return the combined package data to the user.
I guess some examples of how we could combine metadata could be either generic or package type specific or even package name or PURL-specific.
Btw, adding another example for this: https://repo1.maven.org/maven2/commons-daemon/commons-daemon/1.3.3/ has a lot of variations of Java source JARs, native JARs, and other types of packages.
The package_content field is an IntegerField with the following choices as values:
For now, this field will be populated for new Packages created using the maven or npm on-demand Package mining code. Existing Packages will have a null value in this field. In the future, we will have "improvers", which are tasks that update Package data, and one improver will set package_content. (Related: another possible improver would group maven Packages that share the same SHA1 into the same package_set.)
When we are in get_enhanced_package, we have a package and the other packages from the same package set. We order the other packages by their package_content and purl fields. When we encounter a package that has the same package_content as ours, we skip it. Otherwise, we update the following fields, if the package we are enhancing has no values in them already:
UPDATEABLE_FIELDS = [
    'primary_language',
    'copyright',
    'declared_license_expression',
    'declared_license_expression_spdx',
    'license_detections',
    'other_license_expression',
    'other_license_expression_spdx',
    'other_license_detections',
    # TODO: update extracted license statement and other fields together
    # all license fields are based off of `extracted_license_statement` and should be treated as a unit
    # hold off for now
    'extracted_license_statement',
    'notice_text',
    'api_data_url',
    'bug_tracking_url',
    'code_view_url',
    'vcs_url',
    'source_packages',
    'repository_homepage_url',
    'dependencies',
    'parties',
]
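The merge step described above can be sketched as follows. This is a simplified illustration using plain dicts rather than the actual Package model; the function name `enhance_package` and its signature are assumptions, and it assumes every package in the set already has a non-null integer package_content.

```python
def enhance_package(package, siblings, updatable_fields):
    """Return a copy of `package` with empty fields filled from `siblings`.

    `package` and each sibling are plain dicts here (an assumption made
    for this sketch); `siblings` are the other packages in the same set.
    """
    enhanced = dict(package)
    # Order the other packages by package_content, then by purl.
    ordered = sorted(siblings, key=lambda p: (p["package_content"], p["purl"]))
    for other in ordered:
        # Skip packages of the same content type as the one being enhanced.
        if other["package_content"] == package["package_content"]:
            continue
        for field in updatable_fields:
            # Only fill fields that are still empty; never overwrite.
            if not enhanced.get(field) and other.get(field):
                enhanced[field] = other[field]
    return enhanced
```

Because a field is only filled while it is still empty, the sort order decides which sibling wins when several of them carry a value for the same field.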
@JonoYang glad to see the progress report here! My only concern is how a consumer of scan results will know that, for example, package_content=3 means the package is a SOURCE_REPO. Tools can be made smart enough to figure that out, of course, but someone looking at the raw JSON output might be unable to interpret the "3".
@DennisClark
Good catch! I'll update the Package serializers to return the package_content value's label, instead of the value.
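A rough sketch of what returning the label could look like, using plain dicts instead of the real serializers. Only the pairing of 3 with source_repo comes from this thread; the other table entries, and the names `PACKAGE_CONTENT_LABELS`, `serialize_package_content`, and `serialize_package`, are placeholders invented for illustration.

```python
# Hypothetical value-to-label table. Only 3 -> "source_repo" is taken from
# the discussion above; the other entries are placeholder values.
PACKAGE_CONTENT_LABELS = {
    3: "source_repo",
    4: "source_archive",  # placeholder value, not confirmed by the thread
    5: "binary",          # placeholder value, not confirmed by the thread
}


def serialize_package_content(value):
    """Return the human-readable label for a package_content value."""
    if value is None:
        # Existing Packages may have no package_content set.
        return None
    return PACKAGE_CONTENT_LABELS[value]


def serialize_package(package):
    """Return a copy of the package dict with package_content as a label."""
    data = dict(package)
    data["package_content"] = serialize_package_content(
        package.get("package_content")
    )
    return data
```

With this shape, someone reading the raw JSON output sees the label rather than the bare integer.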
We're running into an issue where multiple packages have the same source_repo package. This conflicts with the constraint on the DB that purl values and download_url must be unique together.
A solution may be to allow packages to be in multiple package sets. This would involve adding the field package_sets to the Package model.
Right now, we set the package_set value at package insert time: we check whether a package with the same purl values as the one we're about to create already exists. If it does, we use that package's package_set value for our new package. If it doesn't, we create a new package_set value using uuid4().
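The insert-time assignment could look roughly like this. It is a sketch over plain dicts and an in-memory list, not the actual model code; the function name `assign_package_set` is hypothetical, and treating "same purl values" as same type, namespace, name, and version is an assumption.

```python
import uuid

# Assumed identity fields for "same purl values".
IDENTITY_FIELDS = ("type", "namespace", "name", "version")


def assign_package_set(new_package, existing_packages):
    """Set new_package["package_set"], reusing an existing set if one matches."""
    key = tuple(new_package.get(field) for field in IDENTITY_FIELDS)
    for existing in existing_packages:
        if tuple(existing.get(field) for field in IDENTITY_FIELDS) == key:
            # A package with the same purl values exists: join its set.
            new_package["package_set"] = existing["package_set"]
            break
    else:
        # No match found: start a new package set.
        new_package["package_set"] = uuid.uuid4()
    return new_package
```

A source_repo package would not match under this rule, since its purl differs from the packages it belongs to, which is the limitation discussed in this thread.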
The source_repo packages we want to create have a different purl than the packages they are for, so it is not always a straightforward process to automatically associate a source_repo package with its binary or source_archive package. We would have to do this by hand, knowing which Packages to associate our new source_repo package with.
We have the case where there are different instances of the same package. For example, for a maven Package, we can have the source, test, or doc JAR for that package. These JARs are part of the same package but would be considered different packages in the PackageDB, as they have different download URLs. These JARs would have similar purl values (same type, namespace, name, and version) but different subpath and qualifiers. We want to be able to relate this group of Packages together so that we can combine all the findings from the different varieties of a Package and return the data to the user.
After some discussion with @pombredanne, we will add two new fields to the Package model:
- package_set (working name)
- package_content
There would be an API endpoint (or equivalent), where we query for a package, then we would combine the metadata based on some set precedence (curated package data, source data, then binary, etc) and return the combined package data to the user.
@pombredanne @DennisClark
I'd like to hear your thoughts on grouping like packages via ID