Closed pombredanne closed 6 years ago
The naming conventions set forth by Grafeas are at least partially in use by Grafeas partners. Unfortunately, there isn't currently a longer list.
@R2wenD2 Thank you for chiming in! Any pointer to some public or open source reference? or additional details you can share? Any documentation beyond your readme?
@mnonnenmacher @sschuberth ping, as this is a topic that is of interest to you based on this discussion https://github.com/nexB/aboutcode/issues/6#issuecomment-334425350
The best public pointer I can share is the jFrog xray component identifiers. They have small differences from Grafeas (docker and generic files), but the others match. Does that help?
We're using the identifiers suggested by the package manager where applicable and just prefixing with info about which package manager has specified them. I believe the open question here is about what we use to specify the package manager.
What is currently missing from the Grafeas list that you'd like to see?
@pombredanne FYI, the Grafeas spec is at https://github.com/Grafeas/Grafeas
@pombredanne Also, right now we use the tuple $packageManager:$namespace:$name:$version as the identifier, where $namespace is "The namespace of the package, for example the group id in Maven or the scope in NPM".
@sschuberth this approach works well for Maven and NPM but may not work for other package managers/formats/identifiers. Some remarks:
- namespaces do not exist for Pypi and Rubygems for instance
- extra identifying elements may typically be needed for Linux distro packages (e.g. rpm and deb) such as distro/os, arch (though release/version/epoch of rpms and similar for debs could be stuffed alright in a version)
- some extra qualifiers may exist in a few common cases (java vs plain ruby for gems, various combos of os/arch for pypi, type/classifiers/packaging for maven)
So one question is whether we could make everything fit in a schema with a fixed number of segments OR have a variable number of segments depending on the package manager/format/repo technology.
I tend to think the latter is more flexible and provides more resilience to future changes.
I tend to agree with @pombredanne that the latter is more resilient and preferred.
- namespaces do not exist for Pypi and Rubygems for instance
Correct. We simply use an empty thing in such cases.
@R2wenD2 you wrote:
We're using the identifiers suggested by the package manager where applicable and just prefixing with info about which package manager has specified them. I believe the open question here is about what we use to specify the package manager.
Thank you for the xray pointer! Anyone there that you could ping?
I have minor consistency comments: using gav for things that come from Maven repos and pip (the installer for Python pypi packages) seems a tad weird. gav is uncommon, and it is not clear whether the type refers to a packaging format as served by some repo technology (e.g. gav vs. maven) or to an installation tool rather than a repo technology (e.g. pip vs. pypi). But this is a great base, and it would be easy enough to create bidirectional mappings between similar conventions if needed.
What is currently missing from the Grafeas list that you'd like to see?
At least Rubygems, Composer, Golang and CPAN for a start and many more ;) I can suggest/contribute conventions back to Grafeas FWIW
@reiz @andrew out of curiosity, do you use any such thing in versioneye or libraries.io ? @jpopelka your convention in openshift and fabric8 is "ecosystem:name:version" right? where the ecosystem is more or less the same as the grafeas "package manager" URL?
@jpopelka your convention in openshift and fabric8 is "ecosystem:name:version" right? where the ecosystem is more or less the same as the grafeas "package manager" URL?
Yes, our 'ecosystem' is basically a 'package manager'.
@pombredanne we don't have something similar to ecosystem/name/version, but that is not quite enough to be truly unique, as it ignores architecture (rubygems for example can have different versions available for MRI/JRuby gems) and different registries (Maven central vs jfrog for example, but almost every package manager can point at different registries which may have the same name for a different package).
So really it needs to be something like registry-url/name/version/platform
For our purposes, especially considering Go (which has no registry), we've been treating the fully qualified url to the canonical package page as the unique identifier, although some package managers (like Bower) don't have individual urls for packages, so we just make one up from the registry domain.
@andrew Thanks! you wrote:
we've been treating the fully qualified url to the canonical package page as the unique identifier, although some package managers (like Bower) don't have individual urls for packages, so we just make one up from the registry domain
That's a clean and mostly universal approach too! But it does not always convey what a package "format" would be pointed to by a given URL unless this is for well known registries?
@R2wenD2 this brings up possible ambiguities in the grafeas/xray approach:
It does not include a notion of which repo/registry a package lives in, and furthermore each package name/version may have more than one "artifact" e.g. an sdist and many wheels for Python or an mri and jruby gem for Ruby, etc.
So in some cases you identify actual exact files or stack of files and in some other cases you identify some pointers to the primary public package repo for this format. This may be OK, but this may be also a source of confusion?
@chen-keinan re https://github.com/nexB/scancode-toolkit/issues/805#issuecomment-336848112 I may assume wrongly that you may be involved with Jfrog xrays? or not?
@elad165 Should be able to help here.
@elad165 any feedback?
So here is my proposal for ScanCode at least and ABC Data in general: I like the URL/URI approach from Grafeas and xrays a lot. Using plain HTTP URLs as used in libraries.io can work, but it is a tad too generic for my taste. But using URLs in general is a great thing.
Now about the parts of a package identifier:
1. First is a part to identify what is called ecosystem (openshift), package_manager (here.com ORT), package_type (as today in ScanCode) or URL scheme (in the Grafeas or Jfrog/xray URLs). The name used does not matter much, but what this means matters: it captures in a short string a lot of info:
There is no best attribute name to capture all this, but the meaning is clear enough.
I will continue for now to call it package type in ScanCode. 'Kind', 'manager', 'scheme' would all be fair game too. The key point is that we can establish unambiguous mappings between all these 'types'. (e.g. pip and pypi, gav and maven are the same, etc)
Each type has different conventions but all have at least a name and a version (where the version may be a commit). In a very generalized way, we would have something like this from most to least significant:
type
namespace
name
version
qualifier
2. After the type, we have some namespace like specified in grafeas/xray for some types or in ORT by @sschuberth and team where the name and version are likely unique (e.g. a scoped npm scope name, the "jessie" debian distro, the /user/ in a github repo, a maven groupid, etc).
What this string means is opaque and specific to a package type. It may be empty. This would be the host in a plain http URL.
3. Next, we have name and version: they are pretty explicit. The exact meaning may vary by type (e.g. sorting conventions to get the last version, etc). Note that for "structured" versions like in RPMs (epoch:version:release) or OSGi/Eclipse (where the version has an optional qualifier) we will treat the whole as a version string. A version may also be a tag or commit'ish or date/time stamp as a last resort.
4. And then we have qualifier(s) (or suffix(es)) that are things like an OS or arch in RPMs or deb, or pypi environment markers, ruby mri vs. jruby packages, maven packaging/qualifier/type, a distro "level" such as el6 or el7 for RPMs or jessie for debian, etc. What this string means is opaque and also specific to a package type.
5. The last important part is the package "location" e.g. the package repository or registry needed to locate this package... The public registry is implied by a type (e.g. rubygems.org or pypi.python.org or npmjs.org, etc as defaults). I would prefer to keep this entirely separate from the package identifier: we should use instead an extra separate URL attribute to point to an alternate registry, be it public or private, beyond the implicit default for a package type. I suggest a registry_url attribute for this.
6. I left aside anything about content-based identifiers (e.g. checksums): this is a separate and solved topic that does not need much discussion IMHO... though I like to be able to identify a plain file that is effectively a package using only a checksum like in xray (with a "generic" type) as a last resort. This would be for a few odd cases anyway.
type
namespace
name
version
qualifier
When used as a single string, they can form a URL of sorts using '/', ':' or '@' separators such as:
type://namespace/name:version:qualifier
type:namespace/name:version:qualifier
type://namespace/name:version@qualifier
type://namespace/name@version#qualifier
type:namespace:name:version:qualifier (which is a tad ambiguous)
Any of these works as long as each segment can be unambiguously parsed back for a given type.
I prefer using the plain colon to using // or /// as a separator but it does not make a big difference so long as it parses. Using an HTTP-like URL structure could have a lot of benefits as there are plenty of URL parsers available.
Which one to pick?
pypi:///scancode-toolkit would mean the scancode-toolkit package on the public pypi registry.
No version is specified, which is OK: this forms a natural hierarchy where data can be stored at a higher or lower level, the higher levels providing a default that lower levels can override with specific data.
And pypi:///scancode-toolkit:2.0.0 would identify the scancode-toolkit package v2.0.0. It may be some wheel or a source distro. A qualifier may be needed if you want extra precision about an exact wheel or source distribution.
Both seem to use things like OS/arch as "classifiers"/prefixes rather than qualifiers/suffixes for distro packages. IMHO this is not something that is higher in the identification hierarchy than the name/version (even though this is a major factor in locating arch-specific packages for Yum repos), so I prefer to stick with using a qualifier for these, especially since this is an indication of a build of a given version. It otherwise does not feel right to force having an architecture identifier. But in any case the conversion from one to the other is straightforward.
https URL which squats the namespace segment (e.g. gcr.io): squatting the super common https scheme as a package type is not great IMHO. I prefer the xray clear docker type.
Also using @ as a separator for version/image/id is a variant to consider everywhere rather than making this an exception? e.g. use @ as a version separator all the way, like here in SPDX where I specified using @ to separate a tag/commit (using the same convention as in pip VCS urls) https://github.com/spdx/spdx-spec/blob/cfa1b9d08903befdf03e669da6472707b7b60cb9/chapters/3-package-information.md#37-package-download-location- ?
gav type for maven: a tad cryptic, but not a big deal as long as they map to each other OK.
And here is a short summary: I propose we use five parts/fields to identify a package:
- type: such as maven, npm, gem, pypi
- namespace: such as a maven groupid, a debian distro
- name: the name
- version: the version
- qualifier: extra things beyond the version such as OS, architecture, src or doc, etc. that qualify specific details for a given version

Most are optional and can also be composed in some URL as in:
maven://org.apache.commons/io:1.2.3
e.g. type://namespace/name:version
The exact format to use for this URL is not fully specified yet.
And we add an extra registry_url field to point to an alternative package registry such as a private NPM repo.
Here is a revised design:
A package identifier is defined by six parts or fields that form a hierarchy, from the least specific to the most specific identifying information:
- type: such as maven, npm, gem, pypi. required
- namespace: such as a maven groupid, a debian distro. optional
- name: the name. required
- version: the version. optional
- qualifiers: extra qualifying data for a package such as OS, architecture, src or doc. optional
- path: optional path within a package. optional

At the minimum a type and a name are required. Other parts are optional.
A package identifier is either discrete fields or a URL string using these conventions:
type:namespace/name@version?qualifiers#path
For instance:
maven:org.apache.xmlgraphics/batik-anim@1.9.1?packaging=sources --> the source jar
go:google.golang.org/genproto#googleapis/api/annotations --> a path inside a Go package repo
rpm:fedora-25/curl@7.50.3-1.fc25?arch=src --> the src RPM
rpm:fedora-25/curl@7.50.3-1.fc25?arch=i386 --> an i386 build
docker:scanning-customer/dockerimage@sha256:244fd47e07d1004f0aed9c156aa09083c82bf8944eceb67c946ff7430510a77b --> a Docker image with a specific id as version
docker:cassandra@cassandra --> a Docker image with a specific tag as version
pypi:django@1.11.1
gem:ruby-advisory-db-check@0.12.4
gem:jruby-launcher/versions/1.1.2?platform=java --> a gem for JRuby
npm:%40angular/animation@12.3.1 --> a scoped node package

The string would be UTF-8 encoded, with percent-encoding where needed https://en.wikipedia.org/wiki/Percent-encoding with these rules for each part:
- type: composed only of ASCII letters and numbers, + and - and _.
- namespace: contains zero or more segments, separated by a slash. Each segment must be a percent-encoded string.
- name: must be a percent-encoded string.
- version: must be a percent-encoded string.
- qualifiers: must be a percent-encoded string. The content structure is defined by the type. IMHO the best would be a HTTP URL query string with name/value pairs.
- path: contains zero or more segments, separated by a slash. Each segment must be a percent-encoded string.

The parsing approach would be:
- split on #: the right side is the path after percent-decoding.
- split on ?: the right side is the qualifiers after percent-decoding (and eventual parsing of the query string name/value pairs).
- split on /: the right side is the name after percent-decoding.
- split on :: the left side is the type, the right side is the namespace after percent-decoding.

.... which (in Python) would mean using the str.rpartition() function for the splits.
And to get an exact download, we either provide an optional registry base url if this is not on the standard public registry and/or an optional full direct download URL.
I cannot think of any case I know of that would not work with this approach. After all this is a URL which is a tried and true way to create identifiers and locators ;)
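The right-to-left splitting described above can be sketched in Python roughly as follows. This is only an illustration, not the implementation from the ScanCode branch: the names `split_right` and `parse_package_url` are hypothetical, and the split on `@` for the version is an assumption based on the `type:namespace/name@version?qualifiers#path` layout, since that step is not listed explicitly.

```python
from urllib.parse import parse_qsl, unquote


def split_right(text, sep):
    """Split on the last `sep`; the right side is '' when `sep` is absent."""
    left, found, right = text.rpartition(sep)
    return (left, right) if found else (text, "")


def parse_package_url(purl):
    """Parse 'type:namespace/name@version?qualifiers#path' (hypothetical helper)."""
    # Peel off the least significant parts first, from the right.
    remainder, path = split_right(purl, "#")
    remainder, qualifiers = split_right(remainder, "?")
    # Assumption: the version follows the last "@" of what is left.
    remainder, version = split_right(remainder, "@")
    # The type ends at the first ":"; an optional "//" is not significant.
    type_, _, remainder = remainder.partition(":")
    remainder = remainder.lstrip("/")
    # Without a "/", rpartition leaves the namespace empty.
    namespace, _, name = remainder.rpartition("/")
    return {
        "type": type_,
        "namespace": unquote(namespace),
        "name": unquote(name),
        "version": unquote(version),
        # parse_qsl percent-decodes the name/value pairs itself.
        "qualifiers": dict(parse_qsl(qualifiers)),
        "path": unquote(path),
    }


parsed = parse_package_url("npm:%40angular/animation@12.3.1")
print(parsed["namespace"], parsed["name"], parsed["version"])
```

Splitting the version only after the qualifiers and path are removed keeps an `@` inside a qualifier value from being mistaken for the version separator.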
ok, I did not receive much pushback on the details, so I am assuming this is a good thing!
@JonoYang suggested a great name for these: puurl, standing for Package "mostly" Universal URL.
The only cosmetic thing is whether to use type:// or type: for a "canonical" form. The // is not significant in any case here.
e.g. this:
maven:org.apache.commons/io@1.2.3
or this:
maven://org.apache.commons/io@1.2.3
Aesthetically, the // looks much better I guess!
I have a Python implementation in the #275 branch here: https://github.com/nexB/scancode-toolkit/blob/275-streamline-package-manifests-models/src/packagedcode/models.py#L169 In particular the creation and parsing is straightforward.
It should be trivial to have a Go or Ruby or JS implementation
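To give an idea of how small the creation side can be, here is a minimal Python sketch that assembles the string from discrete fields with percent-encoding. It is not the code from the linked branch; `build_package_url` and its parameters are hypothetical names, and keeping `:` unescaped in the version (to match the `sha256:` Docker example) is an assumption.

```python
from urllib.parse import quote, urlencode


def build_package_url(type_, name, namespace="", version="", qualifiers=None, path=""):
    """Assemble 'type:namespace/name@version?qualifiers#path' (hypothetical helper)."""
    if not type_ or not name:
        raise ValueError("a type and a name are required")
    parts = [type_, ":"]
    if namespace:
        # Encode each namespace segment separately so "/" stays a separator.
        parts += ["/".join(quote(seg, safe="") for seg in namespace.split("/")), "/"]
    parts.append(quote(name, safe=""))
    if version:
        # Assumption: ":" is left as-is so "sha256:..." versions stay readable.
        parts += ["@", quote(version, safe=":")]
    if qualifiers:
        # Sorted name/value pairs in query-string form give a stable output.
        parts += ["?", urlencode(sorted(qualifiers.items()))]
    if path:
        parts += ["#", "/".join(quote(seg, safe="") for seg in path.split("/"))]
    return "".join(parts)


print(build_package_url("npm", "animation", namespace="@angular", version="12.3.1"))
```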
@R2wenD2 I would like to also contribute this spec to Grafeas FWIW.
Can you add a proposal issue to Grafeas? Feel free to copy your design proposal above - I just want to make sure folks interested in Grafeas have a chance to review. One small note - Grafeas can't support docker images by tag reference because tags are mutable.
@R2wenD2 you wrote:
Can you add a proposal issue to Grafeas? Feel free to copy your design proposal above - I just want to make sure folks interested in Grafeas have a chance to review.
Sure thing, that was the intent.
One small note - Grafeas can't support docker images by tag reference because tags are mutable.
Which is perfectly fine (I could not agree more and I always found these mutable tags to be a terrible wart). Both would be supported in puurls anyway, but you could enforce in Grafeas that the version of a docker image MUST be a sha256 and reject plain tags. For containers the only difference with your current README would be that:
https://gcr.io/scanning-customer/dockerimage@sha256:244fd47e07d1004f0aed9c156aa09083c82bf8944eceb67c946ff7430510a77b
would become instead
docker://gcr.io/scanning-customer/dockerimage@sha256:244fd47e07d1004f0aed9c156aa09083c82bf8944eceb67c946ff7430510a77b
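A consumer that only accepts immutable image references could check the version part with something like the sketch below. The function name `is_immutable_docker_version` and the exact pattern (lowercase hex, 64 characters) are assumptions for illustration, not something Grafeas defines in this thread.

```python
import re

# Assumption: an image id is "sha256:" followed by 64 lowercase hex digits.
_SHA256_DIGEST = re.compile(r"sha256:[0-9a-f]{64}")


def is_immutable_docker_version(version):
    """Return True when the version is a sha256 image id rather than a tag."""
    return _SHA256_DIGEST.fullmatch(version) is not None


digest = "sha256:244fd47e07d1004f0aed9c156aa09083c82bf8944eceb67c946ff7430510a77b"
print(is_immutable_docker_version(digest))       # accepted: an image id
print(is_immutable_docker_version("cassandra"))  # rejected: a mutable tag
```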
BTW any preferences to use // or not ?
Side note: I think that plain web URLs should NOT be puurls ... e.g. http/https/ftp URLs would never be valid puurls. They can be used for package identification otherwise and possibly as extra qualifiers, but not as plain puurls.
So a puurl may not always be enough to identify exactly a package "artifact".
You may need to add:
- repository_url for alternative package repositories beside the default public one (such as a private npm registry or maven repo) .... though some puurl type/schemes may have this built in their namespace as is the case for docker://gcr.io/foo/bar where the first segment of the namespace is a docker registry, exactly the same way this is used with a docker pull command. In this case, when omitted, it would point to the default public docker hub registry.
- download_url that could be off a traditional package repository
- checksums to identify and verify a file content integrity such as sha256, etc. though some puurl type/schemes may have this built in their version conventions (such as for a docker://abc/def@sha256:4545454 version that points to an image ID that is a checksum) or checksums could be assigned as key/value pairs in the qualifiers.

Also, the SPDX download locations for version control identifiers (that I contributed) at https://github.com/spdx/spdx-spec/blob/cfa1b9d08903befdf03e669da6472707b7b60cb9/chapters/3-package-information.md#37-package-download-location- which are based on Python pip could be valid puurls I guess. The syntax is the same and we could streamline these for puurls such as:
vcs://host/path-to-repo@commitish#subpath-in-repo
where the type for a vcs could be just git://, hg://, svn:// or else and not specify the exact transport (e.g. no https/http as in git+https:// as this is not essential)
Also some examples of puurls for things on github, gitlab or bitbucket:
github://nexB/scancode-toolkit@63a8c72868e3af061de
bitbucket://birkenfeld/pygments-main@4449acea8ee9afa9c447c50928b3da6bf60ae729
And the same examples of puurls using a vcs type instead:
git://github.com/nexB/scancode-toolkit@63a8c72868e3af061de
hg://bitbucket.org/birkenfeld/pygments-main@4449acea8ee9afa9c447c50928b3da6bf60ae729
So it looks like this is really fitting nicely overall and feels good in general. @elad165 all the credit for the original idea goes to you and Xray I guess, right? I would value your feedback a lot.
@sschuberth @mnonnenmacher IMHO it could be straightforward to evolve https://github.com/heremaps/oss-review-toolkit to use puurls too.
@andrew could this be something you would consider as an addition to @librariesio and @dependencyci ?
@reiz I guess that since you announced you are sunsetting @versioneye you may not care anymore at all ... and I am really sorry for this :|
@jpopelka could this be something you would consider as an addition to fabric8 and openshift analytics ?
@kartiksibal this is something that you should consider to adopt in vulnerablecode
@singh1114 @RajuKoushik this is something that you should consider to adopt in scancode-server
@chinyeungli this is something that you should contemplate to integrate in aboutcode-toolkit
@jdaguil this is something that you should consider to adopt in aboutcode-manager
And also additional pings: your feedback would be much valued!
@jeffmcaffer this is likely to be of some interest to you
@goneall @kestewart this is likely to be of some interest to you for SPDX
@jayfk this might be of interest to you for @pyup safety-db vulnerability database
@ashcrow this might be of interest to you for @victims victims-cve-db vulnerability database
@grnd this might be of interest to you for @snyk vulnerabilitydb vulnerability database
@adulau @PidgeyL this might be of interest to you wrt the things we discussed to get better package id mappings to CVEs and vulnerabilities for @cve-search
ok, I did not receive much pushback on the details, so I am assuming this is a good thing!
@pombredanne FYI, the reason why I was not commenting simply is because this by now is way too much information for me to find the time to go through thoroughly. I'll see what I can do over the next week.
puurl standing for Package "mostly" Universal URL.
I'm sorry to spoil the party here, but this sounds a bit too much like poo-url to me ;-) How about puid or simply id / identifier (in the context of a package) instead?
BTW any preferences to use // or not ?
I think we should have //, and more specifically, design "puurls" as URIs (not URLs) which the existing URI parsing functions in most common languages accept. On top of that, those parts of the "puurl" that are most commonly used should be directly accessible via URI parsing, i.e. instead of
type:namespace/name@version?qualifiers#path
I'd propose
type://version@namespace/name?qualifiers#path
so that there's the following mapping (using terminology from java.net.URI):
URI component | "puurl" component |
---|---|
scheme | type |
userInfo (as part of the authority) | version |
host (as part of the authority) | namespace |
path | name |
fragment | path |
query | qualifiers |
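For comparison with the java.net.URI terminology, the same mapping can be tried with Python's standard `urllib.parse.urlsplit`. This is only a sketch of the proposed `type://version@namespace/name?qualifiers#path` layout; the example strings are made up for illustration. It also shows one behavior of this standard parser worth knowing when reusing the authority part: the `hostname` attribute is returned in lower case.

```python
from urllib.parse import urlsplit

# Illustrative string in the proposed "type://version@namespace/name" layout.
parts = urlsplit("maven://1.9.1@org.apache.xmlgraphics/batik-anim?packaging=sources#some/path")

mapping = {
    "type": parts.scheme,                # scheme
    "version": parts.username,           # userInfo (part of the authority)
    "namespace": parts.hostname,         # host (part of the authority)
    "name": parts.path.lstrip("/"),      # path
    "qualifiers": parts.query,           # query
    "path": parts.fragment,              # fragment
}
print(mapping)

# The host is lower-cased by the parser, so a mixed-case namespace changes.
lowered = urlsplit("github://1.0@nexB/scancode-toolkit").hostname
print(lowered)
```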
@sschuberth
FYI, the reason why I was not commenting simply is because this by now is way too much information for me to find the time to go through thoroughly. I'll see what I can do over the next week.
You are right .... this is a terribly bad place to review this. Let me push this in a separate repo in a PR amenable to review and comments.
I'm sorry to spoil the party here, but this sounds a bit too much like poo-url to me ;-) How about puid or simply id / identifier (in the context of a package) instead?
Sure why not, though this is meant to sound more like peeeeeerl than poo. We could call it puk too: package unique key
I think we should have //, and more specifically, design "puurls" as URIs (not URLs) which the existing URI parsing functions in most common languages accept.
As much as I was a purist about URI vs URL, the https://url.spec.whatwg.org/#goals says about this:
Standardize on the term URL. URI and IRI are just confusing.
In practice a single algorithm is used for both so keeping them
distinct is not helping anyone.
URL also easily wins the search result popularity contest.
So calling them URL vs URI is not a big concern IMHO and they are locators alright I think.
I'd propose
type://version@namespace/name?qualifiers#path
The major drawback there is that this breaks the hierarchical property of the string: version is less significant than a namespace and name and it would come first here.
Also hacking the user/pass/host/port part of a URL is likely to eventually create confusion and ambiguities if the strings do not match what we expect.
In any case, let me put this in a proper document rather than a ticket so that proper commenting is possible.
Here is a place for commenting on a draft: https://github.com/puurl/puurl-spec/pull/1
As much as I was a purist about URI vs URL, the https://url.spec.whatwg.org/#goals says about this:
I'm unsure about the authority of this resource, and I find the arguments a bit bogus ("URL also easily wins the search result popularity contest").
Semantically, an URI would be the correct thing for us as we'd like to identify, not locate.
The major drawback there is that this breaks the hierarchical property of the string:
So, what's the benefit of having that hierarchy? For parsing it does not matter.
@sschuberth you wrote:
Semantically, an URI would be the correct thing for us as we'd like to identify, not locate.
Per https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Conceptual_distinctions :
A URL is simply a URI that happens to point to a resource over a network.
And actually a puurl would be a URL in this definition as it is always a locator that points to a resource over a network: the location is implied for a package type as based on the default repository of the type (e.g. rubygems.org for gem, maven.org for maven, pypi.python.org for pypi, registry.npmjs.org for npms, hub.docker.com for docker, etc.); and in some cases the location can be explicit for some type when included in the namespace such as with docker (e.g. docker://gcr.io/customer/dockerimage@sha256:244fd47e07d1004f0aed9c is not pointing to the default docker hub but to another "registry")
The major drawback there is that this breaks the hierarchical property of the string:
So, what's the benefit of having that hierarchy? For parsing it does not matter.
The benefit is that puurl strings are sortable without parsing: this is a valuable property when dealing with a large number of puurls in a database or even small numbers in UI lists.
But beside this, not hacking the user/pass part of a URI/URL also ensures that we do not depend on the specifics, history and constraints of how these parts are made and parsed, which could lead to subtle and weird errors, such as not serializing back a user/pass because this is something that typically should not be echoed. Using something that does not reuse any of the user/pass/host/port parts of a URL or URI also avoids dealing with all the baked-in complexity of idna, ip v4 and v6 addresses, host-relative paths and scheme-relative paths and so on. Most serious URI/URL parsing libraries deal with many of these corner cases at various levels: IMHO any benefit from reusing standard parsers as-is is likely outweighed by the issues that come with their baggage, history and corner cases.
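The "sortable without parsing" property can be seen with a plain string sort. The small Python sketch below uses made-up example strings (only the formats come from this thread): with the type, namespace and name first, all entries for one package end up next to each other, whereas a version-first layout groups unrelated packages by version. Note this is a plain lexical sort, not a version-aware one.

```python
# Hierarchical layout: type:namespace/name@version
hierarchical = [
    "pypi:django@1.11.1",
    "maven:org.apache.commons/io@1.2.3",
    "pypi:django@1.10.0",
    "gem:ruby-advisory-db-check@0.12.4",
    "maven:org.apache.commons/io@1.2.0",
]
ordered = sorted(hierarchical)
for purl in ordered:
    print(purl)

# Version-first layout: type://version@namespace/name
version_first = [
    "pypi://1.2.3@/requests",
    "pypi://1.11.1@/django",
    "pypi://1.2.0@/requests",
]
# Here the django entry lands between the two requests entries.
ordered_version_first = sorted(version_first)
```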
And actually a puurl would be a URL in this definition as it is always a locator that points to a resource over a network
I was thinking a "puurl" to be more general in that it could also point to an artifact in a local Maven repository, for example. So it's not necessarily a network resource. Also, I still believe its primary purpose is to identify, not to locate, or?
The benefit is that puurl strings are sortable without parsing
Ok, granted, that might be useful.
In any case, pretty please with sugar on top, come up with a different name than "puurl". Pronouncing it like "peeeeeerl" (= "pee-url") does not make it any better.
@sschuberth you wrote:
I was thinking a "puurl" to be more general in that it could also point to an artifact in a local Maven repository, for example. So it's not necessarily a network resource. Also, I still believe its primary purpose is to identify, not to locate, or?
This is not necessarily a network resource indeed, the same way a file:// URL is a local, non-networked resource. I think that by making the location implicit in the type we have in the vast majority of cases a nice shorthand to both identify and locate a public package. In the special case of local-only packages, a download_url (eventually using a local file path) supplements this nicely IMHO.
As a side note, since the "authority" is implied by default based on a type, the use of :// may need to be banned and : would be the separator after a type?
Also we could define a standard qualifier key/value pair to point to an alternate repository URL, since this is qualification and not primary data IMHO?
In any case, pretty please with sugar on top, come up with a different name than "puurl". Pronouncing it like "peeeeeerl" (= "pee-url") does not make it any better.
I think that @JonoYang was thinking more about the sound of a cat :cat: purr https://en.wikipedia.org/wiki/Purr rather than pee or poo (I cannot believe we are discussing potty in a serious ticket :rofl: )
I like the purring side.
But sure, we can do better!
What about puke as in Package Universal Key Enumeration? Somehow I fear you might object to this too though!
Or purrl as in Package Uniform Real Resource Locator? with a cat :smirk_cat: bonus.
Now more seriously what if we call it simply package_url or purl for short? (this has other meanings per https://en.wikipedia.org/wiki/Purl_(disambiguation) but that does not matter too much)
:// may need to be banned and : would be the separator after a type?
I'd still prefer to keep it, as it simply reads more common if you're used to URIs.
since this is qualification and not primary data IMHO?
I'd be fine with that.
What about puke as in Package Universal Key Enumeration?
You must be kidding me!
purl and purrl are fine with me, although I'd still prefer puid.
@sschuberth
I'd still prefer to keep it [://], as it simply reads more common if you're used to URIs.
That's ok, this is not significant anyway in the case of a purl.
You must be kidding me!
I was, of course ;)
Let's go with purl
Happy to implement on Libraries.io once the spec is finished 👌
All: I created a separate and neutral org GitHub @package-url and published something that starts to look like a decent draft spec at https://github.com/package-url/purl-spec/tree/initial-draft and this PR https://github.com/package-url/purl-spec/pull/1
Please provide feedback and comments over there instead! This would be much appreciated.
I pushed a Python purl lib at https://github.com/package-url/purl-python (and on Pypi as a pypi:purl-python@0.2.0 purl :smile_cat: ) as well as a JSON test fixture to use as a language neutral test suite by any implementation at https://github.com/package-url/purl-test-suite
@ashcrow started working on a Go implementation :heart_eyes: at https://github.com/package-url/purl-go
See other refinements and discussions at https://github.com/package-url/purl-spec/pull/1 And a growing witty FAQ at https://github.com/package-url/purl-spec/wiki/FAQ
As for scancode, I will be adopting the purl-python library in the branch: https://github.com/nexB/scancode-toolkit/tree/275-streamline-package-manifests-models which will become the base for improved package scanning and detection as per #832 when merged in develop.
Next up will be to adopt it in other https://aboutcode.org tools such as scancode-server and also in vulnerablecode. In the latter, emerging tool, this will actually remove a mental road block to cleanly map CVE vulnerabilities to actual software packages (and map these to CPEs when they exist)
FYI @ashcrow contributed a Go implementation at https://github.com/package-url/packageurl-go :tada:
Package URL is now implemented alright in develop and works well. Next step is to declare the purl spec a 1.0
I am closing this now. The Package URL lives its own life now at https://github.com/package-url ... and is heavily used in ScanCode and other places. Thank you all for the contributions and feedback!
We need a proper package id that is universal and unique: the difficulty is that each package management technology uses more or fewer parts in an identifier beyond basic name+version: Maven GAVs, RPM NEVRA, etc. A simple solution is to have a single string ID with a prefix that describes what this is about and have a variable number of slash or colon-separated segments URN/URI-style such as used in:
This should probably be defined in ABC Data and with https://github.com/nexB/aboutcode/issues/6