Design and use a proper "universal" and unique package identifier

pombredanne commented 7 years ago

We need a proper package id that is universal and unique: the difficulty is that each package management technology uses more or fewer parts in its identifiers beyond the basic name+version: Maven GAVs, RPM NEVRA, etc. A simple solution is to have a single string ID with a prefix that describes what this is about and a variable number of slash- or colon-separated segments, URN/URI-style, such as used in:

openshift and fabric8 analytics: https://github.com/openshiftio/openshift.io/issues/1052#issuecomment-336523506
the new grafeas https://github.com/Grafeas/Grafeas/blob/9d3c7b3a08c74dfa42e91c1ecd428e163f7bfe0a/README.md#resource-urls ( @R2wenD2 is this a convention that is already in use? do you have a longer list of examples? )
CPEs, (more or less, with several caveats IMHO )
DejaCode-style URNs

This should probably be defined in ABC Data and with https://github.com/nexB/aboutcode/issues/6

R2wenD2 commented 7 years ago

The naming conventions set forth by Grafeas are at least partially in use by Grafeas partners. Unfortunately, there isn't currently a longer list.

pombredanne commented 7 years ago

@R2wenD2 Thank you for chiming in! Any pointer to some public or open source reference? or additional details you can share? Any documentation beyond your readme?

pombredanne commented 7 years ago

@mnonnenmacher @sschuberth ping, as this is a topic that is of interest to you based on this discussion https://github.com/nexB/aboutcode/issues/6#issuecomment-334425350

R2wenD2 commented 7 years ago

The best public pointer I can share is the jFrog xray component identifiers. They have small differences from Grafeas (docker and generic files), but the others match. Does that help?

We're using the identifiers suggested by the package manager where applicable and just prefixing with info about which package manager has specified them. I believe the open question here is about what we use to specify the package manager.

What is currently missing from the Grafeas list that you'd like to see?

sschuberth commented 7 years ago

@pombredanne FYI, the Grafeas spec is at https://github.com/Grafeas/Grafeas

sschuberth commented 7 years ago

@pombredanne Also, right now we use the tuple $packageManager:$namespace:$name:$version as the identifier, where $namespace is "The namespace of the package, for example the group id in Maven or the scope in NPM".

pombredanne commented 7 years ago

@sschuberth this approach works well for Maven and NPM but may not work for other package managers/formats/identifiers. Some remarks:

namespaces do not exist for Pypi and Rubygems for instance
extra identifying elements may be typically needed for Linux distro packages (e.g. rpm and deb) such as distro/os, arch (though release/version/epoch of rpms and similar for debs could be stuffed alright in a version)
some extra qualifiers may exists in a few common cases (java vs plain ruby for gems, various combos of os/arch for pypi, type/classifiers/packaging for maven)

So one thing is if we could make everything fit in a schema with a fixed number of segments OR have a variable number of segments depending on the package managers/format/repo technology.

I tend to think the latter is more flexible and provides more resilience to future changes.

R2wenD2 commented 7 years ago

I tend to agree with @pombredanne that the latter is more resilient and preferred.

sschuberth commented 7 years ago

namespaces do not exist for Pypi and Rubygems for instance

Correct. We simply use an empty thing in such cases.

The second and third remarks could probably be captured by a generic optional extra qualifier field whose meaning depends on the package manager.

pombredanne commented 7 years ago

@R2wenD2 you wrote:

We're using the identifiers suggested by the package manager where applicable and just prefixing with info about which package manager has specified them. I believe the open question here is about what we use to specify the package manager.

Thank you for the xray pointer! Anyone there that you could ping? I have minor consistency comments: for instance, using gav for things that are from Maven repos and pip, which is the manager for Python pypi packages, seems a tad weird. It is either uncommon (gav), or it is unclear whether what is referenced is a packaging format as served by some repo technology (e.g. gav vs. maven) or an installation tool rather than a repo technology (e.g. pip vs. pypi). But this is a great base and it would be easy enough to create bidirectional mappings between similar conventions if needed.

What is currently missing from the Grafeas list that you'd like to see?

At least Rubygems, Composer, Golang and CPAN for a start and many more ;) I can suggest/contribute conventions back to Grafeas FWIW

pombredanne commented 7 years ago

@reiz @andrew out of curiosity, do you use any such thing in versioneye or libraries.io ? @jpopelka your convention in openshift and fabric8 is "ecosystem:name:version" right? where the ecosystem is more or less the same as the grafeas "package manager" URL?

jpopelka commented 7 years ago

@jpopelka your convention in openshift and fabric8 is "ecosystem:name:version" right? where the ecosystem is more or less the same as the grafeas "package manager" URL?

Yes, our 'ecosystem' is basically a 'package manager'.

andrew commented 7 years ago

@pombredanne we don't have something similar to ecosystem/name/version, but that isn't quite enough to be truly unique, as it ignores architecture (rubygems for example can have different versions available for MRI/JRuby gems) and different registries (Maven central vs jfrog for example, but almost every package manager can point at different registries which may have the same name for a different package).

So really it needs to be something like registry-url/name/version/platform

For our purposes, especially considering Go (which has no registry), we've been treating the fully qualified url to the canonical package page as the unique identifier, although some package managers (like Bower) don't have individual urls for packages, so we just make one up from the registry domain.

pombredanne commented 7 years ago

@andrew Thanks! you wrote:

we've been treating the fully qualified url to the canonical package page as the unique identifier, although some package managers (like Bower) don't have individual urls for packages, so we just make one up from the registry domain

That's a clean and mostly universal approach too! But it does not always convey what a package "format" would be pointed to by a given URL unless this is for well known registries?

@R2wenD2 this brings up a possible ambiguity in the grafeas/xray approach:

the identifier points unambiguously to a unique file or set of files for your https:// (docker) and file:// schemes
OR to some package "name/version++" in other cases, and I assumed this would then be from the main public registry/repository for this package "format".

It does not include a notion of which repo/registry this package lives in and furthermore each package name/version may have more than one "artifact" e.g. an sdist and many wheels for Python or an mri and jruby gem for Ruby, etc.

So in some cases you identify actual exact files or stack of files and in some other cases you identify some pointers to the primary public package repo for this format. This may be OK, but this may be also a source of confusion?

pombredanne commented 7 years ago

@chen-keinan re https://github.com/nexB/scancode-toolkit/issues/805#issuecomment-336848112 I may assume wrongly that you may be involved with Jfrog xrays? or not?

R2wenD2 commented 7 years ago

@elad165 Should be able to help here.

pombredanne commented 7 years ago

@elad165 any feedback?

pombredanne commented 7 years ago

So here is my proposal for ScanCode at least and ABC Data in general: I like the URL/URI approach from Grafeas and xrays a lot. Using plain HTTP URLs as used in libraries.io can work, but it is a tad too generic for my taste. But using URLs in general is a great thing.

Now about the parts of a package identifier:

1. First is a part to identify what is called ecosystem (openshift), package_manager (here.com ORT), package_type (as today in ScanCode) or URL scheme (in the Grafeas or Jfrog/xray URLs). The name used does not matter much, but what this means matters: it captures in a short string a lot of info:

name and version conventions for id
package formats, metadata/manifest formats (package.json, a vendor Go dir, a Gemfile, etc)
build, installation and dependencies resolution convention (e.g. setup.py, RPM specfile, how to document dep constraints, etc)
protocols to exchange with a remote package registry (registry APIs and protocols, etc)
package manager(s) that can deal with all these (npm, yarn, pip, mvn, etc)
a default public registry URL (e.g. implied with maven, npm, gems, etc)

There is no best attribute name to capture all this, but the meaning is clear enough. I will continue for now to call it package type in ScanCode. 'Kind', 'manager', 'scheme' would all be fair game too. The key point is that we can establish unambiguous mappings between all these 'types'. (e.g. pip and pypi, gav and maven are the same, etc)

Each type has different conventions but all have at least a name and a version (where the version may be a commit) In a very generalized way, we would have something like this from most to least significant:

type namespace name version qualifier

2. After the type, we have some namespace like specified in grafeas/xray for some types or in ORT by @sschuberth and team where the name and version are likely unique (e.g. a scoped npm scope name, the "jessie" debian distro, the /user/ in a github repo, a maven groupid, etc). What this string means is opaque and specific to a package type. It may be empty. This would be the host in a plain http URL.

3. Next, we have name and version: they are pretty explicit. The exact meaning may vary by type (e.g. sorting conventions to get the last version, etc). Note that for "structured" versions like in RPMs (epoch:version:release) or OSGi/Eclipse (where the version has an optional qualifier) we will treat this whole as a version string. A version may also be a tag or commit'ish or date/time stamp as a last resort.

4. And then we have qualifier(s) (or suffix(es)) that are things like an OS or arch in RPMs or deb, or pypi environment markers, ruby mri vs. jruby packages, maven packaging/qualifier/type, a distro "level" such as el6 or el7 for RPMs or jessie for debian, etc. What this string means is opaque and also specific to a package type.

5. The last important part is the package "location" e.g. the package repository or registry needed to locate this package... The public registry is implied by a type. (e.g. rubygems.org or pypi.python.org or npmjs.org, etc as defaults). I would prefer to keep this entirely separate from the package identifier: we should use instead an extra separate URL attribute to point to an alternate registry, be it public or private beyond the implicit default for a package type. I suggest a registry_url attribute for this.

6. I left aside anything about content-based identifiers (e.g. checksums): this is a separate and solved topic that does not need much discussion IMHO... though I like to be able to identify a plain file that is effectively a package using only a checksum like in xray (with a "generic" type) as a last resort. This would be for a few odd cases anyway.

So in recap the package id is either a string, or discrete fields using this convention: `type` `namespace` `name` `version` `qualifier`

Attributes can be parsed from a string unambiguously given a type.
Most parts are optional, but you need something to form some id of course.
These form a natural hierarchy from most to least significant.

When used as a single string, they can form a URL of sorts using '/' , ':' or '@' -separators such as:

type://namespace/name:version:qualifier
or this: type:namespace/name:version:qualifier
or this: type://namespace/name:version@qualifier
or this: type://namespace/name@version#qualifier
or this: type:namespace:name:version:qualifier (which is a tad ambiguous)

As long as each segment can be unambiguously parsed back for a given type. I prefer using the plain colon to using // or /// as a separator but it does not make a big difference so long as it parses. Using an HTTP-like URL structure could have a lot of benefits as there are plenty of URLs parsers available.

Which one to pick?

Some examples:

pypi:///scancode-toolkit would mean the scancode-toolkit package on the public pypi registry. No version is specified, which is OK: this forms a natural hierarchy where data can be stored at higher or lower level nicely, the higher levels providing a default and lower levels can override this with specific data.

And pypi:///scancode-toolkit:2.0.0 would identify the scancode-toolkit package v 2.0.0. It may be a wheel for a certain platform or a source distribution. A qualifier may be needed if you want extra precision about an exact wheel or source distribution.
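
To illustrate the hierarchy idea, here is a minimal Python sketch of how data stored at a higher level (no version) could act as a default that a lower level (with a version) overrides. The `lookup` helper, the tuple keys and the stored values are all made up for illustration; nothing here is an existing ScanCode API.

```python
# Hypothetical store keyed by (type, name, version). A version of None is the
# package-level entry. All values are made-up placeholders.
STORE = {
    ("pypi", "scancode-toolkit", None): {
        "owner": "nexB",
        "notes": "package-level default",
    },
    ("pypi", "scancode-toolkit", "2.0.0"): {
        "notes": "specific to 2.0.0",
    },
}


def lookup(store, pkg_type, name, version=None):
    """Merge data from the least specific level to the most specific one."""
    merged = dict(store.get((pkg_type, name, None), {}))
    if version is not None:
        # Version-level data overrides the package-level defaults.
        merged.update(store.get((pkg_type, name, version), {}))
    return merged


print(lookup(STORE, "pypi", "scancode-toolkit"))
print(lookup(STORE, "pypi", "scancode-toolkit", "2.0.0"))
```

The versioned lookup keeps the package-level `owner` and replaces only `notes`.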

Some other notes:

What are the main differences with the Grafeas/Xray approach? Not many, only small details.

Both seem to use things like OS/arch as "classifiers"/prefixes rather than qualifiers/suffixes for distro packages. IMHO this is not something that is higher in the identification hierarchy than the name/version (even though this is a major factor in locating arch-specific packages for Yum repos), so I prefer to stick with using a qualifier for these, especially since this is an indication of a build of a given version. It otherwise does not feel right to force to have an architecture identifier. But in any case the conversion from one to the other is straightforward.

For Docker, Grafeas uses a plain https URL which squats the namespace segment (e.g. gcr.io): squatting the super common https scheme as a package type is not great IMHO. I prefer the xray clear docker type.

Also, using @ as a separator for version/image/id is a variant to consider everywhere rather than making this an exception? e.g. use @ as a version separator all the way, like in SPDX where I specified using @ to separate a tag/commit (using the same convention as in pip VCS urls): https://github.com/spdx/spdx-spec/blob/cfa1b9d08903befdf03e669da6472707b7b60cb9/chapters/3-package-information.md#37-package-download-location- ?

in Grafeas/xray I find using a gav type for maven a tad cryptic, but not a big deal as long as they map to each other OK.

pombredanne commented 7 years ago

And here is a short summary: I propose we use five parts/fields to identify a package:

type: such as maven, npm, gem, pypi
namespace: such as a maven groupid, a debian distro
name: the name
version: the version
qualifier: extra things beyond the version such as OS, architecture, src or doc, etc. that qualify specific details for a given version

Most are optional and can also be composed in some URL as in: maven://org.apache.commons/io:1.2.3 e.g. type://namespace/name:version

The exact format to use for this URL is not fully specified yet.

And we add an extra field to point to an alternative package registry_url such as a private NPM repo.

pombredanne commented 7 years ago

Here is a revised design:

A package identifier is defined by six parts or fields that form a hierarchy, from the least specific to the most specific identifying information:

type: such as maven, npm, gem, pypi. required
namespace: such as a maven groupid, a debian distro. optional
name: the name. required
version: the version. optional
qualifiers: extra qualifying data for a package such as OS, architecture, src or doc. optional
path: optional path within a package. optional

At the minimum a type and a name are required. Other parts are optional.
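
As a rough illustration of the "discrete fields" form, the six parts could be held in a small Python dataclass like the one below. The class name `PackageId` and the choice of a dict for qualifiers are my assumptions for this sketch, not something fixed by the design.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PackageId:
    """Hypothetical container for the six parts; only type and name are required."""

    type: str
    name: str
    namespace: Optional[str] = None
    version: Optional[str] = None
    qualifiers: Dict[str, str] = field(default_factory=dict)
    path: Optional[str] = None

    def __post_init__(self):
        # At the minimum a type and a name are required.
        if not self.type or not self.name:
            raise ValueError("a package identifier needs at least a type and a name")


django = PackageId(type="pypi", name="django", version="1.11.1")
print(django)
```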

A package identifier is either discrete fields or a URL string using these conventions:

type:namespace/name@version?qualifiers#path

For instance:

maven:org.apache.xmlgraphics/batik-anim@1.9.1?packaging=sources --> the source jar
go:google.golang.org/genproto#googleapis/api/annotations --> a path inside a Go package repo
rpm:fedora-25/curl@7.50.3-1.fc25?arch=src --> the src RPM
rpm:fedora-25/curl@7.50.3-1.fc25?arch=i386 --> a i386 build
docker:scanning-customer/dockerimage@sha256:244fd47e07d1004f0aed9c156aa09083c82bf8944eceb67c946ff7430510a77b --> a Docker image with a specific id as version
docker:cassandra@cassandra --> a Docker image with a specific tag as version
pypi:django@1.11.1
gem:ruby-advisory-db-check@0.12.4
gem:jruby-launcher@1.1.2?platform=java --> a gem for JRuby
npm:%40angular/animation@12.3.1 --> a scoped node package

The string would be UTF-8 encoded, with percent-encoding where needed https://en.wikipedia.org/wiki/Percent-encoding with these rules for each part:

type: composed only of ASCII letters and numbers, + and - and _.
namespace: contains zero or more segments, separated by a slash. Each segment must be a percent-encoded string.
name: must be a percent-encoded string.
version: must be a percent-encoded string.
qualifiers: must be a percent-encoded string. The content structure is defined by the type. IMHO the best would be a HTTP URL query string with name/value pairs.
path: contains zero or more segments, separated by a slash. Each segment must be a percent-encoded string.
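
The encoding rules above could be sketched in Python roughly as follows. The function name `build_package_id` is hypothetical, and encoding the qualifiers as an HTTP-style query string with `urllib.parse.urlencode` follows the "IMHO the best would be" suggestion rather than a settled rule.

```python
import re
from urllib.parse import quote, urlencode

# type: only ASCII letters and numbers, plus "+", "-" and "_".
TYPE_RE = re.compile(r"[A-Za-z0-9+\-_]+")


def _encode_segments(value):
    """Percent-encode each slash-separated segment on its own."""
    return "/".join(quote(segment, safe="") for segment in value.split("/"))


def build_package_id(pkg_type, name, namespace=None, version=None,
                     qualifiers=None, path=None):
    """Assemble type:namespace/name@version?qualifiers#path (a sketch)."""
    if not TYPE_RE.fullmatch(pkg_type):
        raise ValueError("invalid type: %r" % pkg_type)
    if not name:
        raise ValueError("a name is required")
    result = pkg_type + ":"
    if namespace:
        result += _encode_segments(namespace) + "/"
    result += quote(name, safe="")
    if version:
        result += "@" + quote(version, safe="")
    if qualifiers:
        result += "?" + urlencode(qualifiers)
    if path:
        result += "#" + _encode_segments(path)
    return result


print(build_package_id("npm", "animation", namespace="@angular", version="12.3.1"))
print(build_package_id("maven", "batik-anim", namespace="org.apache.xmlgraphics",
                       version="1.9.1", qualifiers={"packaging": "sources"}))
```

Note that with `safe=""` a version such as `sha256:...` comes out with the colon encoded as `%3A`, unlike the docker example above where the colon is left as-is. Which characters may stay unencoded in each part is not settled by the rules as written.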

The parsing approach would be:

1. split the package identifier string once from the right on #; the right side is the path after percent-decoding.
2. split the left side from 1. once from the right on ?; the right side is the qualifiers after percent-decoding (and eventual parsing of the query string name/value pairs).
3. split the left side from 2. once from the right on /; the right side is the name after percent-decoding.
4. split the left side from 3. once from the right on :; the left side is the type, the right side is the namespace after percent-decoding.

In Python this would mean using the str.rpartition() function for the splits.
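
Here is a minimal Python sketch of the splitting steps above using `str.rpartition()`. Two things in it are my own assumptions and not part of the steps as written: an extra split on `@` to separate the version from the name, and the handling of identifiers that have no namespace (such as pypi:django@1.11.1). The names `parse_package_id` and `_split_right` are hypothetical, and the `type://` variant is not handled.

```python
from urllib.parse import parse_qsl, unquote


def _split_right(text, sep):
    """Split once from the right; return (left, right), right is None if sep is absent."""
    head, found, tail = text.rpartition(sep)
    return (head, tail) if found else (tail, None)


def parse_package_id(text):
    rest, path = _split_right(text, "#")      # path
    rest, query = _split_right(rest, "?")     # qualifiers
    rest, version = _split_right(rest, "@")   # assumed extra step: version
    head, name = _split_right(rest, "/")      # name
    if name is None:
        # Assumption: no slash means there is no namespace, only type:name.
        pkg_type, name = _split_right(head, ":")
        namespace = None
    else:
        pkg_type, namespace = _split_right(head, ":")   # type and namespace
        if namespace is None:
            raise ValueError("missing type in %r" % text)
    if not pkg_type or not name:
        raise ValueError("a type and a name are required: %r" % text)
    return {
        "type": pkg_type,
        "namespace": unquote(namespace) if namespace else None,
        "name": unquote(name),
        "version": unquote(version) if version else None,
        # parse_qsl percent-decodes the name/value pairs.
        "qualifiers": dict(parse_qsl(query)) if query else {},
        "path": unquote(path) if path else None,
    }


print(parse_package_id("rpm:fedora-25/curl@7.50.3-1.fc25?arch=src"))
print(parse_package_id("npm:%40angular/animation@12.3.1"))
```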

And to get an exact download, we either provide an optional registry base url if this is not on the standard public registry and/or an optional full direct download URL.

I cannot think of any case I know of that would not work with this approach. After all this is a URL which is a tried and true way to create identifiers and locators ;)

pombredanne commented 7 years ago

ok, I did not receive much pushback on the details, so I am assuming this is a good thing! @JonoYang suggested a great name for these: puurl standing for Package "mostly" Universal URL. The only cosmetic thing is whether to use type:// or type: for a "canonical" form. The // is not significant in any case here. e.g. this: maven:org.apache.commons/io@1.2.3 or this: maven://org.apache.commons/io@1.2.3

Aesthetically, the // looks much better I guess!

I have a Python implementation in the #275 branch here: https://github.com/nexB/scancode-toolkit/blob/275-streamline-package-manifests-models/src/packagedcode/models.py#L169 In particular the creation and parsing is straightforward.

It should be trivial to have a Go or Ruby or JS implementation

@R2wenD2 I would like to also contribute this spec to Grafeas FWIW.

R2wenD2 commented 7 years ago

Can you add a proposal issue to Grafeas? Feel free to copy your design proposal above - I just want to make sure folks interested in Grafeas have a chance to review. One small note - Grafeas can't support docker images by tag reference because tags are mutable.

pombredanne commented 7 years ago

@R2wenD2 you wrote:

Can you add a proposal issue to Grafeas? Feel free to copy your design proposal above - I just want to make sure folks interested in Grafeas have a chance to review.

Sure thing, that was the intent.

One small note - Grafeas can't support docker images by tag reference because tags are mutable.

Which is perfectly fine (I could not agree more and I always found these mutable tags to be a terrible wart). Both would be supported in puurls anyway but you could enforce in Grafeas that the version of a docker image MUST be a sha256 and reject plain tags. For containers the only difference with your current README would be that: https://gcr.io/scanning-customer/dockerimage@sha256:244fd47e07d1004f0aed9c156aa09083c82bf8944eceb67c946ff7430510a77b would become instead docker://gcr.io/scanning-customer/dockerimage@sha256:244fd47e07d1004f0aed9c156aa09083c82bf8944eceb67c946ff7430510a77b
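
For what it is worth, such a rule could be enforced with a very small check on the version part. This is only a sketch: the helper name `is_immutable_docker_version` is made up, and it assumes that a digest is always written as `sha256:` followed by 64 lowercase hex characters.

```python
import re

# Assumed digest shape: "sha256:" followed by exactly 64 lowercase hex digits.
SHA256_DIGEST = re.compile(r"sha256:[0-9a-f]{64}")


def is_immutable_docker_version(version):
    """Return True when the version is a sha256 digest rather than a mutable tag."""
    return SHA256_DIGEST.fullmatch(version) is not None


digest = "sha256:244fd47e07d1004f0aed9c156aa09083c82bf8944eceb67c946ff7430510a77b"
print(is_immutable_docker_version(digest))
print(is_immutable_docker_version("cassandra"))
```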

BTW any preferences to use // or not ?

pombredanne commented 7 years ago

Side note: I think that plain web URLs should NOT be puurls ... e.g. http/https/ftp URLs would never be valid puurls. They can be used for package identification otherwise and possibly extra qualifiers, but not as plain puurls. So a puurl may not be always enough to identify exactly a package "artifact". You may need to add:

a repository_url for alternative package repositories beside the default public one (such as a private npm registry or maven repo) .... though some puurl type/schemes may have this built in their namespace as is the case for docker://gcr.io/foo/bar where the first segment of the namespace is a docker registry, exactly the same way this is used with a docker pull command. and in this case, when omitted, it would point to the default public docker hub registry.
a direct download_url that could be off a traditional package repository
one or more checksums to identify and verify a file content integrity such as sha256, etc., though some puurl type/schemes may have this built in their version conventions (such as for a docker://abc/def@sha256:4545454 version that points to an image ID that is a checksum), or checksums could be assigned as key/value pairs in the qualifiers.

Also, the SPDX download locations for version control identifiers (that I contributed) at https://github.com/spdx/spdx-spec/blob/cfa1b9d08903befdf03e669da6472707b7b60cb9/chapters/3-package-information.md#37-package-download-location- , which are based on Python pip, could be valid puurls I guess. The syntax is the same and we could streamline these for puurls such as: vcs://host/path-to-repo@commitish#subpath-in-repo where the type for a vcs could be just git://, hg://, svn:// or else and not specify the exact transport (e.g. no https/http as in git+https:// as this is not essential)

Also some examples of puurls for things on github, gitlab or bitbucket:

github://nexB/scancode-toolkit@63a8c72868e3af061de
bitbucket://birkenfeld/pygments-main@4449acea8ee9afa9c447c50928b3da6bf60ae729

And the same examples of puurls for using a vcs type instead:

git://github.com/nexB/scancode-toolkit@63a8c72868e3af061de
hg://bitbucket.org/birkenfeld/pygments-main@4449acea8ee9afa9c447c50928b3da6bf60ae729
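
The mapping between the two sets of examples is mechanical, as this small Python sketch suggests. The `HOSTED_TO_VCS` table and the `to_vcs_form` name are hypothetical, and mapping bitbucket to hg only mirrors the example above (a hosting service may of course serve more than one VCS).

```python
# Hypothetical mapping from a hosting-service type to (vcs type, host).
HOSTED_TO_VCS = {
    "github": ("git", "github.com"),
    "bitbucket": ("hg", "bitbucket.org"),  # assumption taken from the example above
}


def to_vcs_form(puurl):
    """Rewrite e.g. github://owner/repo@commit as git://github.com/owner/repo@commit."""
    pkg_type, sep, rest = puurl.partition("://")
    if not sep or pkg_type not in HOSTED_TO_VCS:
        raise ValueError("unsupported puurl: %r" % puurl)
    vcs, host = HOSTED_TO_VCS[pkg_type]
    return "%s://%s/%s" % (vcs, host, rest)


print(to_vcs_form("github://nexB/scancode-toolkit@63a8c72868e3af061de"))
```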

So it looks like this is really fitting nicely overall and feels good in general. @elad165 all the credit for the original idea goes to you and Xray I guess, right? I would value your feedback a lot.

pombredanne commented 7 years ago

@sschuberth @mnonnenmacher IMHO it could be straightforward to evolve https://github.com/heremaps/oss-review-toolkit to use puurls too. @andrew could this be something you would consider as an addition to @librariesio and @dependencyci ? @reiz I guess that since you announced you are sunsetting @versioneye you may not care anymore at all ... and I am really sorry for this :| @jpopelka could this be something you would consider as an addition to fabric8 and openshift analytics ? @kartiksibal this is something that you should consider to adopt in vulnerablecode @singh1114 @RajuKoushik this is something that you should consider to adopt in scancode-server @chinyeungli this is something that you should contemplate to integrate in aboutcode-toolkit @jdaguil this is something that you should consider to adopt in aboutcode-manager

And also additional pings: your feedback would be much valued! @jeffmcaffer this is likely to be of some interest to you @goneall @kestewart this is likely to be of some interest to you for SPDX @jayfk this might be of interest to you for @pyup safety-db vulnerability database @ashcrow this might be of interest to you for @victims victims-cve-db vulnerability database @grnd this might be of interest to you for @snyk vulnerabilitydb vulnerability database @adulau @PidgeyL this might be of interest to you wrt the things we discussed to get better packages ids mapping to CVEs and vulnerabilities for @cve-search

sschuberth commented 7 years ago

ok, I did not receive much pushback on the details, so I am assuming this is a good thing!

@pombredanne FYI, the reason why I was not commenting simply is because this by now is way too much information for me to find the time to go through thoroughly. I'll see what I can do over the next week.

sschuberth commented 7 years ago

puurl standing for Package "mostly" Universal URL.

I'm sorry to spoil the party here, but this sounds a bit too much like poo-url to me ;-) How about puid or simply id / identifier (in the context of a package) instead?

sschuberth commented 7 years ago

BTW any preferences to use // or not ?

I think we should have //, and more specifically, design "puurls" as URIs (not URLs) which the existing URI parsing functions in most common languages accept. On top of that, those parts of the "puurl" that are most commonly used should be directly accessible via URI parsing, i.e. instead of

type:namespace/name@version?qualifiers#path

I'd propose

type://version@namespace/name?qualifiers#path

so that there's the following mapping (using terminology from java.net.URI):

| URI component | "puurl" component |
| --- | --- |
| scheme | type |
| userInfo (as part of the authority) | version |
| host (as part of the authority) | namespace |
| path | name |
| fragment | path |
| query | qualifiers |
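
To make the mapping concrete, here is how a stock parser splits such a string. The terminology above is from java.net.URI; this sketch uses Python's `urllib.parse.urlsplit` instead, on the assumption that its components line up closely enough for illustration (`username` standing in for userInfo, `hostname` for host).

```python
from urllib.parse import urlsplit

parts = urlsplit("maven://1.9.1@org.apache.xmlgraphics/batik-anim?packaging=sources")

print(parts.scheme)    # type
print(parts.username)  # version
print(parts.hostname)  # namespace
print(parts.path)      # name, with a leading slash
print(parts.query)     # qualifiers
print(parts.fragment)  # path (empty here)
```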

pombredanne commented 7 years ago

@sschuberth

FYI, the reason why I was not commenting simply is because this by now is way too much information for me to find the time to go through thoroughly. I'll see what I can do over the next week.

You are right .... this is a terribly bad place to review this. Let me push this in a separate repo in a PR amenable to review and comments.

I'm sorry to spoil the party here, but this sounds a bit too much like poo-url to me ;-) How about puid or simply id / identifier (in the context of a package) instead?

Sure why not, though this is meant to sound more like peeeeeerl than poo. We could call it puk too: package unique key

I think we should have //, and more specifically, design "puurls" as URIs (not URLs) which the existing URI parsing functions in most common languages accept.

As much as I was a purist about URI vs URL, the https://url.spec.whatwg.org/#goals says about this:

Standardize on the term URL. URI and IRI are just confusing.
In practice a single algorithm is used for both so keeping them
distinct is not helping anyone. 
URL also easily wins the search result popularity contest.

So calling them URL vs URI is not a big concern IMHO and they are locators alright I think.

I 'd propose type://version@namespace/name?qualifiers#path

The major drawback there is that this breaks the hierarchical property of the string: version is less significant than a namespace and name and it would come first here.

Also hacking the user/pass/host/port part of a URL is likely to eventually create confusion and ambiguities if the strings do not match what we expect.

In any case let me put this in a proper text rather than a ticket so that proper commenting is made possible

pombredanne commented 7 years ago

Here is a place for commenting on a draft: https://github.com/puurl/puurl-spec/pull/1

sschuberth commented 7 years ago

As much as I was a purist about URI vs URL, the https://url.spec.whatwg.org/#goals says about this:

I'm unsure about the authority of this resource, and I find the arguments a bit bogus ("URL also easily wins the search result popularity contest").

Semantically, an URI would be the correct thing for us as we'd like to identify, not locate.

The major drawback there is that this breaks the hierarchical property of the string:

So, what's the benefit of having that hierarchy? For parsing it does not matter.

pombredanne commented 7 years ago

@sschuberth you wrote:

Semantically, an URI would be the correct thing for us as we'd like to identify, not locate.

Per https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Conceptual_distinctions :

A URL is simply a URI that happens to point to a resource over a network.

And actually a puurl would be a URL in this definition as it is always a locator that points to resource over a network: the location is implied for a package type as based on the default repository of the type (e.g. rubygems.org for gem, maven.org for maven, pypi.python.org for pypi, registry.npmjs.org for npms, hub.docker.com for docker, etc.); and in some cases the location can be explicit for some type when included in the namespace such as with docker (e.g. docker://gcr.io/customer/dockerimage@sha256:244fd47e07d1004f0aed9c is not pointing to the default docker hub but to another "registry")

The major drawback there is that this breaks the hierarchical property of the string:

So, what's the benefit of having that hierarchy? For parsing it does not matter.

The benefit is that puurl strings are sortable without parsing: this is a valuable property when dealing with a large number of puurls in a database or even small numbers in UI lists.
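
A quick Python illustration of that property: a plain string sort already groups identifiers by type, then namespace, then name. The `io` and `django@1.10` entries are just sample values. One caveat that I am adding here: versions sort lexically, not by version semantics.

```python
puurls = [
    "pypi:django@1.11.1",
    "maven:org.apache.xmlgraphics/batik-anim@1.9.1",
    "gem:ruby-advisory-db-check@0.12.4",
    "maven:org.apache.commons/io@1.2.3",
    "pypi:django@1.10",
]

# No parsing needed: the most significant parts come first in the string.
for puurl in sorted(puurls):
    print(puurl)

# Caveat: plain string order is not version order.
print(sorted(["pypi:django@1.9", "pypi:django@1.10"]))
```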

But beside this, not hacking the user/pass part of a URI/URL is also to ensure that we do not depend on the specifics, history and constraints of how these parts are made and parsed, which could lead to subtle and weird errors, such as not serializing back a user/pass because this is something that typically should not be echoed. Using something that does not reuse any of the user/pass/host/port parts of a URL or URI also avoids dealing with all the baked-in complexity of idna, ip v4 and v6 addresses, host-relative paths and scheme-relative paths and so on. Most serious URI/URL parsing libraries deal with many of these corner cases at various levels: IMHO any benefit from reusing standard parsers as-is is likely outweighed by the issues that come with carrying the attached baggage, history and corner cases.
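
Two small examples of that baggage, using Python's `urllib.parse.urlsplit` as one representative standard parser. The RPM string with an epoch of 1 is a made-up value, used only to show what happens when a version containing a colon lands in the user/pass slot.

```python
from urllib.parse import urlsplit

# The host part is case-folded by the parser, which would silently change a
# case-sensitive namespace such as a GitHub user name.
hosted = urlsplit("github://nexB/scancode-toolkit")
print(hosted.netloc)    # raw authority, case preserved
print(hosted.hostname)  # lower-cased by the parser

# A version with a colon (e.g. an RPM epoch:version-release) is split into a
# user name and a password when placed before the "@".
rpm = urlsplit("rpm://1:7.50.3-1.fc25@fedora-25/curl")
print(rpm.username)
print(rpm.password)
```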

sschuberth commented 7 years ago

And actually a puurl would be a URL in this definition as it is always a locator that points to resource over a network

I was thinking a "puurl" to be more general in that it could also point to an artifact in a local Maven repository, for example. So it's not necessarily a network resource. Also, I still believe its primary purpose is to identify, not to locate, or?

The benefit is that puurl strings are sortable without parsing

Ok, granted, that might be useful.

In any case, pretty please with sugar on top, come up with a different name than "puurl". Pronouncing it like "peeeeeerl" (= "pee-url") does not make it any better.

pombredanne commented 7 years ago

@sschuberth you wrote:

I was thinking a "puurl" to be more general in that it could also point to an artifact in a local Maven repository, for example. So it's not necessarily a network resource. Also, I still believe its primary purpose is to identify, not to locate, or?

This is not necessarily a network resource indeed, the same way a file:// URL is a local, non-networked resource. I think that by making the location implicit in the type we have in the vast majority of cases a nice shorthand to both identify and locate a public package. In the special case of local-only packages, a download_url (eventually using a local file path) supplements this nicely IMHO.

As a side note since the "authority" is implied by default based on a type the use of :// may need to be banned and : would be the separator after a type?

Also we could define a standard qualifier key/value pair to point to an alternate repository URL, since this is qualification and not primary data IMHO?
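
As a sketch of how this could work: look for a well-known qualifier key first and fall back to the default registry implied by the type. The key name `repository_url`, the `registry_for` helper, the `foobar` package and the `npm.example.com` host are all assumptions for illustration; the default hosts are the ones mentioned earlier in this thread.

```python
from urllib.parse import parse_qsl, urlencode

# Default public registries implied by a type (as listed earlier in the thread).
DEFAULT_REGISTRIES = {
    "gem": "rubygems.org",
    "pypi": "pypi.python.org",
    "npm": "registry.npmjs.org",
}


def registry_for(puurl):
    """Return the alternate registry from the qualifiers, else the type's default."""
    head, _, _ = puurl.partition("#")      # drop the optional path
    head, _, query = head.partition("?")   # isolate the qualifiers
    pkg_type, _, _ = head.partition(":")
    qualifiers = dict(parse_qsl(query))    # percent-decodes the values
    return qualifiers.get("repository_url", DEFAULT_REGISTRIES.get(pkg_type))


private = "npm:foobar@12.3.1?" + urlencode({"repository_url": "https://npm.example.com"})
print(private)
print(registry_for(private))
print(registry_for("npm:foobar@12.3.1"))
```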

In any case, pretty please with sugar on top, come up with a different name than "puurl". Pronouncing it like "peeeeeerl" (= "pee-url") does not make it any better.

I think that @JonoYang was thinking more about the sound of a cat :cat: purr https://en.wikipedia.org/wiki/Purr rather than pee or poo (I cannot believe we are discussing potty in a serious ticket :rofl: )

I like the purring side.

But sure, we can do better! What about puke as in Package Universal Key Enumeration? somehow I fear you might object to this too though!

Or purrl as in Package Uniform Real Resource Locator? with a cat :smirk_cat: bonus.

Now more seriously what if we call it simply package_url or purl for short? (this has other meanings per https://en.wikipedia.org/wiki/Purl_(disambiguation) but that does not matter too much)

sschuberth commented 7 years ago

:// may need to be banned and : would be the separator after a type?

I'd still prefer to keep it, as it simply reads more common if you're used to URIs.

since this is qualification and not primary data IMHO?

I'd be fine with that.

What about puke as in Package Universal Key Enumeration?

You must be kidding me!

purl and purrl are fine with me, although I'd still prefer puid.

pombredanne commented 7 years ago

@sschuberth

I'd still prefer to keep it [://], as it simply reads more common if you're used to URIs.

That's ok, this is not significant anyway in the case of a purl.

You must be kidding me!

I was, of course ;)

Let's go with purl

andrew commented 7 years ago

Happy to implement on Libraries.io once the spec is finished 👌

pombredanne commented 7 years ago

All: I created a separate and neutral org GitHub @package-url and published something that starts to look like a decent draft spec at https://github.com/package-url/purl-spec/tree/initial-draft and this PR https://github.com/package-url/purl-spec/pull/1

Please provide feedback and comments over there instead! This would be much appreciated.

pombredanne commented 7 years ago

I pushed a Python purl lib at https://github.com/package-url/purl-python (and on Pypi as a pypi:purl-python@0.2.0 purl :smile_cat: ) as well as a JSON test fixture to use as a language neutral test suite by any implementation at https://github.com/package-url/purl-test-suite

@ashcrow started working on a Go implementation :heart_eyes: at https://github.com/package-url/purl-go

See other refinements and discussions at https://github.com/package-url/purl-spec/pull/1 And a growing witty FAQ at https://github.com/package-url/purl-spec/wiki/FAQ

As for scancode, I will be adopting the purl-python library in the branch: https://github.com/nexB/scancode-toolkit/tree/275-streamline-package-manifests-models which will become the base for improved package scanning and detection as per #832 when merged in develop.

Next up will be to adopt it in other https://aboutcode.org tools such as scancode-server and also in vulnerablecode. In this latter emerging tool, this will actually remove a mental road block to cleanly map CVE vulnerabilities to actual software packages (and map these to CPEs when they exist)

pombredanne commented 7 years ago

FYI @ashcrow contributed a Go implementation at https://github.com/package-url/packageurl-go :tada:

pombredanne commented 6 years ago

Package URL is now implemented alright in develop and works well. Next step is to call the purl spec a 1.0

pombredanne commented 6 years ago

I am closing this now. The Package URL lives its own life now at https://github.com/package-url ... and is heavily used in ScanCode and other places. Thank you all for the contributions and feedback!
