Closed pombredanne closed 7 months ago
@pombredanne Here's a manual mockup. Is this what you have in mind for the restructured `meta` output? And what about the outputs for the other 3 current commands?
This needs to be a list to handle multiple input PURLs. (For now, I left the nested purl where it is in the output we get from fetchcode/package.py `info()` -- do you want me to try to move that field to the top from code inside the purlcli.py `meta` command?)
[
{
"headers": [
{
"purl": "pkg:pypi/fetchcode"
}
],
"packages": [
{
"metadata": [
{
"type": "pypi",
"namespace": null,
"name": "fetchcode",
"version": null,
"qualifiers": {},
"subpath": null,
"primary_language": null,
"description": null,
"release_date": null,
"parties": [],
"keywords": [],
"homepage_url": "https://github.com/nexB/fetchcode",
"download_url": null,
"api_url": "https://pypi.org/pypi/fetchcode/json",
"size": null,
"sha1": null,
"md5": null,
"sha256": null,
"sha512": null,
"bug_tracking_url": null,
"code_view_url": null,
"vcs_url": null,
"copyright": null,
"license_expression": null,
"declared_license": "Apache-2.0",
"notice_text": null,
"root_path": null,
"dependencies": [],
"contains_source_code": null,
"source_packages": [],
"purl": "pkg:pypi/fetchcode",
"repository_homepage_url": null,
"repository_download_url": null,
"api_data_url": null
}
// {additional dictionaries -- 1 for each version of the package}
]
}
]
}
]
So the "metadata" should be called "packages" and in general we want the same output structure as scancode-toolkit. The input purls should be in a "header". We need a bit more design on the URLs, but the urls should be a single mapping, like the subset of the packages and not a list of mappings.
@pombredanne I am now lost re the structure and content of the `meta` output. This is what fetchcode `info()` produces. Do we want something different? If so, what?
the "metadata" should be called "packages"
-- but my mockup is taken directly from SCTK output (or is it dated and has that changed again?) `headers` and `packages` are the 2 top level fields for a scan and so would be the same for EACH input PURL, right?
in general we want the same output structure as scancode-toolkit
-- that's what this mockup is. What do you mean?
The input purls should be in a "header".
-- they already are in this mockup.
@johnmhoran sorry for the confusion, we crossed paths. In https://github.com/nexB/purldb/issues/247#issuecomment-1906630287 I replied to https://github.com/nexB/purldb/issues/247#issuecomment-1906429880 and to your latest mockup https://github.com/nexB/purldb/issues/247#issuecomment-1906606993
Unfortunately that does not clarify or respond to my comments/questions.
I am not going to attempt to change any current data structure or content until we can clarify what we want. I will try now to access the packageurl-python purl2url code, run `pip install -e .` there as @JonoYang explained yesterday (at least that's my understanding), make some edits there, and see if I can access that new purl2url code from my current purldb branch where my PURL CLI code lives.
@johnmhoran re: https://github.com/nexB/purldb/issues/247#issuecomment-1906606993
Do not use nesting in a list and do not further nest in metadata. Instead use something like this for the URLs, which is the same structure as ScanCode TK.
The same goes for the metadata (there is just more of it). For URLs, if there is an option to validate that the URLs exist, you could return nothing for those that do not, and return error messages too.
Adopting the same structure means that tools that know scancode format will support this too.
{
"headers": [
{
"tool_name": "purlcli",
"tool_version": "v32.0.8-156-g7b867f3bec",
"options": {
"command": "meta",
"--purl": [
"pkg:pypi/scancode-toolkit@2.0.0",
"pkg:pypi/scancode-toolkit@3.0.0"
],
"--output": "foo.json"
},
"errors": [],
"warnings": []
}
],
"packages": [
{
"purl": "pkg:pypi/scancode-toolkit@32.0.8",
"homepage_url": "https://github.com/nexB/scancode-toolkit",
"download_url": null,
"bug_tracking_url": null,
"code_view_url": null,
"vcs_url": null,
"repository_homepage_url": null,
"repository_download_url": null,
"api_data_url": null
},
{
"purl": "pkg:pypi/scancode-toolkit@2.0.8",
"homepage_url": "https://github.com/nexB/scancode-toolkit",
"download_url": null,
"bug_tracking_url": null,
"code_view_url": null,
"vcs_url": null,
"repository_homepage_url": null,
"repository_download_url": null,
"api_data_url": null
}
]
}
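A minimal sketch of assembling that ScanCode-like layout in Python, for illustration only. The function name `build_output`, its parameters and the placeholder tool version are hypothetical and not purlcli's actual code; the key names follow the mockup above.

```python
import json


def build_output(command, purls, output, packages, tool_version="0.0.0"):
    """Return a ScanCode-like mapping: one header plus a flat package list.

    Hypothetical helper; the key names mirror the mockup in this thread.
    """
    header = {
        "tool_name": "purlcli",
        "tool_version": tool_version,  # placeholder value
        "options": {
            "command": command,
            "--purl": list(purls),
            "--output": output,
        },
        "errors": [],
        "warnings": [],
    }
    # "packages" is a flat list of mappings, with no extra per-PURL nesting.
    return {"headers": [header], "packages": list(packages)}


if __name__ == "__main__":
    result = build_output(
        command="urls",
        purls=["pkg:pypi/scancode-toolkit@2.0.0"],
        output="foo.json",
        packages=[
            {"purl": "pkg:pypi/scancode-toolkit@2.0.0", "homepage_url": None}
        ],
    )
    print(json.dumps(result, indent=2))
```

Keeping the header and the package list as two sibling top-level keys is what would let ScanCode-aware tools read the output without a special case.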
@pombredanne Are you referring now to the new `urls` command, or to the `meta` command, which is what we've been discussing most recently? And are you saying you want this structure for all 4 current command outputs, or just `meta`? or just `urls` -- the content you list is URLs but not the meta content, as you can see from my examples above of meta content.
Are you referring now to the new urls command, or to the meta command, which is what we've been discussing most recently? And are you saying you want this structure for all 4 current command outputs, or just meta? or just urls -- the content you list is URLs but not the meta content, as you can see from my examples above of meta content.
Yes, for meta, urls and versions. validate is special, but should still adopt a similar output.
@JonoYang I ran `pip install -e .` in my new packageurl-python repo branch, added a simple print function at the bottom of the purl2url.py file, and tried to call it from my purldb purlcli.py `urls` command -- no dice. Not sure how I might have strayed from your guidance of yesterday.
(venv) Tue Jan 23, 2024 10:58 AM /home/jmh/dev/nexb/purldb jmh (247-purl-cli-add-urls)
$ python -m purldb_toolkit.purlcli urls --purl pkg:pypi/fetchcode@0.1.0 --output -
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/jmh/dev/nexb/purldb/purldb-toolkit/src/purldb_toolkit/purlcli.py", line 402, in <module>
purlcli()
File "/home/jmh/dev/nexb/purldb/venv/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/jmh/dev/nexb/purldb/venv/lib/python3.8/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/jmh/dev/nexb/purldb/venv/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/jmh/dev/nexb/purldb/venv/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/jmh/dev/nexb/purldb/venv/lib/python3.8/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/jmh/dev/nexb/purldb/purldb-toolkit/src/purldb_toolkit/purlcli.py", line 256, in get_urls
purl_urls = get_url_details(purls)
File "/home/jmh/dev/nexb/purldb/purldb-toolkit/src/purldb_toolkit/purlcli.py", line 332, in get_url_details
test_print = purl2url.print_hello(purl)
AttributeError: module 'packageurl.contrib.purl2url' has no attribute 'print_hello'
(venv) Tue Jan 23, 2024 10:58 AM /home/jmh/dev/nexb/purldb jmh (247-purl-cli-add-urls)
$
@JonoYang In case I misinterpreted your guidance, I also tried to run `pip install -e .` in the purldb venv packageurl directory, but got a different error.
(venv) Tue Jan 23, 2024 11:24 AM /home/jmh/dev/nexb/purldb/venv/lib/python3.8/site-packages jmh (247-purl-cli-add-urls)
$ cd packageurl
(venv) Tue Jan 23, 2024 11:24 AM /home/jmh/dev/nexb/purldb/venv/lib/python3.8/site-packages/packageurl jmh (247-purl-cli-add-urls)
$ ll
total 48
drwxr-xr-x 4 jmh jmh 4096 2023-12-18 17:48:55.241376800 -0800 ./
drwxr-xr-x 255 jmh jmh 16384 2024-01-18 12:12:22.245371200 -0800 ../
-rw-r--r-- 1 jmh jmh 17351 2023-12-18 17:48:55.231376800 -0800 __init__.py
drwxr-xr-x 2 jmh jmh 4096 2023-12-18 17:50:43.351376800 -0800 __pycache__/
drwxr-xr-x 5 jmh jmh 4096 2023-12-18 17:48:55.241376800 -0800 contrib/
-rw-r--r-- 1 jmh jmh 0 2023-12-18 17:48:55.231376800 -0800 py.typed
(venv) Tue Jan 23, 2024 11:24 AM /home/jmh/dev/nexb/purldb/venv/lib/python3.8/site-packages/packageurl jmh (247-purl-cli-add-urls)
$ pip install -e .
Obtaining file:///home/jmh/dev/nexb/purldb/venv/lib/python3.8/site-packages/packageurl
ERROR: file:///home/jmh/dev/nexb/purldb/venv/lib/python3.8/site-packages/packageurl does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.
[notice] A new release of pip available: 22.2.2 -> 23.3.2
[notice] To update, run: pip install --upgrade pip
(venv) Tue Jan 23, 2024 11:28 AM /home/jmh/dev/nexb/purldb/venv/lib/python3.8/site-packages/packageurl jmh (247-purl-cli-add-urls)
$
My purldb repo venv does not contain a `packageurl-python` directory.
But I can access `purl2url` with this:
from packageurl.contrib import purl2url
Do I need to install packageurl-python in that venv, and then from that new venv directory run `pip install -e .`?
Since I'm unable to access my test function in the local packageurl-python repo from my work in the local purldb repo, I will put the purl2url work aside -- will need assistance on figuring this out. Meanwhile, will turn to restructuring the output of the 4 existing commands per @pombredanne 's significant restructuring requests of this morning.
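One way to narrow down this kind of problem is to check which file Python actually imports a module from. The sketch below uses only the standard library; the helper name `module_origin` is made up for illustration. In the purldb venv the module name to pass would be `packageurl.contrib.purl2url`. If the reported path sits under `site-packages` rather than in the local packageurl-python checkout, the editable install is probably not the copy being imported.

```python
import importlib.util


def module_origin(module_name):
    """Return the file a module would be imported from, or None if not found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name cannot be imported.
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # "json" is used here only so the sketch runs in any environment;
    # substitute "packageurl.contrib.purl2url" inside the purldb venv.
    print(module_origin("json"))
```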
@johnmhoran Let's have a session regarding the venv stuff.
@pombredanne @JonoYang I'm making progress on restructuring the JSON output, starting with the `meta` command. When the PURLs are identified in the command itself with one or more `--purl` flags, the JSON now lists them. When they are submitted instead in a file, our JSON currently looks like this excerpt:
{
"headers": [
{
"tool_name": "purlcli",
"tool_version": "___",
"options": {
"command": "meta",
"--purl": [],
"--file": "/mnt/c/nexb/purldb-testing/2024-current-01-testing/txt-input/2024-01-23-purl-meta-input-01.txt",
"--output": "/mnt/c/nexb/purldb-testing/2024-current-01-testing/json-output/2024-01-24-meta--output-01.json"
},
"errors": [],
"warnings": [
"There was an error with your 'pkg:pypi/matchcode' query. Make sure that 'pkg:pypi/matchcode' actually exists in the relevant repository."
]
}
],
"packages": [
. . .
In this case, do we want the JSON to include a list of the PURLs contained in the input file? If so, this could be reported in the `options` section, perhaps something like this immediately below the `--file` key-value pair:
"options": {
"command": "meta",
"--purl": [],
"--file": "/mnt/c/nexb/purldb-testing/2024-current-01-testing/txt-input/2024-01-23-purl-meta-input-01.txt",
"--file_purls": [
"pkg:pypi/fetchcode",
"pkg:pypi/matchcode",
"pkg:pypi/minecode"
],
"--output": "/mnt/c/nexb/purldb-testing/2024-current-01-testing/json-output/2024-01-24-meta--output-01.json"
},
However, that doesn't look correct since `--file_purls` is not actually an option. What do you think?
@johnmhoran Maybe something like this, where we have a list of all purls passed into the command, outside of options
{
"headers": [
{
"tool_name": "purlcli",
"tool_version": "___",
"options": {
"command": "meta",
"--purl": [],
"--file": "/mnt/c/nexb/purldb-testing/2024-current-01-testing/txt-input/2024-01-23-purl-meta-input-01.txt",
"--output": "/mnt/c/nexb/purldb-testing/2024-current-01-testing/json-output/2024-01-24-meta--output-01.json"
},
"purls": [
"pkg:pypi/fetchcode",
"pkg:pypi/matchcode",
"pkg:pypi/minecode"
],
"errors": [],
"warnings": [
"There was an error with your 'pkg:pypi/matchcode' query. Make sure that 'pkg:pypi/matchcode' actually exists in the relevant repository."
]
}
],
...
Thanks @JonoYang. That would definitely work. There'd be some redundant data in that case since I think we'd also want to populate that list when there is no file and the PURLs are identified with the `--purl` flag(s) and thus are also listed in the `--purl` value under `options`. In any event, thanks for a good solution.
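A rough sketch of the header layout discussed here, with one `purls` list outside `options` that is filled in whether the PURLs arrive via `--purl` flags or via an input file. The helper names `collect_purls` and `build_header` are hypothetical, and the header is trimmed to the relevant keys.

```python
def collect_purls(purl_args, file_lines=()):
    """Merge PURLs from --purl flags with PURLs read from an input file.

    Blank lines are skipped and the input order is preserved.
    """
    purls = list(purl_args)
    for line in file_lines:
        line = line.strip()
        if line:
            purls.append(line)
    return purls


def build_header(command, purl_args, file_path=None, file_lines=()):
    """Build a header whose "options" echo the CLI input as given."""
    return {
        "options": {
            "command": command,
            "--purl": list(purl_args),
            "--file": file_path,
        },
        # Every PURL that will be processed, regardless of how it was supplied.
        "purls": collect_purls(purl_args, file_lines),
        "errors": [],
        "warnings": [],
    }


if __name__ == "__main__":
    lines = ["pkg:pypi/fetchcode\n", "\n", "pkg:pypi/minecode\n"]
    print(build_header("meta", [], "input.txt", lines))
```

With `--purl` input the list repeats what is already under `options`, which is the redundancy mentioned above; the benefit is that a consumer only has to look in one place.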
@pombredanne @JonoYang I just pushed a commit (and a second fixer-upper) and opened a new PR -- https://github.com/nexB/purldb/pull/281.
`meta`, `validate`, `versions` and `urls`. More to follow.
`meta` code and tests updated with the SCTK-like data structure and ready for feedback.
`urls` now -- created a rough work-in-progress first draft which I'm refactoring/beefing up (with tests to follow) to adopt the SCTK-like data structure. It will also need additional URLs, some to come from my future work with `purl2url.py` in `packageurl-python`, which I've forked and wired up to my local `purldb` code/repo.
`validate` and `versions` commands were the first two I created and have good code and tests for the old data structure -- informative output in a demo but not yet ready for prime time -- I'll turn to those (including more tests) as soon as I've finished the `urls` code/test work.
@JonoYang Does this give me a list of the current PURL types supported in the API `validate` endpoint?
from packagedb.package_managers import VERSION_API_CLASSES_BY_PACKAGE_TYPE
. . .
for k, v in VERSION_API_CLASSES_BY_PACKAGE_TYPE.items():
print(f"{k} = {v}")
The output:
gem = <class 'packagedb.package_managers.RubyVersionAPI'>
hex = <class 'packagedb.package_managers.HexVersionAPI'>
cargo = <class 'packagedb.package_managers.CratesVersionAPI'>
composer = <class 'packagedb.package_managers.ComposerVersionAPI'>
pypi = <class 'packagedb.package_managers.PypiVersionAPI'>
nuget = <class 'packagedb.package_managers.NugetVersionAPI'>
deb = <class 'packagedb.package_managers.DebianVersionAPI'>
maven = <class 'packagedb.package_managers.MavenVersionAPI'>
npm = <class 'packagedb.package_managers.NpmVersionAPI'>
golang = <class 'packagedb.package_managers.GoproxyVersionAPI'>
Trying to dig into the details a bit further, this command
python -m purldb_toolkit.purlcli validate --purl pkg:cargo/rand --purl pkg:composer/uuid --purl pkg:deb/2ping --purl pkg:gem/small_wonder --purl pkg:golang/github.com/golang/glog --purl pkg:hex/zzz --purl pkg:maven/com.google.appengine/appengine-tools-sdk --purl pkg:npm/abbrev --purl pkg:nuget/log4net --purl pkg:pypi/dejacode --purl pkg:rubygems/bundler-sass --purl pkg:ubuntu/zzz --purl pkg:jmh/zzz --output -
gives me the following output and states that each PURL is valid but that `check_existence` is not supported for `pkg:rubygems/bundler-sass`, `pkg:ubuntu/zzz` or `pkg:jmh/zzz`.
I expected no support for `pkg:jmh` but was not sure what to expect for `pkg:rubygems` or `pkg:ubuntu` (guessing the latter is handled under `pkg:deb`).
(venv) Mon Feb 05, 2024 01:10 PM /home/jmh/dev/nexb/purldb jmh (247-purl-cli-add-urls)
$ python -m purldb_toolkit.purlcli validate --purl pkg:cargo/rand --purl pkg:composer/uuid --purl pkg:deb/2ping --purl pkg:gem/small_wonder --purl pkg:golang/github.com/golang/glog --purl pkg:hex/zzz --purl pkg:maven/com.google.appengine/appengine-tools-sdk --purl pkg:npm/abbrev --purl pkg:nuget/log4net --purl pkg:pypi/dejacode --purl pkg:rubygems/bundler-sass --purl pkg:ubuntu/zzz --purl pkg:jmh/zzz --output -
VERSION_API_CLASSES_BY_PACKAGE_TYPE = {'composer': <class 'packagedb.package_managers.ComposerVersionAPI'>, 'pypi': <class 'packagedb.package_managers.PypiVersionAPI'>, 'nuget': <class 'packagedb.package_managers.NugetVersionAPI'>, 'deb': <class 'packagedb.package_managers.DebianVersionAPI'>, 'maven': <class 'packagedb.package_managers.MavenVersionAPI'>, 'npm': <class 'packagedb.package_managers.NpmVersionAPI'>, 'golang': <class 'packagedb.package_managers.GoproxyVersionAPI'>, 'gem': <class 'packagedb.package_managers.RubyVersionAPI'>, 'hex': <class 'packagedb.package_managers.HexVersionAPI'>, 'cargo': <class 'packagedb.package_managers.CratesVersionAPI'>}
composer = <class 'packagedb.package_managers.ComposerVersionAPI'>
pypi = <class 'packagedb.package_managers.PypiVersionAPI'>
nuget = <class 'packagedb.package_managers.NugetVersionAPI'>
deb = <class 'packagedb.package_managers.DebianVersionAPI'>
maven = <class 'packagedb.package_managers.MavenVersionAPI'>
npm = <class 'packagedb.package_managers.NpmVersionAPI'>
golang = <class 'packagedb.package_managers.GoproxyVersionAPI'>
gem = <class 'packagedb.package_managers.RubyVersionAPI'>
hex = <class 'packagedb.package_managers.HexVersionAPI'>
cargo = <class 'packagedb.package_managers.CratesVersionAPI'>
[
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:cargo/rand"
},
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:composer/uuid"
},
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:deb/2ping"
},
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:gem/small_wonder"
},
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:golang/github.com/golang/glog"
},
{
"valid": true,
"exists": false,
"message": "The provided PackageURL is valid, but does not exist in the upstream repo.",
"purl": "pkg:hex/zzz"
},
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:maven/com.google.appengine/appengine-tools-sdk"
},
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:npm/abbrev"
},
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:nuget/log4net"
},
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:pypi/dejacode"
},
{
"valid": true,
"exists": null,
"message": "The provided PackageURL is valid, but `check_existence` is not supported for this package type.",
"purl": "pkg:rubygems/bundler-sass"
},
{
"valid": true,
"exists": null,
"message": "The provided PackageURL is valid, but `check_existence` is not supported for this package type.",
"purl": "pkg:ubuntu/zzz"
},
{
"valid": true,
"exists": null,
"message": "The provided PackageURL is valid, but `check_existence` is not supported for this package type.",
"purl": "pkg:jmh/zzz"
}
]
(venv) Mon Feb 05, 2024 01:19 PM /home/jmh/dev/nexb/purldb jmh (247-purl-cli-add-urls)
$
The `meta` command relies on `fetchcode/package.py`, which includes this router -- `@router.route("pkg:rubygems/.*")` -- suggesting that `pkg:rubygems` is supported.
The `urls` command relies on `packageurl-python/src/packageurl/contrib/purl2url.py` and includes both `@repo_router.route("pkg:(gem|rubygems)/.*")` and `@download_router.route("pkg:(gem|rubygems)/.*")`, suggesting that `pkg:rubygems` (and `pkg:gem`) is supported.
I suppose the explanation is that these packages are related and perhaps use only the `gem` type, but if that's documented or explained somewhere, I have yet to find it.
Does this give me a list of the current PURL types supported in the API validate endpoint?
I think it should show you all the package types that it can look up, but @keshav-space can tell you more.
...check_existence is not supported for pkg:rubygems/bundler-sass, pkg:ubuntu/zzz or pkg:jmh/zzz.
My guess is that on the purldb side, we didn't associate `rubygems` with `packagedb.package_managers.RubyVersionAPI` in `VERSION_API_CLASSES_BY_PACKAGE_TYPE`, so that's why it says `check_existence` isn't supported.
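If that guess is right, the behaviour comes down to a plain dictionary lookup keyed by PURL type. The sketch below is only a stand-in: `REGISTRY` imitates the shape of `VERSION_API_CLASSES_BY_PACKAGE_TYPE` with strings instead of classes, and `check_existence_supported` is a made-up name, not purldb code.

```python
# Stand-in for VERSION_API_CLASSES_BY_PACKAGE_TYPE: the real mapping holds
# version API classes; strings are used here so the sketch is self-contained.
REGISTRY = {
    "gem": "RubyVersionAPI",
    "deb": "DebianVersionAPI",
    "pypi": "PypiVersionAPI",
}


def check_existence_supported(purl_type, registry=REGISTRY):
    """Return True only if the exact type string is a key of the registry."""
    return purl_type in registry


if __name__ == "__main__":
    for purl_type in ("gem", "rubygems", "deb", "ubuntu", "jmh"):
        print(purl_type, check_existence_supported(purl_type))
```

Under this reading, `rubygems` and `ubuntu` miss simply because they are not keys, even though `gem` and `deb` are.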
I suppose the explanation is that these packages are related and perhaps use only the `gem` type, but if that's documented or explained somewhere, I have yet to find it.
The package-url repo has the specs for purl and the package types: https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst
Thanks @JonoYang. I've visited the spec page many times but it does not explain the references to both pkg:gem and pkg:rubygems since there it has only the former. A bit confusing, or incomplete.
And https://github.com/package-url/packageurl-python includes a handful of references to `pkg:rubygems`:
>>> from packageurl.contrib import purl2url
>>> purl2url.get_repo_url("pkg:rubygems/bundler@2.3.23")
"https://rubygems.org/gems/bundler/versions/2.3.23"
>>> purl2url.get_download_url("pkg:rubygems/bundler@2.3.23")
"https://rubygems.org/downloads/bundler-2.3.23.gem"
>>> purl2url.get_inferred_urls("pkg:rubygems/bundler@2.3.23")
["https://rubygems.org/gems/bundler/versions/2.3.23", "https://rubygems.org/downloads/bundler-2.3.23.gem",]
Does this give me a list of the current PURL types supported in the API validate endpoint?
I think it should show you all the package types that it can look up, but @keshav-space can tell you more.
@johnmhoran for RubyGems we use `gem` as the type; see https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst#gem
GET /api/validate/?purl=pkg:gem/bundler-sass&check_existence=true
HTTP 200 OK
Allow: GET, HEAD, OPTIONS
Content-Type: application/json
Vary: Accept
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:gem/bundler-sass"
}
For Ubuntu packages we use `deb` as the type and `ubuntu` as the namespace; more here: https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst#deb
GET /api/validate/?purl=pkg:deb/ubuntu/curl&check_existence=true
HTTP 200 OK
Allow: GET, HEAD, OPTIONS
Content-Type: application/json
Vary: Accept
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:deb/ubuntu/curl"
}
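For reference, the two requests above can be built in Python with the standard library. This is a sketch under stated assumptions: the host `https://purldb.example` is a placeholder and `build_validate_url` is a hypothetical helper. Only the `/api/validate/` path and the `purl` and `check_existence` parameters come from the examples in this thread.

```python
from urllib.parse import urlencode


def build_validate_url(base_url, purl, check_existence=True):
    """Build a validate-endpoint URL for one PURL."""
    query = {"purl": purl}
    if check_existence:
        query["check_existence"] = "true"
    # Keep ':', '/' and '@' readable instead of percent-encoding them.
    encoded = urlencode(query, safe=":/@")
    return f"{base_url.rstrip('/')}/api/validate/?{encoded}"


if __name__ == "__main__":
    # Placeholder host; the type/namespace split follows the spec examples.
    print(build_validate_url("https://purldb.example", "pkg:gem/bundler-sass"))
    print(build_validate_url("https://purldb.example", "pkg:deb/ubuntu/curl"))
```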
Thank you @keshav-space. In my `validate` command (which accesses the `validate` endpoint), I noticed that both `pkg:deb/2ping` and `pkg:deb/debian/2ping` succeed, while in the `versions` command (which calls `versions()` in `fetchcode/package_versions.py`), `pkg:deb/2ping` returns `None` (I catch that with an `if` and print a warning) while `pkg:deb/debian/2ping` returns a list of versions.
(venv) Tue Feb 06, 2024 07:49 AM /home/jmh/dev/nexb/purldb jmh (247-purl-cli-add-urls)
$ python -m purldb_toolkit.purlcli validate --purl pkg:deb/2ping --purl pkg:deb/debian/2ping --output -
[
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:deb/2ping"
},
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:deb/debian/2ping"
}
]
(venv) Tue Feb 06, 2024 07:50 AM /home/jmh/dev/nexb/purldb jmh (247-purl-cli-add-urls)
$ python -m purldb_toolkit.purlcli versions --purl pkg:deb/2ping --purl pkg:deb/debian/2ping --output -
There was an error with your 'pkg:deb/2ping' query. Make sure that 'pkg:deb/2ping' actually exists in the relevant repository.
[
{
"purl": "pkg:deb/debian/2ping",
"versions": [
{
"purl": "pkg:deb/debian/2ping@4.5-1.2",
"version": "4.5-1.2",
"release_date": "None"
},
{
"purl": "pkg:deb/debian/2ping@4.5-1.1",
"version": "4.5-1.1",
"release_date": "None"
},
{
"purl": "pkg:deb/debian/2ping@4.5-1",
"version": "4.5-1",
"release_date": "None"
},
{
"purl": "pkg:deb/debian/2ping@4.3-1",
"version": "4.3-1",
"release_date": "None"
},
{
"purl": "pkg:deb/debian/2ping@3.2.1-1+deb9u1",
"version": "3.2.1-1+deb9u1",
"release_date": "None"
},
{
"purl": "pkg:deb/debian/2ping@2.1.1-1",
"version": "2.1.1-1",
"release_date": "None"
},
{
"purl": "pkg:deb/debian/2ping@2.0-1",
"version": "2.0-1",
"release_date": "None"
}
]
}
]
(venv) Tue Feb 06, 2024 07:50 AM /home/jmh/dev/nexb/purldb jmh (247-purl-cli-add-urls)
$
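The None handling described above could look roughly like the sketch below. It is not the purlcli code: `collect_versions` is a hypothetical name, and the lookup function is passed in as a parameter (standing in for fetchcode's `versions()`) so the example runs without any network access. The warning text is copied from the output above.

```python
def collect_versions(purls, versions_func):
    """Look up versions for each PURL, turning None results into warnings."""
    results, warnings = [], []
    for purl in purls:
        found = versions_func(purl)
        if found is None:
            # No data came back for this PURL: record a warning and move on.
            warnings.append(
                f"There was an error with your '{purl}' query. "
                f"Make sure that '{purl}' actually exists in the relevant repository."
            )
            continue
        results.append({"purl": purl, "versions": list(found)})
    return results, warnings


if __name__ == "__main__":
    def fake_versions(purl):
        # Stand-in lookup: only namespaced deb PURLs return data.
        return ["4.5-1.2", "4.5-1.1"] if "/debian/" in purl else None

    print(collect_versions(["pkg:deb/2ping", "pkg:deb/debian/2ping"], fake_versions))
```

Collecting the warnings in a list, rather than only printing them, is what would later allow them to be reported in the `warnings` field of the header.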
@keshav-space Re `pkg:gem/` vs. `pkg:rubygems/`, I've noticed that my `meta` command (which calls `info()` in `fetchcode/package.py`) supports `rubygems` but not `gem`.
(venv) Tue Feb 06, 2024 08:11 AM /home/jmh/dev/nexb/purldb jmh (247-purl-cli-add-urls)
$ python -m purldb_toolkit.purlcli meta --purl pkg:gem/bundler-sass --purl pkg:rubygems/bundler-sass --output -
The provided PackageURL 'pkg:gem/bundler-sass' is valid, but `meta` is not supported for this package type.
{
"headers": [
{
"tool_name": "purlcli",
"tool_version": "0.0.1",
"options": {
"command": "meta",
"--purl": [
"pkg:gem/bundler-sass",
"pkg:rubygems/bundler-sass"
],
"--file": null,
"--output": "<stdout>"
},
"purls": [
"pkg:gem/bundler-sass",
"pkg:rubygems/bundler-sass"
],
"errors": [],
"warnings": [
"The provided PackageURL 'pkg:gem/bundler-sass' is valid, but `meta` is not supported for this package type."
]
}
],
"packages": [
{
"purl": "pkg:rubygems/bundler-sass",
"type": "rubygems",
"namespace": null,
"name": "bundler-sass",
"version": null,
"qualifiers": {},
"subpath": null,
"primary_language": null,
"description": null,
"release_date": null,
"parties": [],
"keywords": [],
"homepage_url": "http://github.com/vogelbek/bundler-sass",
"download_url": "https://rubygems.org/gems/bundler-sass-0.1.2.gem",
"api_url": "https://rubygems.org/api/v1/gems/bundler-sass.json",
"size": null,
"sha1": null,
"md5": null,
"sha256": null,
"sha512": null,
"bug_tracking_url": null,
"code_view_url": null,
"vcs_url": null,
"copyright": null,
"license_expression": null,
"declared_license": [
"MIT"
],
"notice_text": null,
"root_path": null,
"dependencies": [],
"contains_source_code": null,
"source_packages": [],
"repository_homepage_url": null,
"repository_download_url": null,
"api_data_url": null
}
]
}
(venv) Tue Feb 06, 2024 08:13 AM /home/jmh/dev/nexb/purldb jmh (247-purl-cli-add-urls)
$
@keshav-space Similarly, my nascent `urls` command (which calls `packageurl/contrib/purl2url.py`) also supports `pkg:rubygems/` but not `pkg:gem/`. (The error message is still under development as you can see from the output.)
(venv) Tue Feb 06, 2024 08:18 AM /home/jmh/dev/nexb/purldb jmh (247-purl-cli-add-urls)
$ python -m purldb_toolkit.purlcli urls --purl pkg:gem/bundler-sass --purl pkg:rubygems/bundler-sass --output -
From construct_headers():
valid_but_not_supported
From get_urls_details():
valid_but_not_supported
{
"headers": [
{
"tool_name": "purlcli",
"tool_version": "0.0.1",
"options": {
"command": "urls",
"--purl": [
"pkg:gem/bundler-sass",
"pkg:rubygems/bundler-sass"
],
"--file": null,
"--output": "<stdout>"
},
"purls": [
"pkg:gem/bundler-sass",
"pkg:rubygems/bundler-sass"
],
"errors": [],
"warnings": [
"valid_but_not_supported"
]
}
],
"packages": [
{
"purl": "pkg:rubygems/bundler-sass",
"download_url": {
"url": null
},
"inferred_urls": [
{
"url": "https://rubygems.org/gems/bundler-sass"
}
],
"repo_download_url": {
"url": null
},
"repo_download_url_by_package_type": {
"url": null
},
"repo_url": {
"url": "https://rubygems.org/gems/bundler-sass"
},
"url": {
"url": "https://rubygems.org/gems/bundler-sass"
}
}
]
}
(venv) Tue Feb 06, 2024 08:18 AM /home/jmh/dev/nexb/purldb jmh (247-purl-cli-add-urls)
$
This is weird; shouldn't we go by the spec https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst#gem ?
Yes, exactly my point (and the source of my confusion re `gem` vs `rubygems` as types ;-).
I have a correction re the `urls` command, which calls `packageurl-python/src/packageurl/contrib/purl2url.py` -- this appears to support both `pkg:gem/` and `pkg:rubygems/`, using `rubygems.org` URLs in its output.
If I understand the code correctly, `purl2url.py` has two primary sections that govern the generation of two categories of URLs:
`repo_url` (which uses the decorator `@repo_router.route()`)
`download_url` (which uses the decorator `@download_router.route()`)
Each of these sections uses as a decorator argument/parameter a definition of the particular package type(s) it covers, e.g., `@repo_router.route("pkg:cargo/.*")`.
And each section includes both `gem` and `rubygems` in one of its decorators: `@repo_router.route("pkg:(gem|rubygems)/.*")` and `@download_router.route("pkg:(gem|rubygems)/.*")`.
See https://github.com/package-url/packageurl-python/blob/main/src/packageurl/contrib/purl2url.py#L173-L186 and https://github.com/package-url/packageurl-python/blob/main/src/packageurl/contrib/purl2url.py#L294-L305
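To illustrate how a single route pattern can cover both type names, here is a tiny regex-based router. It is a simplified stand-in written for this comment, not the router used by packageurl-python: `MiniRouter` and `build_rubygems_repo_url` are hypothetical, and the URL builder ignores any version in the PURL.

```python
import re


class MiniRouter:
    """Map regex patterns to handler functions via a decorator."""

    def __init__(self):
        self._routes = []

    def route(self, pattern):
        compiled = re.compile(pattern)

        def decorator(func):
            self._routes.append((compiled, func))
            return func

        return decorator

    def process(self, purl):
        # Return the result of the first handler whose pattern matches.
        for compiled, func in self._routes:
            if compiled.match(purl):
                return func(purl)
        return None


repo_router = MiniRouter()


@repo_router.route("pkg:(gem|rubygems)/.*")
def build_rubygems_repo_url(purl):
    # Simplified: take the name after the type and drop any version.
    name = purl.split("/", 1)[1].split("@", 1)[0]
    return f"https://rubygems.org/gems/{name}"


if __name__ == "__main__":
    print(repo_router.process("pkg:gem/bundler-sass"))
    print(repo_router.process("pkg:rubygems/bundler-sass"))
    print(repo_router.process("pkg:jmh/zzz"))
```

The alternation `(gem|rubygems)` is what makes both spellings reach the same handler, while an unregistered type falls through to `None`.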
@JonoYang It looks like the `validate` endpoint `check_existence` step includes a version if included (`@`) in the input PURL as part of its validation process, but that the `check_existence` step does not handle PURL qualifiers (`?`) or subpaths (`#`) and instead just ignores them. Is that an accurate statement?
I think it's accurate and on that basis plan to strip identifiable qualifiers and subpaths from incoming PURLs before processing, to note that in the `warnings`, and to note that in the help section. (I'm including the `check_existence` step as a default in our `validate` CLI command -- no additional option/flag to invoke.)
For illustration, in a recent CLI test the `validate` endpoint concluded that
`pkg:pypi/dejacode@5.0.0?os=windows` "is valid, and the package exists in the upstream repo."
`pkg:pypi/dejacode@5.0.0os=windows` "is valid, but does not exist in the upstream repo."
`pkg:pypi/dejacode@5.0.0?how_is_the_weather=rainy` "is valid, and the package exists in the upstream repo."
`pkg:pypi/dejacode@5.0.0#how/are/you` "is valid, and the package exists in the upstream repo."
(venv) Wed Feb 07, 2024 05:36 PM /home/jmh/dev/nexb/purldb jmh (247-purl-cli-add-urls)
$ python -m purldb_toolkit.purlcli validate --purl pkg:pypi/dejacode --purl pkg:pypi/dejacode@5.0.0 --purl pkg:pypi/dejacode@5.0.0?os=windows --purl pkg:pypi/dejacode@5.0.0os=windows --purl pkg:pypi/dejacode@5.0.0?how_is_the_weather=rainy --purl pkg:pypi/dejacode@5.0.0#how/are/you --purl pkg:pypi/dejacode@10.0.0 --purl pkg:cargo/rand@0.7.2 --purl pkg:nginx/nginx@0.8.9?os=windows --output -
[
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:pypi/dejacode"
},
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:pypi/dejacode@5.0.0"
},
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:pypi/dejacode@5.0.0?os=windows"
},
{
"valid": true,
"exists": false,
"message": "The provided PackageURL is valid, but does not exist in the upstream repo.",
"purl": "pkg:pypi/dejacode@5.0.0os=windows"
},
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:pypi/dejacode@5.0.0?how_is_the_weather=rainy"
},
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:pypi/dejacode@5.0.0#how/are/you"
},
{
"valid": true,
"exists": false,
"message": "The provided PackageURL is valid, but does not exist in the upstream repo.",
"purl": "pkg:pypi/dejacode@10.0.0"
},
{
"valid": true,
"exists": true,
"message": "The provided Package URL is valid, and the package exists in the upstream repo.",
"purl": "pkg:cargo/rand@0.7.2"
},
{
"valid": true,
"exists": null,
"message": "The provided PackageURL is valid, but `check_existence` is not supported for this package type.",
"purl": "pkg:nginx/nginx@0.8.9?os=windows"
}
]
(venv) Wed Feb 07, 2024 05:48 PM /home/jmh/dev/nexb/purldb jmh (247-purl-cli-add-urls)
$
@johnmhoran
It looks like the validate endpoint check_existence step includes a version if included (@) in the input PURL as part of its validation process, but that the check_existence step does not handle PURL qualifiers (?) or subpaths (#) and instead just ignores them. Is that an accurate statement?
Yes. It looks like we use the type, namespace, and name of the package, then collect all the available versions and then check to see if the version specified in the purl is in the list of available versions. (https://github.com/nexB/purldb/blob/main/packagedb/api.py#L810)
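As a rough illustration of that membership check (not the code in packagedb/api.py): `version_exists` is a hypothetical helper, and treating a PURL without a version as existing whenever any version is found is an assumption based on the CLI output above.

```python
def version_exists(requested_version, available_versions):
    """Check a PURL's version against the versions found upstream."""
    if not requested_version:
        # Assumption: with no version given, any upstream version counts.
        return bool(available_versions)
    return requested_version in set(available_versions)


if __name__ == "__main__":
    upstream = ["4.0.0", "5.0.0"]  # made-up version list for illustration
    print(version_exists("5.0.0", upstream))
    print(version_exists("10.0.0", upstream))
    print(version_exists("5.0.0os=windows", upstream))
```

This would also explain why `pkg:pypi/dejacode@5.0.0os=windows` is reported as not existing: without the `?` separator the whole string after `@` is read as the version.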
I think it's accurate and on that basis plan to strip identifiable qualifiers and subpaths from incoming PURLs before processing,
I think that's fair for now. Thinking a bit, it would be good for us to eventually handle qualifiers in the validate endpoint, especially in the case of maven purls, where it would be useful to see which specific artifacts are available on maven.
Thanks @JonoYang. And I agree re eventually handling qualifiers.
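The existence check described above (use type, namespace and name, collect the available versions, then look for the requested version) can be sketched as follows. This is a minimal illustration, not the code in packagedb/api.py: `split_purl` is a hypothetical hand-written splitter rather than the packageurl library, `available_versions` stands in for whatever the upstream repo returns, and the handling of a PURL with no version is an assumption.

```python
def split_purl(purl):
    """Tiny PURL splitter for illustration only (hypothetical helper)."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a PURL: {purl!r}")
    rest = purl[len("pkg:"):]
    rest, _, _subpath = rest.partition("#")     # subpath is ignored
    rest, _, _qualifiers = rest.partition("?")  # qualifiers are ignored
    path, _, version = rest.partition("@")
    parts = path.split("/")
    type_, name = parts[0], parts[-1]
    namespace = "/".join(parts[1:-1]) or None
    return type_, namespace, name, version or None


def check_existence(purl, available_versions):
    """Return True if the version in `purl` is among `available_versions`.

    `available_versions` stands in for the versions collected upstream
    for the PURL's (type, namespace, name).
    """
    _type, _namespace, _name, version = split_purl(purl)
    if version is None:
        # Assumption: with no version requested, any known version counts.
        return bool(available_versions)
    return version in available_versions
```

With `["5.0.0"]` as the available versions, `pkg:pypi/dejacode@5.0.0?how_is_the_weather=rainy` and `pkg:pypi/dejacode@5.0.0#how/are/you` both pass because the qualifier and subpath are dropped, while `pkg:pypi/dejacode@5.0.0os=windows` fails because the whole string after `@` is treated as the version.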
@JonoYang This can wait till next week but in case you have an immediate thought: I'm working on the `meta` command and want to remove all version, qualifier and subpath data because it's not needed (the underlying function ignores it). In the `headers` section, I currently keep all such original input PURLs for the record, but in the `packages` section, the code processes only the "stripped" PURLs.
Currently, for each input PURL I handle it on its own, so a command like this
python -m purldb_toolkit.purlcli meta --purl pkg:pypi/fetchcode@5.0.0 --purl pkg:pypi/dejacode@5.0.0 --purl pkg:pypi/dejacode@5.0.0?os=windows --purl pkg:pypi/dejacode@5.0.0os=windows --purl pkg:pypi/dejacode@5.0.0?how_is_the_weather=rainy --purl pkg:pypi/dejacode@5.0.0#how/are/you --purl pkg:pypi/dejacode?os=windows --purl pkg:cargo/banquo#some/short/path --purl pkg:nginx/nginx@0.8.9?os=windows --output -
will have 6 sets of `dejacode` output `meta` data because there are 6 `dejacode` inputs with various substrings that will get removed. An alternative is to keep track of the "stripped" PURLs and when we have a repeat, all but the first will not be processed and thus these 6 `dejacode` inputs will result in output data as if there were just 1 `dejacode` input.
The latter approach seems cleaner, though it uses some memory. With that approach, I would summarize this process in the `--help`, and add a single warning in `headers` (and not 1 for each "stripped" PURL) that this action has taken place without any reference to which input PURLs were affected.
When you have the chance, please let me know what you think.
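A minimal sketch of the second approach (track the "stripped" PURLs and skip repeats), assuming that stripping means cutting at the first `@`, `?` or `#`. The function names are hypothetical and this is plain string handling, not the purlcli implementation.

```python
import re


def strip_purl(purl):
    """Cut the PURL at the first '@', '?' or '#' and drop the rest."""
    return re.split(r"[@?#]", purl, maxsplit=1)[0]


def unique_stripped(purls):
    """Return stripped PURLs in input order, keeping the first of each repeat."""
    seen = set()  # the memory cost mentioned above: one entry per distinct PURL
    result = []
    for purl in purls:
        stripped = strip_purl(purl)
        if stripped not in seen:
            seen.add(stripped)
            result.append(stripped)
    return result
```

Applied to the nine `--purl` values in the command above, this yields four entries: `pkg:pypi/fetchcode`, `pkg:pypi/dejacode`, `pkg:cargo/banquo` and `pkg:nginx/nginx`.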
@JonoYang @pombredanne Here's an example of the `headers` section of the `meta` output when we strip the `@`, `?` and `#` separators (and the characters that follow) from the input PURLs.
`packages` section.
`headers` section.
{
"headers": [
[
{
"tool_name": "purlcli",
"tool_version": "0.0.1",
"options": {
"command": "meta",
"--purl": [
"pkg:pypi/fetchcode@5.0.0",
"pkg:pypi/dejacode@5.0.0",
"pkg:pypi/dejacode@5.0.0?os=windows",
"pkg:pypi/dejacode@5.0.0os=windows",
"pkg:pypi/dejacode@5.0.0?how_is_the_weather=rainy",
"pkg:pypi/dejacode@5.0.0#how/are/you",
"pkg:pypi/dejacode?os=windows",
"pkg:cargo/banquo#some/short/path",
"pkg:nginx/nginx@0.8.9?os=windows"
],
"--file": null,
"--output": "<stdout>"
},
"purls": [
"pkg:pypi/fetchcode@5.0.0",
"pkg:pypi/dejacode@5.0.0",
"pkg:pypi/dejacode@5.0.0?os=windows",
"pkg:pypi/dejacode@5.0.0os=windows",
"pkg:pypi/dejacode@5.0.0?how_is_the_weather=rainy",
"pkg:pypi/dejacode@5.0.0#how/are/you",
"pkg:pypi/dejacode?os=windows",
"pkg:cargo/banquo#some/short/path",
"pkg:nginx/nginx@0.8.9?os=windows"
],
"processed_purls": [
"pkg:pypi/fetchcode",
"pkg:pypi/dejacode",
"pkg:cargo/banquo",
"pkg:nginx/nginx"
],
"errors": [],
"warnings": [
"One or more input PURLs have been stripped to enable proper processing. The final set of processed PURLs is listed in the 'processed_purls' field above.",
"The provided PackageURL 'pkg:nginx/nginx' is valid, but `meta` does not support this package type."
]
}
]
],
"packages": [
{
"purl": "pkg:pypi/fetchcode",
"type": "pypi",
"namespace": null,
"name": "fetchcode",
. . .
@johnmhoran re:
Here's an example of the headers section of the meta output when we strip the @, ? and # separators (and the characters that follow) from the input PURLs.
Some questions:
I assume you mean stripping not in the literal sense but that you parse the PURL instead using the library code. If not you should use this
This is not stripping in all cases, but normalizing
why would you ever remove the version? If I asked for one, do not remove this.
I should have the option not to have any such PURL normalization done. I am wondering if this normalization SHOULD not be done by default, as this is surprising
If you issue a warning, it would be best to return shorter, concise, self-standing messages.
In "The final set of processed PURLs is listed in the 'processed_purls' field above." the "field above" has no practical meaning as there is no such concept of fields and above in a JSON document. Instead just return one warning for each normalized PURL. For instance:
input PURL: "pkg:pypi/dejacode@5.0.0?os=windows" normalized to "pkg:pypi/dejacode@5.0.0"
input PURL: "pkg:cargo/banquo#some/short/path" normalized to "pkg:cargo/banquo"
For "The provided PackageURL 'pkg:nginx/nginx' is valid, but `meta` does not support this package type.", maybe say instead, in a shorter, telegraphic style: 'pkg:nginx/nginx' not supported with "meta" command
do not deduplicate by default, the input may be weird, but that's not for this tool to fix. Instead add a --unique option to only return unique PURLs. Especially the default use of normalization and deduping feels unnatural and surprising. We want no surprise :)
`meta` should be renamed `metadata` IMHO as this may be more obvious. Or maybe `details`?
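The suggestions above (keep the version, make normalization and deduplication opt-in, and emit one self-contained warning per normalized PURL) could look roughly like this. It is a sketch only: the function and parameter names are hypothetical, and real code would parse the PURL with the library rather than split strings.

```python
def normalize_purl(purl):
    """Drop qualifiers ('?...') and subpath ('#...') but keep the version."""
    without_subpath = purl.split("#", 1)[0]
    return without_subpath.split("?", 1)[0]


def process(purls, normalize=False, unique=False):
    """Return (processed_purls, warnings); nothing is changed by default."""
    processed = []
    warnings = []
    for purl in purls:
        result = normalize_purl(purl) if normalize else purl
        if result != purl:
            # One warning per normalized PURL, naming both forms explicitly.
            warnings.append(f'input PURL: "{purl}" normalized to "{result}"')
        if unique and result in processed:
            continue  # only dedupe when explicitly requested
        processed.append(result)
    return processed, warnings
```

Each warning names the PURL it is about, so it can be read without reference to any other part of the output.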
Thanks @pombredanne . Rather than replying to your replies, it's best for us to find a time when I can demonstrate what the output looks like without trying to strip or normalize or whatever term you prefer. We have not yet done that and that's not ideal. Meanwhile, on Monday I'll back out the efforts I made trying to bring some sanity to the `meta` output. (The other 3 commands have similar issues.)
We're relying on the operation of the underlying tools/functions -- `meta` for example uses the `info()` function in `fetchcode/package.py`. If a user submits the various dejacode-based examples in my sample command above, the output is the same for all of them, except the input string of the PURL plus whatever else follows appears in the output naming. `info()` does not pay any attention to version data or qualifiers or subpath in the incoming PURL. If that's what you want, we'll do that, but I doubt you'll be happy with it when you see it -- without my changes, the command above outputs 6 copies of identical `info()` data re the dejacode PURL except that each adopts the PURL+trailing string from the input.
BTW, I use `meta` because that is the term you suggested in your earlier comment. Lots of moving targets in what passes for our "design" documentation.
@pombredanne Just reread your comments -- when I resume tomorrow, I'll restructure `metadata` (fka `meta`) to provide the output as it existed pre-normalization and convert the normalization code to a `--unique` option as you suggested.
Re `field`s, IBM and Oracle don't share your view that there is no such concept in a JSON document. ;-)
re:
Re fields, IBM and Oracle don't share your view that there is no such concept in a JSON document. ;-)
My point was mostly about using "above" to reference anything in a warning or a message that is elsewhere. These messages should be self-contained and any reference to something else should use its identifier (like an actual PURL).
Using "field" as a name is perfectly fine!
As for version, it should be honored. If the version is ignored in fetchcode, this is a missing feature or a bug
Looking at https://github.com/nexB/fetchcode/blob/master/src/fetchcode/package.py I see some things that raise some questions....
Why do we yield either one package without version in https://github.com/nexB/fetchcode/blob/d0a3fa9bb56dc3a77f7d3d7bd5b8d0e40c7a8612/src/fetchcode/package.py#L132 or possibly yield the same versions of a package twice?
Why do we have duplicated code in https://github.com/nexB/fetchcode/blob/d0a3fa9bb56dc3a77f7d3d7bd5b8d0e40c7a8612/src/fetchcode/package.py#L112 and https://github.com/nexB/fetchcode/blob/d0a3fa9bb56dc3a77f7d3d7bd5b8d0e40c7a8612/src/fetchcode/package_versions.py ?
@keshav-space @TG1999 ping :)
Few observations:
`package.info` has some inconsistent behavior: if I use `info` for `cargo`, `npm`, `pypi`, `github`, `bitbucket` I get package metadata for all the versions of the package, but in the case of `rubygem` I get metadata just for a single package version.
In `get_github_data_from_purl` we're using the GitHub REST API to get all the releases, but by default the GitHub REST API will only return the last 30 releases. Instead, we should use paginated requests or switch to the GQL API. https://github.com/nexB/fetchcode/issues/100
`get_npm_data_from_purl` will not be able to properly handle scoped npm packages as it ignores the namespace altogether. https://github.com/nexB/fetchcode/issues/99
Yes @pombredanne, since we're yielding metadata for all the versions of the package in `package.info` we should reuse the `cargo`, `npm` and `pypi` code in `package_versions.versions`. https://github.com/nexB/fetchcode/issues/101
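For the 30-release limit, a paginated request loop might look like the sketch below. The GitHub REST API accepts `per_page` (default 30, maximum 100) and `page` query parameters on the releases listing; everything else here is an assumption. `fetch_page` is a hypothetical callable that performs one HTTP request and returns the decoded list for that page, and this is not fetchcode's code.

```python
def iter_all_releases(fetch_page, per_page=100):
    """Yield every release by requesting pages until a short page is returned.

    `fetch_page(page, per_page)` is a hypothetical callable doing one request,
    e.g. GET /repos/{owner}/{repo}/releases?per_page=<per_page>&page=<page>.
    """
    page = 1
    while True:
        releases = fetch_page(page, per_page)
        yield from releases
        if len(releases) < per_page:
            # A short (or empty) page means there is nothing left to fetch.
            break
        page += 1
```

Passing the fetcher in as a callable keeps the loop testable without network access; the caller supplies the actual HTTP client and authentication.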
Thanks @keshav-space so this is likely material to draft issues @ fetchcode?
Yes, I will create issues for these on the fetchcode side.
I've just pushed an update adding a `--unique` option to the `metadata` (formerly `meta`) command. Default is to not normalize; `--unique` will result in normalization. `metadata` tests have been updated as well.
This addresses all of @pombredanne 's very helpful recent comments, though I have a few equally helpful comments from @JonoYang to address and will do so promptly. Next will be updating the `urls` command (including adding a test suite), to be followed by doing the same for the `validate` and `versions` commands and then creating several additional commands awaiting my attention.
@pombredanne @JonoYang I'm back on the `urls` command now. About to add the `--unique` option as I did with `metadata`, but `urls` has different issues, e.g., as I think I've reported before, it returns data for versions and other separators that do not actually exist:
#)
@)
This is default behavior atm; I expect `--unique` would fix that by normalizing (removing all separators and their strings) and deduping the resulting PURLs. But do we want to permit these sorts of default examples without any further vetting or warning?
In `metadata`, I handle non-existent versions like this:
"warnings": [
"'pkg:pypi/fetchcode@5.0.0' could not be fetched",
"'pkg:pypi/dejacode@5.0.0os=windows' could not be fetched",
Non-existent qualifiers and subpaths are not handled (atm) by `metadata` or the underlying `info()` function in `fetchcode/package.py`.
As I've noted in the past, the output data for the 4 current commands has all sorts of oddities that I don't think we'd want to permit but currently do.
BTW, I plan to handle non-existent versions in `urls` as I do in `metadata` (an existing `check_existence()` function that uses the `validate` endpoint).
As I've noted in the past, the output data for the 4 current commands has all sorts of oddities that I don't think we'd want to permit but currently do.
Can you be specific?
Note that the important thing is to handle correctly the main, common use cases. Oddities for corner cases are OK.
Please focus first on the common case: a plain PURL with a version or without. The cases of qualifiers and subpaths are oddities and should be tended to later (or never)
Thanks @pombredanne . Here's the current approach, using `metadata` (which calls fetchcode package.py's `info()`) as an example. Say we have these input PURLs:
purls = [ "pkg:pypi/dejacode", "pkg:pypi/dejacode@5.0.0", "pkg:pypi/dejacode@5.0.0?os=windows", "pkg:pypi/dejacode@10.0.0", "pkg:gem/bundler-sass", "pypi/dejacode", ]
"pkg:pypi/dejacode" is supported by `info()` (and thus `metadata`) and exists in pypi.org
"pkg:pypi/dejacode@5.0.0" is supported by `info()` (and thus `metadata`) and exists in pypi.org and atm is the only version in pypi.org
"pkg:pypi/dejacode@5.0.0?os=windows" does not exist in pypi.org
"pkg:pypi/dejacode@10.0.0" does not exist in pypi.org
"pkg:gem/bundler-sass" is not supported by `info()` (although rubygems is)
"pypi/dejacode" is not a valid PURL
"pkg:pypi/dejacode" returns 2 OrderedDict objects: one with no version and some URL data (a sort of "generic" URL report) and a second for the sole actual version, 5.0.0, with whatever URL data we gather/generate, e.g., a download_url.
"pkg:pypi/dejacode@5.0.0" returns exactly the same as above except the version value for the first of the two objects is shown as '5.0.0'.
"pkg:pypi/dejacode@5.0.0?os=windows" returns exactly the same as above except the version value for the first of the two objects is shown as '5.0.0' and qualifiers is shown as {'os': 'windows'}.
"pkg:pypi/dejacode@10.0.0" returns exactly the same as above except the version value for the first of the two objects is shown as '10.0.0'.
"pkg:gem/bundler-sass" is None
"pypi/dejacode" is None
`info()` does not accurately and clearly provide available info re the input PURLs. So, as part of the default behavior, `metadata` (and `urls` atm, and maybe others going forward) also queries the `validate` endpoint, and translates the results to both printed warnings and warnings added to the JSON `headers` `warnings` list.
"pkg:pypi/dejacode" preserves the `info()` return.
"pkg:pypi/dejacode@5.0.0" preserves the `info()` return.
"pkg:pypi/dejacode@5.0.0?os=windows" preserves the `info()` return.
"pkg:pypi/dejacode@10.0.0" prints and adds the warning "'pkg:pypi/dejacode@10.0.0' could not be fetched" (but this relies on `validate` and thus should say 'does not exist in the upstream repo')
"pkg:gem/bundler-sass" prints and adds the warning "'pkg:gem/bundler-sass' not supported with `metadata` command"
"pypi/dejacode" prints and adds the warning "'pypi/dejacode' not valid"
The `--unique` flag will also remove the version, qualifiers and subpath data and dedupe the resulting PURLs.
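The translation from a `validate` result to a warning, as described above, could be sketched like this. It assumes a result mapping with the `valid` and `exists` keys shown in the validate output earlier in this thread; `build_warning` and `supported_types` are hypothetical names, and the "does not exist" wording is the one proposed above rather than the current "could not be fetched".

```python
def build_warning(purl, validation, supported_types, command="metadata"):
    """Return a short, self-contained warning for `purl`, or None.

    `validation` is assumed to carry the "valid" and "exists" keys of a
    validate result; `supported_types` is a hypothetical set of PURL types
    that the underlying function can handle.
    """
    if not validation.get("valid"):
        return f"'{purl}' not valid"
    purl_type = purl[len("pkg:"):].split("/", 1)[0]
    if purl_type not in supported_types:
        return f"'{purl}' not supported with `{command}` command"
    if validation.get("exists") is False:
        return f"'{purl}' does not exist in the upstream repo"
    return None  # valid, supported, and not known to be missing
```

Checks run from most to least fundamental (validity, then type support, then existence), so each input PURL produces at most one warning.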
To best support using various PURL-based services, I would like to have a command client tool and library as a client API that can expose these services for integration elsewhere.