aboutcode-org / purldb

Tools to create and expose a database of purls (Package URLs). This project is sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase/ and nexB for https://www.aboutcode.org/ Chat is at https://gitter.im/aboutcode-org/discuss
https://purldb.readthedocs.io/

Create PURL services CLI tool and library #247

Closed pombredanne closed 5 months ago

pombredanne commented 9 months ago

To best support using various PURL-based services, I would like to have a command client tool and library as a client API that can expose these services for integration elsewhere.

johnmhoran commented 9 months ago

@pombredanne @AyanSinhaMahapatra

I've looked at the SCTK fetch_thirdparty.py example, but I have to admit that I don't understand what a complete command for that utility would look like or how it might apply to the current issue. Examples of how to run the fetch_thirdparty example would be helpful for me to explore how that works. (I've looked but found no documentation/examples for that utility.)

In addition, the description above of the current issue seems rather vague. What does it mean to create a client API tool to access PURL services? Examples of PURL services we want to handle, and some descriptions of user input and output, would be particularly helpful.

The only exposure I've had so far with the PurlDB is the experimentation I've done since last Friday evening with the new validate endpoint.

johnmhoran commented 9 months ago

@pombredanne Now that we've (initially) addressed the validate endpoint with our new CLI, what additional "services" do you want me to focus on, and how can I identify them and begin to understand how users use those services?

johnmhoran commented 8 months ago

@pombredanne As noted last week, I'm blocked for now from additional CLI work until we can add the missing details to your initial description of this issue, i.e., ID the additional services, commands and use cases we want to include.

pombredanne commented 8 months ago

So the next steps after validate (and after adding tests to validate) would be to use the latest and new fetchcode as a library to add two new sub commands:

After this I would like to see these:

johnmhoran commented 8 months ago

@pombredanne Re the first bullet above -- a versions subcommand based on fetchcode -- are you looking for this sort of output, or perhaps just a list of versions as strings? (This is an excerpt from the pkg:pypi/scancode-toolkit output.)

        purl_versions = [
            [
                PackageVersion(
                    value="2.0.0",
                    release_date=datetime.datetime(
                        2017, 6, 23, 8, 35, 20, 322426, tzinfo=tzutc()
                    ),
                ),
                PackageVersion(
                    value="2.0.0rc3",
                    release_date=datetime.datetime(
                        2017, 6, 16, 16, 24, 2, 443222, tzinfo=tzutc()
                    ),
                ),
. . .
johnmhoran commented 8 months ago

Compare just the versions as strings, e.g.,

results_values = ['2.0.0', '2.0.0rc3', '2.0.1', '2.1.0', . . . '32.0.5rc3', '32.0.6', '32.0.7', '32.0.8']
johnmhoran commented 8 months ago

@pombredanne Do we want to get version data for both one PURL and for multiple PURLs, depending on the user's need? (Just as we do with validating either a single PURL or a list of PURLs.)

Also: What should the output look like: a list of string versions, or JSON (and if so, what would it look like)?

pombredanne commented 8 months ago

Always start with a single PURL. Expanding to a list is easy.

The output could be either:

  1. something like {"purl": "... input purl", "versions": ["1.1", "2.3" , ....]}
  2. or maybe better: {"purl": "... input purl", "versions": [{"purl": "pkg:...@1.1.2", "version": "1.1.2"}, {"purl": "pkg:...@1.1.3", "version": "1.1.3"}]}

This will account for multiple PURLs in both cases.

Eventually the output will need to account for the input in some header instead, much like in a ScanCode scan, but this is for the future and nothing urgent for now.

pombredanne commented 8 months ago

Manage objects internally, and deal with simple/plain serialized Python data at the end only. Adding the release date of each version works too BTW, just make sure you use an ISO timestamp like it is done in our other APIs.
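A minimal sketch of that approach, assuming a stand-in `PackageVersion` dataclass (hypothetical; the real fetchcode class may differ) and a hypothetical `serialize_versions` helper. Objects are kept until the last step and only then turned into plain data with ISO timestamps:

```python
import datetime
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class PackageVersion:
    # Stand-in for the fetchcode object shown earlier; the fields are assumed.
    value: str
    release_date: Optional[datetime.datetime] = None


def serialize_versions(purl, package_versions):
    """Return plain, JSON-ready data for one input PURL (hypothetical helper)."""
    # Drop any version on the input so each entry gets its own versioned PURL.
    base_purl = purl.split("@", 1)[0]
    versions = []
    for pv in package_versions:
        release_date = pv.release_date.isoformat() if pv.release_date else None
        versions.append(
            {
                "purl": f"{base_purl}@{pv.value}",
                "version": pv.value,
                "release_date": release_date,
            }
        )
    return {"purl": purl, "versions": versions}


example = serialize_versions(
    "pkg:pypi/scancode-toolkit",
    [
        PackageVersion(
            "2.0.0",
            datetime.datetime(
                2017, 6, 23, 8, 35, 20, 322426, tzinfo=datetime.timezone.utc
            ),
        )
    ],
)
print(json.dumps(example, indent=2))
```

The output matches the second format suggested above, with an added `release_date` per version.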

johnmhoran commented 8 months ago

Thanks @pombredanne . 👍

johnmhoran commented 8 months ago

I'm working on the versions command (see above comments).

My CLI code detects the empty list and displays a message in the terminal (There was an error with your '{purl}' query. Make sure that '{purl}' actually exists in the relevant repository.) -- but I'd like to prevent the fetchcode 404 error message from also being displayed in the terminal as is currently the case.

Is there some way to do this?

johnmhoran commented 8 months ago

More info:

The fetchcode error is displayed in the terminal each time one of these two variables is defined in the code (they produce an empty list):

results = list(versions(purl))
results = list(router.process(purl))

These, otoh, do not invoke a fetchcode error displayed in the terminal, and each produces a generator object.

test01 = versions(purl)
test02 = router.process(purl)
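
The difference is most likely that generators are lazy: calling the function only creates the generator, and its body (including any logging) runs once something iterates it, as `list()` does. A small sketch with a stand-in `versions` generator (not the fetchcode function) and a hypothetical logger name:

```python
import logging

logger = logging.getLogger("demo")  # hypothetical logger name

calls = []


def versions(purl):
    # Stand-in generator, not fetchcode's function: the body does not run
    # until the generator is iterated.
    calls.append(purl)
    logger.error("404 for %s", purl)
    return
    yield  # unreachable, but makes this function a generator


lazy = versions("pkg:pypi/does-not-exist")
calls_before = list(calls)  # still empty: nothing has run yet
results = list(lazy)        # iteration runs the body and logs the error
calls_after = list(calls)

print(calls_before, results, calls_after)
```

If fetchcode reports the 404 through the standard `logging` module, raising the level of its logger from the CLI (for example `logging.getLogger("fetchcode").setLevel(logging.CRITICAL)`) might hide the message; the logger name here is an assumption.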
johnmhoran commented 8 months ago

Actually, I should be able to use 'validate' and display a message to the user for each PURL for which 'validate' returns "exists": false ....

pombredanne commented 8 months ago

@johnmhoran I would not worry too much about the CLI output for now, as long as the JSON is correct. If fetchcode displays an error message, then that's an issue there, not here ... @TG1999 @keshav-space

TG1999 commented 8 months ago

@pombredanne so shall we remove https://github.com/nexB/fetchcode/blob/d0a3fa9bb56dc3a77f7d3d7bd5b8d0e40c7a8612/src/fetchcode/package_versions.py#L523 the logger and raise errors instead?

johnmhoran commented 8 months ago

@pombredanne @JonoYang I'm close to being ready to commit and push my latest purlcli.py and test_purlcli.py. All 42 tests pass (3 test classes, 1 for each current command/service, e.g., class TestPURLCLI_validate(object), and each is parametrized, thus my use of object as argument per my research -- TestCase and FileBasedTesting seem to be incompatible with @pytest.mark.parametrize()).

I ran make test, expecting just 1 failure as in the past, but this time, 2 failed.

FAILED minecode/tests/test_maven.py::MavenEnd2EndTest::test_visit_and_map_with_index - AssertionError: Lists differ: [{'ur[31 chars]ven2/cnuernber/dtype-next/0.4.2/dtype-next-0.4[49087 chars]one}] != [{'ur[31 chars]ven2/.index/nexus-maven-repository-index.532.g[49087 chars]one}]

FAILED minecode/tests/test_ls.py::ParseDirectoryListingTest::test_parse_listing_from_lslr - AssertionError: Lists differ: [{'pa[1527 chars] '2023-01', 'target': None}, {'path': 'dists/e[974 chars]one}] != [{'pa[1527 chars] '2024-01', 'target': None}, {'path': 'dists/e[974 chars]one}]

No idea why, no reason to think this results from my work, but who knows? test_visit_and_map_with_index has failed with make test since I first cloned the repo. test_parse_listing_from_lslr is a new failure.

Unless you suggest otherwise, I'm going to vet my code and tests for a final cleanup, commit and push. ;-)

johnmhoran commented 8 months ago

Just committed and pushed.

JonoYang commented 8 months ago

@johnmhoran I wouldn't worry about the test_parse_listing_from_lslr failure for now. This test fails every so often due to changes in file dates when the test is run. I will make a PR to revisit this test or remove it.

johnmhoran commented 8 months ago

Great -- thank you @JonoYang . 👍

pombredanne commented 8 months ago

No need to extend object with your class. This is the default.
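
A quick check in Python 3 (class names here are illustrative only):

```python
class ImplicitBase:
    pass


class ExplicitBase(object):
    pass


# Both classes end up with object as their only base class.
same_bases = ImplicitBase.__bases__ == ExplicitBase.__bases__ == (object,)
print(same_bases)
```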

johnmhoran commented 8 months ago

Thanks @pombredanne -- I wondered about that. I got the idea from /nexb/purldb/etc/scripts/test_utils_pip_compatibility_tags.py.

pombredanne commented 8 months ago

@johnmhoran

test_utils_pip_compatibility_tags.py

This https://github.com/nexB/purldb/blob/main/etc/scripts/test_utils_pip_compatibility_tags.py is old code from old pip that was designed originally for Python 2.6.... in general the etc/script code (or code that is vendored like this https://github.com/nexB/purldb/blob/main/etc/scripts/test_utils_pip_compatibility_tags.py#L3 ) may not be the best example to follow.

johnmhoran commented 8 months ago

Thanks @pombredanne . That was not evident, and there were only a few parametrize examples in purldb. The other example you gave me did not use test classes, which imho are needed to allow the tests for a particular command/service to be run on their own if the user wishes.

johnmhoran commented 8 months ago

I have a few questions re @pombredanne’s description of the next command/service I'm adding -- urls. Looking at part of the description above (https://github.com/nexB/purldb/issues/247#issuecomment-1875899523),

urls: given a PURL, return a list of [{URL type: URL}, ...] as in [{"homepage_url": "https:example.com"}, {"vcs_url": "...."}] and various download URLs. Use the packageurl library for this (purl2url) and this will need updating as needed, and use as well scancode-toolkit packagedcode or code in dejacode.

2 questions:

I've already included the URLs currently handled by purl2url (though I think I need to change the urls value to a list of dictionaries rather than a single dict). An example:

[
    {
        "purl": "pkg:rubygems/bundler@2.3.23",
        "urls": {
            "repo_url": "https://rubygems.org/gems/bundler/versions/2.3.23",
            "download_url": "https://rubygems.org/downloads/bundler-2.3.23.gem",
            "inferred_urls": [
                "https://rubygems.org/gems/bundler/versions/2.3.23",
                "https://rubygems.org/downloads/bundler-2.3.23.gem"
            ],
            "repo_download_url": null,
            "repo_download_url_by_package_type": null,
            "url": "https://rubygems.org/gems/bundler/versions/2.3.23"
        }
    }
]

The description also refers to using scancode-toolkit packagedcode or code in dejacode and includes as an example {"vcs_url": "...."}, which I take as a reference to a scan output, e.g.,

"homepage_url": null,
"download_url": null,
"size": null,
"sha1": null,
"md5": null,
"sha256": null,
"sha512": null,
"bug_tracking_url": null,
"code_view_url": null,
"vcs_url": null,
johnmhoran commented 8 months ago

"urls" is now an alphabetized list of the initial set of purl2url URLs (indent reduced from 4 to 2 -- is there a preference/best practice?).

[
  {
    "purl": "pkg:rubygems/bundler@2.3.23",
    "urls": [
      {
        "download_url": "https://rubygems.org/downloads/bundler-2.3.23.gem"
      },
      {
        "inferred_urls": [
          "https://rubygems.org/gems/bundler/versions/2.3.23",
          "https://rubygems.org/downloads/bundler-2.3.23.gem"
        ]
      },
      {
        "repo_download_url": null
      },
      {
        "repo_download_url_by_package_type": null
      },
      {
        "repo_url": "https://rubygems.org/gems/bundler/versions/2.3.23"
      },
      {
        "url": "https://rubygems.org/gems/bundler/versions/2.3.23"
      }
    ]
  }
]
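
For reference, a sketch of how a single dict of URLs can be turned into this alphabetized list of one-key dicts. The helper name `urls_as_sorted_list` is hypothetical and only a subset of the fields is shown:

```python
import json


def urls_as_sorted_list(urls):
    """Turn {"name": url, ...} into [{"name": url}, ...] sorted by name."""
    return [{key: urls[key]} for key in sorted(urls)]


urls = {
    "repo_url": "https://rubygems.org/gems/bundler/versions/2.3.23",
    "download_url": "https://rubygems.org/downloads/bundler-2.3.23.gem",
    "url": "https://rubygems.org/gems/bundler/versions/2.3.23",
}
entry = {
    "purl": "pkg:rubygems/bundler@2.3.23",
    "urls": urls_as_sorted_list(urls),
}
print(json.dumps([entry], indent=2))
```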
johnmhoran commented 8 months ago

Re using scancode-toolkit packagedcode, this seems to be an example of a utils.py function I could call from the new urls Click command code itself to retrieve the vcs_url: def normalize_vcs_url(repo_url, vcs_tool=None).

Looking at a few of the other numerous SCTK output URLs, they seem to be distributed across various files, often handling different types in separate files. Hopefully there's a quasi-centralized way to ID the code for all relevant URLs, and maybe call all as needed from the urls Click functions? And maybe the relevant DejaCode code is also readily accessible? ;-)

johnmhoran commented 8 months ago

@pombredanne @JonoYang When you have time, please take a look at my questions from last Friday evening (and several comments that follow) re next steps with the urls command.

Meanwhile, since I'm not sure how/where to get the ScanCode-Toolkit or DejaCode URL-related code referred to in the urls command description, I've taken another look at our nascent purlcli meta command, which I see has a long list of various URLs and which I can access from within the new urls command.

One observation: the results from the meta command, which calls fetchcode.package.info(),

(1) seem to always include a full list of dictionary objects, one for each version (whether or not the PURL passed to the command includes a version) and

(2) if the PURL passed to the command includes a version, the list will contain two dictionaries with the nested purl (inside the metadata field) equal to the queried PURL -- BUT the first of these will have "download_url": null while the second will have an actual download_url when available. For example, a meta command query for "pkg:pypi/scancode-toolkit@2.0.0" will include "download_url": "https://files.pythonhosted.org/packages/41/31/ec6c58f3fa60181803265410b4ddb3abae1214c946e36969fa0ce9fab014/scancode_toolkit-2.0.0-py2-none-any.whl", in the second but not first matching dictionary.

For use in the urls command it looks like I need to use the PURL w/o any version info so that when the urls command is run on a PURL with a version, I can find the correct returned dictionary that has an actual download_url (when available) and not merely a null value. Splitting on @ seems to do the trick.
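
A rough sketch of that idea, using hypothetical helpers `strip_version` and `pick_entry` and plain string handling. Parsing with `PackageURL.from_string` from the packageurl library would likely be more robust, since a bare split assumes any `@` in the PURL is the version separator:

```python
def strip_version(purl):
    """Return the PURL without version, qualifiers or subpath."""
    # Qualifiers ("?...") and subpath ("#...") come after the version.
    core = purl.split("#", 1)[0].split("?", 1)[0]
    base, sep, _version = core.rpartition("@")
    return base if sep else core


def pick_entry(entries, purl):
    """Prefer the entry for `purl` that has a non-null download_url."""
    matches = [entry for entry in entries if entry.get("purl") == purl]
    for entry in matches:
        if entry.get("download_url"):
            return entry
    return matches[0] if matches else None


entries = [
    {"purl": "pkg:pypi/scancode-toolkit@2.0.0", "download_url": None},
    {"purl": "pkg:pypi/scancode-toolkit@2.0.0", "download_url": "https://example.com/file.whl"},
]
print(strip_version("pkg:pypi/scancode-toolkit@2.0.0"))
print(pick_entry(entries, "pkg:pypi/scancode-toolkit@2.0.0"))
```

The `example.com` URL is a placeholder, not real metadata.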

johnmhoran commented 8 months ago

Here's an example with meta run on pkg:pypi/scancode-toolkit@2.0.0 -- these are the first 2 dictionaries, each with purl (nested inside metadata) set to the PURL and version, but only the second with a download_url and an actual value rather than null:

(venv) Mon Jan 22, 2024 01:40 PM  /home/jmh/dev/nexb/purldb jmh (247-purl-cli-add-urls)
$ python -m purldb_toolkit.purlcli meta --purl pkg:pypi/scancode-toolkit@2.0.0 --output -
[
    {
        "purl": "pkg:pypi/scancode-toolkit@2.0.0",
        "metadata": [
            {
                "type": "pypi",
                "namespace": null,
                "name": "scancode-toolkit",
                "version": "2.0.0",
                "qualifiers": {},
                "subpath": null,
                "primary_language": null,
                "description": null,
                "release_date": null,
                "parties": [],
                "keywords": [],
                "homepage_url": "https://github.com/nexB/scancode-toolkit",
                "download_url": null,
                "api_url": "https://pypi.org/pypi/scancode-toolkit/json",
                "size": null,
                "sha1": null,
                "md5": null,
                "sha256": null,
                "sha512": null,
                "bug_tracking_url": null,
                "code_view_url": null,
                "vcs_url": null,
                "copyright": null,
                "license_expression": null,
                "declared_license": "Apache-2.0 AND CC-BY-4.0 AND LicenseRef-scancode-other-permissive AND LicenseRef-scancode-other-copyleft",
                "notice_text": null,
                "root_path": null,
                "dependencies": [],
                "contains_source_code": null,
                "source_packages": [],
                "purl": "pkg:pypi/scancode-toolkit@2.0.0",
                "repository_homepage_url": null,
                "repository_download_url": null,
                "api_data_url": null
            },
            {
                "type": "pypi",
                "namespace": null,
                "name": "scancode-toolkit",
                "version": "2.0.0",
                "qualifiers": {},
                "subpath": null,
                "primary_language": null,
                "description": null,
                "release_date": null,
                "parties": [],
                "keywords": [],
                "homepage_url": "https://github.com/nexB/scancode-toolkit",
                "download_url": "https://files.pythonhosted.org/packages/41/31/ec6c58f3fa60181803265410b4ddb3abae1214c946e36969fa0ce9fab014/scancode_toolkit-2.0.0-py2-none-any.whl",
                "api_url": "https://pypi.org/pypi/scancode-toolkit/json",
                "size": null,
                "sha1": null,
                "md5": null,
                "sha256": null,
                "sha512": null,
                "bug_tracking_url": null,
                "code_view_url": null,
                "vcs_url": null,
                "copyright": null,
                "license_expression": null,
                "declared_license": "Apache-2.0 AND CC-BY-4.0 AND LicenseRef-scancode-other-permissive AND LicenseRef-scancode-other-copyleft",
                "notice_text": null,
                "root_path": null,
                "dependencies": [],
                "contains_source_code": null,
                "source_packages": [],
                "purl": "pkg:pypi/scancode-toolkit@2.0.0",
                "repository_homepage_url": null,
                "repository_download_url": null,
                "api_data_url": null
            },
JonoYang commented 8 months ago

@johnmhoran

where would I find the relevant scancode-toolkit packagedcode and the code in dejacode?

There are functions that generate download urls for packages based on type, namespace, name, version, etc in scancode-toolkit/packagedcode. For example, https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/maven.py#L1135

does the reference to need updating mean that I'll adapt the SCTK/DJC code to this PURL CLI by updating purl2url in the packageurl-python repo?

I'm not sure what @pombredanne intends, but my guess is that we should have new functions that handle more packages in https://github.com/package-url/packageurl-python/blob/main/src/packageurl/contrib/purl2url.py . For example, we do not handle maven purls in purl2url, so something like build_maven_repo_url and build_maven_download_url would be needed.
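
As an illustration only, here is what such functions might look like. The function names come from the suggestion above; the signatures and bodies are hypothetical, take plain strings rather than a PackageURL object, and assume the standard Maven Central repository layout:

```python
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def build_maven_repo_url(namespace, name, version, base_url=MAVEN_CENTRAL):
    """Return the repository directory URL for one Maven artifact version."""
    # The group id (PURL namespace) maps to a path: org.foo -> org/foo
    group_path = namespace.replace(".", "/")
    return f"{base_url}/{group_path}/{name}/{version}/"


def build_maven_download_url(
    namespace, name, version, extension="jar", classifier=None, base_url=MAVEN_CENTRAL
):
    """Return the download URL for the artifact file itself."""
    suffix = f"-{classifier}" if classifier else ""
    filename = f"{name}-{version}{suffix}.{extension}"
    return build_maven_repo_url(namespace, name, version, base_url) + filename


print(build_maven_repo_url("org.apache.commons", "commons-lang3", "3.12.0"))
print(build_maven_download_url("org.apache.commons", "commons-lang3", "3.12.0"))
```

The existing code in scancode-toolkit/packagedcode linked above would be the place to check how classifiers and other packaging types are really handled.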

If so, just to be clear, that would mean each set of purl2url updates would need to be committed, pushed, and the PR opened, finished and merged before I could then use that update in the PURL CLI tool. Is that correct?

Yes, though you could do an editable install of your dev purl2url into your purldb-toolkit venv to try out your new functions without creating a release.

"urls" is now an alphabetized list of the initial set of purl2url URLs (indent reduced from 4 to 2 -- is there a preference/best practice?).

I don't have a preference, but an indent of 2 might be better for display in a shell

pombredanne commented 8 months ago

@JonoYang @johnmhoran this makes sense. @johnmhoran an extra step could be to check if the URLs do exist using a "head" request ... we may have examples in various places.

pombredanne commented 8 months ago

re: https://github.com/nexB/purldb/issues/247#issuecomment-1904873463 @johnmhoran why nesting the results under a metadata attribute that also contains the purl? IMHO instead just report a list of mappings directly, and we could move the purl up as the 1st attribute so may be:

[
            {
                "purl": "pkg:pypi/scancode-toolkit@2.0.0",
                "type": "pypi",
                "namespace": null,
                "name": "scancode-toolkit",
                "version": "2.0.0",
                "qualifiers": {},
                "subpath": null,
                "primary_language": null,
....
            },
            {
                "purl": "pkg:pypi/scancode-toolkit@2.0.0",
                "type": "pypi",
                "namespace": null,
                "name": "scancode-toolkit",
                "version": "2.0.0",
                "qualifiers": {},
                "subpath": null,
......
             },
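
One way to get this ordering, sketched with a hypothetical `purl_first` helper that relies on Python dicts preserving insertion order:

```python
def purl_first(package):
    """Return a copy of the mapping with "purl" as the first key."""
    reordered = {"purl": package.get("purl")}
    reordered.update(
        (key, value) for key, value in package.items() if key != "purl"
    )
    return reordered


package = {
    "type": "pypi",
    "name": "scancode-toolkit",
    "version": "2.0.0",
    "purl": "pkg:pypi/scancode-toolkit@2.0.0",
}
print(list(purl_first(package)))
```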
johnmhoran commented 8 months ago

Thanks @JonoYang and @pombredanne . Does this mean that for the meta command it's sufficient to keep using fetchcode.package.info(), with its own download_url and other URLs, but for the urls command, I should be exclusively using -- and there beefing up -- purl2url?

Re a head request, I've begun a little exploring and see, for example, that https://www.nexb.com returns <Response [301]> while https://nexb.com returns <Response [200]>. The status code definitions are detailed and voluminous -- how does one determine the relationship between the response code and our own response (JSON field/terminal message)?

>>> import requests
>>> x = requests.head('https://www.nexb.com')
>>> print(f'x = {x}')
x = <Response [301]>
>>> print(f'x.headers = {x.headers}')
x.headers = {'Server': 'nginx', 'Date': 'Sat, 20 Jan 2024 04:55:12 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Connection': 'keep-alive', 'Expires': 'Sat, 20 Jan 2024 05:55:12 GMT', 'Cache-Control': 'max-age=3600, public, max-age=86400', 'X-Redirect-By': 'WordPress', 'Location': 'https://nexb.com/', 'X-Cache-Status': 'MISS', 'Strict-Transport-Security': 'max-age=31536000;', 'X-XSS-Protection': '1; mode=block', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Content-Security-Policy': "default-src * 'unsafe-inline' 'unsafe-eval' data: blob:;"}
>>> z = requests.head('https://nexb.com')
>>> print(f'z = {z}')
z = <Response [200]>
>>> print(f'z.headers = {z.headers}')
z.headers = {'Server': 'nginx', 'Date': 'Sat, 20 Jan 2024 04:58:59 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Last-Modified': 'Thu, 18 Jan 2024 20:12:55 GMT', 'X-Cache-Status': 'HIT', 'Strict-Transport-Security': 'max-age=31536000;', 'X-XSS-Protection': '1; mode=block', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Content-Security-Policy': "default-src * 'unsafe-inline' 'unsafe-eval' data: blob:;", 'Cache-Control': 'public, max-age=86400', 'Content-Encoding': 'gzip'}
>>>
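
Note that `requests.head` does not follow redirects unless `allow_redirects=True` is passed, which is why the `www` address reports 301 here. One possible way to reduce status codes to a few outcomes is sketched below. It uses the standard library's `urllib` instead of `requests` (urllib follows redirects by default), and the `classify_status` labels are hypothetical, not an agreed design:

```python
import urllib.error
import urllib.request


def classify_status(status_code):
    """Map an HTTP status code (or None) to a coarse, hypothetical label."""
    if status_code is None:
        return "unreachable"
    if 200 <= status_code < 300:
        return "exists"
    if 300 <= status_code < 400:
        return "redirected"
    if status_code in (404, 410):
        return "missing"
    return "unknown"


def head_status(url, timeout=10):
    """Send a HEAD request and return the final status code, or None."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as HTTPError; keep the code.
        return error.code
    except urllib.error.URLError:
        # DNS failure, refused connection, and similar network errors.
        return None


for code in (200, 301, 404, 500, None):
    print(code, classify_status(code))
```

A call such as `classify_status(head_status(url))` could then feed a JSON field or terminal message.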
johnmhoran commented 8 months ago

@pombredanne Just saw your question re the structure of my meta command data. Thank you. I'll make the change.

johnmhoran commented 8 months ago

There are currently URL fields in the data returned by the meta and urls commands. On which URL fields do we want to run a head request? And do we want to sync the URL values that appear in the data returned by two or more commands?

johnmhoran commented 8 months ago

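@pombredanne @JonoYang A related question -- do we want to rely solely on the fetchcode.package.info() function (and the other package.py functions) for the data meta returns, and solely on the packageurl.contrib.purl2url code for the data urls returns?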

If I understand the accumulating design details (always welcome -- keep them coming!) I should start adding my code in purl2url, PURL type and URL type by PURL type and URL type?

The next question of course is -- what URL fields? I think we need a comprehensive list of URLs from the meta output, the SCTK output, and whatever other output/code I can find -- I don't think we have such a list, or of what is reported where. Once we have the list, you and others can decide what URL fields to add to purl2url.

And -- what list of PURL types, with what priorities, for purl2url? Happy to make the list myself but might we have one already as part of our ongoing work? This PURL CLI project seems that it could use a bit more organization than we have atm.... ;-)

JonoYang commented 8 months ago

@johnmhoran

do we want to rely solely on the fetchcode.package.info() function (and the other package.py functions) for the data meta returns, and solely on the packageurl.contrib.purl2url code for the data urls returns?

Not sure, I would start by keeping them separate for now. @pombredanne do you have a suggestion?

If I understand the accumulating design details (always welcome -- keep them coming!) I should start adding my code in purl2url, PURL type and URL type by PURL type and URL type?

Yes. I think we will want to eventually move the functions that generate repo and download urls from scancode-toolkit/packagedcode to purl2url

The next question of course is -- what URL fields?

What do you mean by url fields? homepage url, download url, vcs url, etc?

what list of PURL types, with what priorities

I would start by adding support for maven purls to purl2url, and then see what other package types are missing from purl2url that we have available at scancode-toolkit/packagedcode

johnmhoran commented 8 months ago

@JonoYang Thank you.

==> as I continue working on a branch in my local purldb repo, how do I make and access changes to purl2url? In my purldb branch, I can open that file at /home/jmh/dev/nexb/purldb/venv/lib/python3.8/site-packages/packageurl/contrib/purl2url.py. Is that where I should do my initial development? How would that be handled in the next PR I create from my current purldb branch?

johnmhoran commented 8 months ago

What I really mean to ask is: how can I work on code in the repo holding purl2url at the same time I'm working on a branch in the purldb repo and (with pip install -e . I presume) access the former from the latter?

JonoYang commented 8 months ago

@johnmhoran

You will be able to make changes to the code. I think the code navigation stuff, where you ctrl+click the function names and import statements, will lead you to your local checkout of packageurl-python.

Just remember, if you clean the purldb repo with make clean, or ./configure --clean, and then run make dev, you will need to go through the steps above again.

johnmhoran commented 8 months ago

@JonoYang I haven't seen any clean references in the purldb readme and no doc'n -- when do I run make clean or ./configure --clean, and when in my work do I run make dev? Re make dev, I think I'd run whenever I merged an updated main, but not for changes in my local purldb repo.

What about changes in this new packageurl-python repo that I need to clone etc.? Do those trigger the need to run make dev in purldb? In packageurl-python?

JonoYang commented 8 months ago

@johnmhoran

I run make clean and make dev when things break, or if dependencies have changed in the project.

What about changes in this new packageurl-python repo that I need to clone etc.? Do those trigger the need to run make dev in purldb? In packageurl-python?

You would run make dev in purldb if you haven't done so already. Since you already have it set up, you can just follow the instructions above to install packageurl-python in editable mode.

johnmhoran commented 8 months ago

Thank you @JonoYang -- this is very helpful, and I'm happy to say that when I cloned purldb last month, my steps were

git clone git@github.com:nexB/purldb.git
cd purldb
make dev
make envfile
make postgres
make test
git checkout -b 247-create-purl-cli-tool

so all is good.

johnmhoran commented 8 months ago

Opened a new issue in packageurl-python -- Add support for additional packages in purl2url #143 .

johnmhoran commented 8 months ago

@JonoYang After cloning packageurl-python I ran

    make dev
    make test
    git checkout -b 143-add-purl2url-package-support

and was about to activate the virtual environment, but I see no venv -- although there is a pyvenv.cfg that contains

home = /usr
implementation = CPython
version_info = 3.8.10.final.0
virtualenv = 20.14.1
include-system-site-packages = false
base-prefix = /usr
base-exec-prefix = /usr
base-executable = /usr/bin/python3

Do you know whether there is a virtual environment and if so how to activate?

JonoYang commented 8 months ago

@johnmhoran looking at the makefile for packageurl-python (https://github.com/package-url/packageurl-python/blob/main/Makefile#L33), it looks like it installed the virtual env stuff in the root of the project. I think you should be able to activate the venv by doing source bin/activate.

johnmhoran commented 8 months ago

Thank you @JonoYang -- virtual env activated. 👍

johnmhoran commented 8 months ago

@pombredanne Re your comment why nesting the results under a metadata attribute that also contains the purl? IMHO instead just report a list of mappings directly, and we could move the purl up as the 1st attribute ... :

The meta command, which uses fetchcode.package.info(), returns a list of dictionaries, one for each version of the input PURL (if it has no version) plus a preliminary dictionary for a version-less PURL. If the input PURL has a version, same output except the initial dictionary names the input PURL and version but has a different download_url value (if any). (Don't know why we have this initial dictionary -- maybe meant to be a generic set of metadata for the PURL?)

Thus, if we want the output dict/JSON to identify what the command's input PURL was, we need the 'purl' field where it is now -- if we remove it, we'll just have a list of dictionaries for all versions.

Of course, this might be enough -- that's a design question. So, do we want the output to identify the input PURL (including version if any), or just display the list of metadata dictionaries, version by version?

BTW, versions currently also has the output identify the command's input PURL -- we probably want both versions and meta to either identify or not identify the input PURL, i.e., consistent structure. (validate already identifies the input PURL in the output structure, consistent with your suggested structure for meta.)

pombredanne commented 8 months ago

@johnmhoran On second thoughts you should reuse the same format as scancode toolkit. So here this would be

  1. the input PURL with or without versions goes into a header
  2. the metadata as a list under "packages"
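
A possible mock-up of that structure, loosely modeled on the headers block of a ScanCode scan. The `build_output` helper and every field name inside `headers` are assumptions for discussion, not an agreed format:

```python
import json


def build_output(input_purl, packages, command="meta"):
    """Wrap per-version metadata in a header plus "packages" structure."""
    return {
        "headers": [
            {
                "tool_name": "purlcli",  # hypothetical field and value
                "command": command,      # hypothetical field
                "purl": input_purl,      # the input PURL, with or without version
            }
        ],
        "packages": list(packages),
    }


output = build_output(
    "pkg:pypi/scancode-toolkit@2.0.0",
    [{"purl": "pkg:pypi/scancode-toolkit@2.0.0", "type": "pypi", "version": "2.0.0"}],
)
print(json.dumps(output, indent=2))
```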
johnmhoran commented 8 months ago

@pombredanne I'm not clear on what you want this to apply to and what the structure would look like. Is this meant to apply only to the meta output, and not the output from validate, versions, urls or any of the other commands we'll be adding?

It might be useful to examine the current output from all 4 current commands (urls is just underway and will involve work on the packageurl-python purl2url.py file concurrently with work on the purldb purlcli.py file). I have a call coming up but afterwards will upload a file here with output from all 4 commands so we can make decisions with the actual data and structure in front of us.

Once I do that, I will mock up the revised meta output structure to what I think your prior comment proposes and paste that here so we can discuss/OK/change etc.

Last point: you also asked the meta structure to be changed to put the nested purl at the top -- does that still apply? That structure comes from the fetchcode info() function -- maybe that's where we should change the order, not in purlcli.py?

johnmhoran commented 8 months ago

@pombredanne @JonoYang. Uploading examples of console outputs from the 4 current commands, with and without versions, i.e.,

python -m purldb_toolkit.purlcli validate --purl pkg:pypi/fetchcode --output -
python -m purldb_toolkit.purlcli validate --purl pkg:pypi/fetchcode@0.1.0 --output -
python -m purldb_toolkit.purlcli versions --purl pkg:pypi/fetchcode --output -
python -m purldb_toolkit.purlcli versions --purl pkg:pypi/fetchcode@0.1.0 --output -
python -m purldb_toolkit.purlcli meta --purl pkg:pypi/fetchcode --output -
python -m purldb_toolkit.purlcli meta --purl pkg:pypi/fetchcode@0.1.0 --output -
python -m purldb_toolkit.purlcli urls --purl pkg:pypi/fetchcode --output -
python -m purldb_toolkit.purlcli urls --purl pkg:pypi/fetchcode@0.1.0 --output -

purlcli.py-command-output-examples-2024-01-23-01.txt