aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and others generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/

Create "data dictionary" for all SCTK fields #2008

Open mjherzog opened 4 years ago

mjherzog commented 4 years ago

We need a comprehensive "data dictionary" of all fields that may be present in a ScanCode output file. The minimum requirement is a list of fields with type (single value, list or ?) and description. This should, of course, be versioned with SCTK releases. We have some of this information for the CSV output at https://scancode-toolkit.readthedocs.io/en/latest/cli-reference/output-format.html#custom-output-format (which may not be current), but we do not have it for the full set of output fields in the JSON output. There is currently no single file in the codebase that defines all fields (the way Django models would), but two files seem to contain most of the field definitions:

A first step should be to investigate whether there are existing automated documentation tools for Python that would help us get started.

This Issue supersedes #112 which is pretty stale at this point.

AyanSinhaMahapatra commented 3 years ago

Some comments from the gitter/discuss channel that are important here:

@pombredanne

the thing is that the output of scancode should be (self) documenting from the models. We are using mostly attrs classes so imho the plan could be:

  1. ensure that we are using @hynek https://github.com/python-attrs/attrs/ attrs classes all the way possibly adopting @Tinche https://github.com/Tinche/cattrs for nested types

  2. define/design a simple way to add a docstring of sorts to each model and attribute ... there is some example on that here https://github.com/nexB/scancode-toolkit/blob/d9ae6e62ebad6a896cf5b58185d833302e95c72d/src/packagedcode/models.py#L143 using this https://github.com/nexB/commoncode/blob/fbe882da6c03352c8043cdd45c72b7ca44239e6d/src/commoncode/datautils.py#L45

There is also a related ticket that I have been tracking: python-attrs/attrs#357

  3. have/create a way to get all that integrated into sphinx (possibly with custom extensions). Publish that as part of the doc publication

  4. Enjoy and relax reading a beautiful doc :P
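As a rough illustration of the plan above (a description attached to each attribute, then collected from the model), here is a minimal sketch. It uses the standard-library dataclasses module as a stand-in for attrs; attrs fields accept a metadata mapping in a similar way. The MatchedRule model, the documented() helper, the "help" metadata key and the description strings are all hypothetical placeholders, not the actual ScanCode or commoncode code.

```python
from dataclasses import dataclass, field, fields


def documented(default, help_text):
    """Hypothetical helper: a dataclass field that carries its own description."""
    return field(default=default, metadata={"help": help_text})


@dataclass
class MatchedRule:
    """Illustrative model only; not the real ScanCode class."""

    identifier: str = documented(None, "Placeholder description of identifier.")
    rule_length: int = documented(0, "Placeholder description of rule_length.")
    # No description: treated as internal and left out of the data dictionary.
    _cache: dict = field(default=None, repr=False)


def data_dictionary(model):
    """Return (name, type name, description) rows for documented fields only."""
    rows = []
    for f in fields(model):
        help_text = f.metadata.get("help")
        if help_text:
            type_name = getattr(f.type, "__name__", str(f.type))
            rows.append((f.name, type_name, help_text))
    return rows


for name, type_name, help_text in data_dictionary(MatchedRule):
    print(f"{name} ({type_name}): {help_text}")
```

Rows like these could then be rendered into a documentation page by a Sphinx extension or a small script; only attributes that carry a description are included, which matches the goal of documenting output fields selectively.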

Check also this older approach https://github.com/nexB/scancode-toolkit/tree/d9ae6e62ebad6a896cf5b58185d833302e95c72d/etc/scripts/sch2js

This was taking the generated JSON and reversing a JSON schema from that. That could be OK too as an approach.
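The idea of reversing a schema from generated JSON can be sketched as below. This is not the sch2js script itself, only a minimal illustration under the assumption that lists are homogeneous; infer_schema is a hypothetical name and the sample values are taken from a license entry like the ones ScanCode emits.

```python
import json


def infer_schema(value):
    """Return a minimal JSON-Schema-like dict describing `value`."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return {"type": "boolean"}
    if value is None:
        return {"type": "null"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if isinstance(value, str):
        return {"type": "string"}
    if isinstance(value, list):
        # Assumption: lists are homogeneous, so the first item stands for all.
        schema = {"type": "array"}
        if value:
            schema["items"] = infer_schema(value[0])
        return schema
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {k: infer_schema(v) for k, v in value.items()},
        }
    raise TypeError(f"unsupported JSON value: {value!r}")


sample = json.loads(
    '{"key": "mit-old-style", "score": 100.0, '
    '"is_exception": false, "licenses": ["mit-old-style"]}'
)
print(json.dumps(infer_schema(sample), indent=2))
```

A schema inferred this way gives field names and types but no descriptions, so it would cover only part of the data dictionary requirement.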

Tim Hatch @thatch

I don't know where the documentation should live, but sphinx does handle something called "doc comments" -- comments before an assignment that start with #:

This is itself documented at https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-autoattribute
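A small sketch of what such doc comments look like in source code. The class, attribute names and descriptions are illustrative placeholders, not the real ScanCode models, and the module path in the directive comment is hypothetical.

```python
class LicenseMatch:
    """Illustrative class only; not the real ScanCode model."""

    #: Placeholder description of start_line, read by autodoc as its documentation.
    start_line = 0

    #: Placeholder description of end_line.
    end_line = 0

    # A plain comment like this one is not treated as documentation by autodoc.
    _internal_state = None


# In a .rst page, a single attribute could then be pulled in with
# (assuming the class lives in a module named "mymodule"):
#
#   .. autoattribute:: mymodule.LicenseMatch.start_line
```

Because autoattribute targets one attribute at a time, it allows documenting only the attributes that end up in the scan output, without documenting the whole class.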

robinsingh-ai commented 3 years ago

Hey @AyanSinhaMahapatra @pombredanne, I have seen this new project idea about creating docs automatically from ScanCode data. Earlier I worked on a personal Python package that contains minimal and clean example implementations of popular data structures and all major algorithms. To explain how the module can be used I also created docs for it, and you can find them here. In these docs I import the main source files where all the programs are written. For example, in this DP section the main source code is included along with its docstrings with the help of the automodule directive (which includes documentation from docstrings present in the source code; for more info refer to this). With this directive we can easily fetch the source code from the main files and display the docstrings along with the code, and in the same way we can automate the AboutCode documentation.

autodoc is a Sphinx extension that includes documentation from docstrings found in the source code. It provides many built-in directives, such as automodule, autoclass and autoexception.

For more, refer to this.

This way we can parse only the data that we want to show in the docs. The data we want to show is only a small part of the classes; a lot of functions and methods would otherwise be documented as well, which we don't want. So we would extract only the docstrings/class attributes that document a data field and create the docs from those, and in this way automate the docs for ScanCode data.

AyanSinhaMahapatra commented 3 years ago

@robin025 the information about autodoc/automodule/autoclass is known. The issue here is that we want to document attribute members of classes selectively: not all members, only the ones that end up in the result data.

Take an example of this.

Here in licensedcode/models.py, there's a class Rule. We don't want to create documentation based on this class itself. We want to document some of its members, the ones that end up in the result data (after running a license scan with -l). You can look at an example scan result, like the one in the output format docs, to see that the attributes that end up in the result are in the matched_rule section of each detected license.

{
  "path": "samples/zlib/iostream2/zstream.h",
  "type": "file",
  "licenses": [
    {
      "key": "mit-old-style",
      "score": 100.0,
      "name": "MIT Old Style",
      "short_name": "MIT Old Style",
      "category": "Permissive",
      "is_exception": false,
      "owner": "MIT",
      "homepage_url": "http://fedoraproject.org/wiki/Licensing:MIT#Old_Style",
      "text_url": "http://fedoraproject.org/wiki/Licensing:MIT#Old_Style",
      "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:mit-old-style",
      "spdx_license_key": null,
      "spdx_url": "",
      "start_line": 9,
      "end_line": 15,
      "matched_rule": {
        "identifier": "mit-old-style_cmr-no_1.RULE",
        "license_expression": "mit-old-style",
        "licenses": [
          "mit-old-style"
        ],
        "is_license_text": true,
        "is_license_notice": false,
        "is_license_reference": false,
        "is_license_tag": false,
        "matcher": "2-aho",
        "rule_length": 71,
        "matched_length": 71,
        "match_coverage": 100.0,
        "rule_relevance": 100
      }
    }
  ],

Now, we aren't documenting functions or entire classes as they exist in scancode now. We have to document these class attributes selectively, across all of scancode and its plugins.

So, if we use the autodoc extension, we would have to write one new class for each of the attributes that we want to document (there are a lot of them), write the docs in its docstring, and then use autoclass to collect all the docs from there. I'm not opposed to this, but we should consider all the options. As @thatch suggested above, autoattribute is a much better fit than autoclass here, if this method is preferred at all.

The documentation generation part would be easier this way, because nothing extra has to be done to collect the docs. However, the original suggestion above yours is cleaner code-wise and seems the better way to me, though some exploration is needed on the collection and docs side.

So consider the method suggested in the comment above by @pombredanne, generate some scan results to see where the data for these attributes is located, and let us know if you have any questions there.
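One way to explore where the attributes are located in a scan result is to list every field path present in the JSON. The following is a minimal sketch; field_paths is a hypothetical helper and the input is a trimmed-down version of the file entry shown above, not a complete scan.

```python
import json


def field_paths(value, prefix=""):
    """Yield a dotted path for every key in nested JSON data; list levels are marked with []."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from field_paths(child, path)
    elif isinstance(value, list):
        for item in value:
            yield from field_paths(item, prefix + "[]")


scan_file = json.loads("""
{
  "path": "samples/zlib/iostream2/zstream.h",
  "licenses": [
    {"key": "mit-old-style",
     "matched_rule": {"identifier": "mit-old-style_cmr-no_1.RULE", "rule_length": 71}}
  ]
}
""")

# Deduplicate while keeping first-seen order.
paths = list(dict.fromkeys(field_paths(scan_file)))
for path in paths:
    print(path)
```

The resulting list of paths can be compared against whatever fields are already documented to find the ones still missing from the data dictionary.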